There is one thing engineers hate the most: waking up at night to troubleshoot production incidents. What if I told you I know how to make it less painful?
Nobody wants to be on-call. Back in the old days of desktop applications, on-call duty was reserved mainly for network and infrastructure engineers, who were responsible for keeping the internet working at all times. In today's world, where every single app is a SaaS served via the only operating system you will ever need (the browser), on-call duty has spread to software engineers as well.
It doesn’t matter whether you are a frontend or a backend engineer, junior or senior: chances are you will eventually find yourself part of an on-call rotation.
What does it mean to be paged?
Before I move on to your broken alerts, let’s set the record straight on what being paged means. Paging an engineer means that your system’s performance has degraded. This might be a total degradation, like users being unable to access your system, or a partial degradation, like being unable to send registration emails.
From a business perspective, this usually means that your business is losing money. Every time your customers experience service degradation, they can demand compensation in accordance with your SLA. And if you are unable to acquire new customers, they will switch to a different service, depriving you of their money.
From an engineering perspective, this means immense pressure. During an on-call page, the number one priority of the paged engineer is to get the system back up and running as fast as possible. This is not the time for nice solutions or a major refactor. This is the time to move fast, and with confidence. Keep that in mind; we will circle back to it later.
Why are you being paged?
If we look at all the reasons for paging an on-call engineer, we can group them into two main categories: manual and automatic.
Manual paging is self-explanatory. Your customer calls customer support and reports a problem they are having. The support representative, who usually lacks technical knowledge of the system’s internals, goes through a series of predefined steps to either mitigate or escalate the issue. After failing to mitigate it, and upon realizing that the problem is critical, they might decide to page the on-call engineer. Depending on the nature of the organization, the support rep might page a SOC engineer, who might in turn escalate to either an infrastructure or a software engineer. In such organizations, software engineers are typically the last in line to be paged. In smaller, less structured organizations, customer support might page the engineer directly.
With manual paging, there is not a lot to do. Each case is usually unique and requires a thorough investigation by the paged engineer to mitigate the issue. What can be improved, however, are automatic pages.
Automatic paging is triggered by your monitoring system. With proper monitoring in place, you can define alerts that trigger on different events. For example: an alert that triggers when the amount of free disk space drops below a certain point, indicating a potential failure due to an inability to store data. Or an alert that monitors the RAM usage of a machine and triggers at a specific threshold, which might indicate a memory leak that will lead to an out-of-memory exception and, eventually, a painful death of your process at the hands of the operating system.
In my experience, automatic paging is where most people, and most alerting systems, get it wrong.
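To make the disk example concrete, here is a minimal sketch of a threshold rule in Python. The function names and the 90% default are my own assumptions for illustration; a real monitoring system would evaluate a rule like this for you instead of you writing it by hand.

```python
import shutil


def usage_percent(used_bytes: int, total_bytes: int) -> float:
    """Return used capacity as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def should_alert(used_bytes: int, total_bytes: int,
                 threshold_percent: float = 90.0) -> bool:
    """Fire once usage reaches the threshold (90% is an assumed default)."""
    return usage_percent(used_bytes, total_bytes) >= threshold_percent


if __name__ == "__main__":
    # shutil.disk_usage reports total, used, and free bytes for a path.
    usage = shutil.disk_usage("/")
    print(should_alert(usage.used, usage.total))
```

Note that a rule like this only knows that a line was crossed. It says nothing about which resource crossed it or how quickly.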
The missing ingredients from your monitoring alerts
During an on-call trigger, the paged engineer experiences a context switch. A mental context switch, just like the context switch done by OS schedulers, is an extremely expensive and productivity-taxing operation for the human brain. Combined with the fact that from the moment you’ve been paged your company starts, potentially, to lose money, it’s a terrible situation to be in.
The last thing you want to do under such circumstances is to start figuring out what happened. The moment you’ve been paged, you should already know what the issue is. And so, the first missing ingredient is clarity.
Have clarity in your alerts
Imagine the following PagerDuty alert: Database disk reached 90% usage. How clear is this alert? The only information I can gather from the alert itself is that disk space is running out. Here are some of the questions that I have no answers for:
- What database is it? (I could have multiple)
- What region? (I could have multiple)
- How fast is it filling up?
By getting the answers to the first two questions, I eliminate the unnecessary time spent browsing the terrible AWS console for all my databases in an attempt to find the affected one.
By knowing the fill rate, I know what my time constraints are. If I’m going to run out of space in 7 days, maybe I should not be paged at 2 AM. However, if an automatic process got stuck, and in about 1 hour I’m going to run out of space, I’d better hurry up before my entire database crashes.
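The arithmetic behind that decision is simple enough to sketch. The snippet below assumes a constant fill rate, and the 4-hour paging window is an arbitrary number I picked for illustration, not a recommendation; the function names are hypothetical too.

```python
def hours_until_full(free_gb: float, fill_rate_gb_per_hour: float) -> float:
    """Estimate hours until the disk is full, assuming a constant fill rate."""
    if fill_rate_gb_per_hour <= 0:
        # The disk is not filling up, so it never runs out.
        return float("inf")
    return free_gb / fill_rate_gb_per_hour


def should_page_now(free_gb: float, fill_rate_gb_per_hour: float,
                    page_within_hours: float = 4.0) -> bool:
    """Page only when the estimated time left is inside the paging window."""
    return hours_until_full(free_gb, fill_rate_gb_per_hour) <= page_within_hours
```

With 1 GB free and a fill rate of 1 GB per hour, `should_page_now` returns `True`. With 168 GB free at the same rate (roughly 7 days), it returns `False`, and the problem can wait for a ticket in the morning.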
A good alert should have clarity. It should point you to the alerting resource, give you time constraints, and if possible direct you towards the offending service or subsystem. For example: [us-east-1] Database pikachu is filling at a rate of 1 GB per hour and will run out of space in ~1 hour.
I know that I have 1 hour max. I know the exact resource and where it’s located. This alone might help me identify the potential issue without even looking at the code. I might remember, out of the blue, that Jake pushed a new cron job earlier today that writes to this database. And so this might be a potential vector of investigation. This gives me a starting point, as opposed to blindly browsing AWS for all my databases, in all my regions, to find the offending one, while wasting my sleep hours as well as the precious time I’ve been given to resolve the issue.
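Here is a rough sketch of how such an alert title could be assembled from the data the monitoring system already has. `DiskAlert` and its fields are hypothetical names of mine, not part of PagerDuty or any other tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskAlert:
    """Hypothetical container for the context a clear alert needs."""
    region: str
    database: str
    fill_rate_gb_per_hour: float
    hours_left: float

    def title(self) -> str:
        # Answer the three questions up front: where, which resource, how fast.
        unit = "hour" if self.hours_left == 1 else "hours"
        return (
            f"[{self.region}] Database {self.database} is filling at a rate of "
            f"{self.fill_rate_gb_per_hour:g} GB per hour and will run out of "
            f"space in ~{self.hours_left:g} {unit}."
        )


alert = DiskAlert("us-east-1", "pikachu", fill_rate_gb_per_hour=1.0, hours_left=1.0)
print(alert.title())
```

The point of keeping the fields separate from the title is that the same data can later feed routing or severity decisions, while the human only ever reads one sentence.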
However, I still lack one major ingredient.
Aviation is one of the safest forms of transportation. There are 3 ingredients that make it safe:
- Well trained pilots
- A lot of automation
- A checklist
If you’ve ever looked out of the window of an airplane just before takeoff, you’ve probably noticed that all the flaps suddenly go up and down, back and forth. The reason is that the pilots are running a pre-takeoff checklist. It doesn’t matter if you are a novice pilot or a veteran one: before every takeoff, you run a checklist to check the critical systems of the airplane.
For some mysterious reason, we engineers don’t do that. Imagine the following alert: Subsequent requests to service xyz failed continuously for the past 15 minutes. This alert monitors the status of some xyz service by issuing requests to it, and now indicates that those requests have failed. For the sake of example, let’s assume that xyz is an external service you depend on. It could be an email provider, or an API that gives you weather information.
One potential approach would be to start looking at the logs. Chances are, if this alert is infrequent, you won’t remember the exact log index or query you need to execute. And so you will try to fish for the right query to understand why requests to this service fail. After eventually finding the correct log entry, and seeing that the error code is 503, you now start to defragment your fragile memory, and the neurons in your brain start to catch fire while you try to remember who the relevant person is to reach out to in order to verify whether xyz is having issues on their side. After aging faster than you should have, you open a support ticket with xyz customer support, and they notify you: yeah, we have some issues on our side.
Humans are imperfect. And even though we engineers work with the same system every day, just like pilots who fly every day, we might make mistakes, just like pilots. But pilots have checklists. Instead of remembering a bunch of pre-takeoff procedures, they just need to remember one thing: run the damn checklist!
This entire issue could have been resolved if we had a checklist like this:
- Identify the error code returned from xyz by following [this link] to the logs and looking for the field
- If the code is 503, open a support ticket using [this form] and verify whether xyz is having problems on their side
Not only does this guide you step by step on how to identify, and possibly resolve, the issue; it also functions as future documentation for new engineers who join your on-call schedule.
I don’t know the best way to manage those checklists. I tend to hate Wiki pages, as they become a pile of unsearchable garbage, but if you are able to keep a clean Wiki, it could work for you. You can have it pinned in your team’s Slack channel. If you have other creative solutions, please feel free to share them with me (contact details are at the bottom of this post).
The best place would be inside the alert itself. Once I get paged, I just go to the alert and see the checklist there. Moreover, I can edit and fine-tune the alert checklist as I go, or after I’ve finished with the incident. I was surprised to find no monitoring system that offers such capabilities.
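To show what I mean, here is a small sketch of an alert that carries its own checklist. Everything in it is hypothetical: the `Alert` class is mine and does not correspond to any real monitoring product, and the bracketed links are the same placeholders used in the checklist earlier in this post.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical alert that ships with the steps to handle it."""
    title: str
    checklist: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the title followed by the checklist as numbered steps."""
        lines = [self.title]
        for number, step in enumerate(self.checklist, start=1):
            lines.append(f"{number}. {step}")
        return "\n".join(lines)


alert = Alert(
    title="Subsequent requests to service xyz failed continuously for the past 15 minutes",
    checklist=[
        "Identify the error code returned from xyz by following [this link] to the logs",
        "If the code is 503, open a support ticket using [this form]",
    ],
)
print(alert.render())
```

Because the checklist is plain data attached to the alert, editing it after an incident is just a matter of changing a list, which is exactly the kind of low-friction update I would want.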
Remember: incidents and pages are stressful events, and people are imperfect. We tend to forget how things work, even if we work with them on a daily basis, and especially under stress. By eliminating as much decision-making as possible during incidents, we can create a better experience and a faster resolution of production incidents.
I use checklists a lot in my life. This blog post, and all the others I write, follow an always-evolving checklist through all the publication stages. If you are interested in how I manage my writing, drop me a comment in one of the contact channels below. Until next time.