Being on-call for software can suck. It seems like everyone has a horror story of being woken up at 3am for an outage in their service, and then having to work until daylight to get things working again. I have been on an on-call rotation for about two years now and it has gone very well. We typically get paged during off hours once per week and resolving a page usually takes under three hours. Here are some thoughts on what makes an on-call experience great.
The biggest thing in my opinion is that the page frequency is on the lower end at 1-2 off hours pages per week. Off hours just means outside of the typical business hours of 9-5. A wake up would be getting paged when you were asleep in the evening. Having a low rate of pages is important because it means your on-call can get adequate rest during the week. In the worst case scenario where your on-call gets woken up twice during a week long on-call shift, they will still be operating reasonably well by week end. Page frequency is especially important if you on-call rotation is smaller.
Oncall rotation size is important. You need people to spread the load around. I’ve worked in rotations ranging from 3-11 people and 5-10 is the sweet spot. At that size you have about a month off between on-call shifts. Beyond 10 people and you will start to get rusty since people are only on-call about every 2 months. Having more people also makes it easier to support vacations without anyone feeling like they didn’t get a break from being on-call. Smaller rotations are bad because the engineers on the team will not get enough time between shifts to complete project work. Feature development will stall and your team basically becomes an ops team. Additionally, the bus factor is too low in a small rotation, if one person goes on vacation and the other has a power outage you might ended up with no one to respond to a page.
On my team we have a clear list of things we do in response to a page. Anything else will be left for business hours.
The things we do are;
1. Scale up or down the fleet
2. Turn on or off a feature toggle
3. Rollback a deployment
4*. Rollforward a fix
Note that rolling forward or patching production are the last resort. Making a code change is the slowest way to address an outage and the highest risk. Whenever possible you want to make code changes during normal office hours.
Having good runbooks reduces the cognitive load when dealing with a service outage. They can also save significant amounts of time to fix common problems just by having the steps taken to fix last time recorded. In your regular on-call shift review meeting its best to add new entries to the runbook to cover pages during that week
How to know your oncall team is in a good place?
- If it is easy for people on the team to get someone to cover for them when they go on vacation.
- People on the team volunteer to be oncall for peak events
- People don’t complain about being oncall during 1 on 1 reviews