How to make oncall great on your team.

Being on-call for software can suck. It seems like everyone has a horror story of being woken up at 3am for an outage in their service, and then having to work until daylight to get things working again. I have been on an on-call rotation for about two years now and it has gone very well. We typically get paged during off hours once per week and resolving a page usually takes under three hours. Here are some thoughts on what makes an on-call experience great. 

Page Frequency 

The biggest thing in my opinion is that the page frequency is on the lower end at 1-2 off hours pages per week. Off hours just means outside of the typical business hours of 9-5. A wake up would be getting paged when you were asleep in the evening. Having a low rate of pages is important because it means your on-call can get adequate rest during the week. In the worst case scenario where your on-call gets woken up twice during a week long on-call shift, they will still be operating reasonably well by week end. Page frequency is especially important if you on-call rotation is smaller. 

Rotation size

Oncall rotation size is important. You need people to spread the load around. I’ve worked in rotations ranging from 3-11 people and 5-10 is the sweet spot. At that size you have about a month off between on-call shifts. Beyond 10 people and you will start to get rusty since people are only on-call about every 2 months. Having more people also makes it easier to support vacations without anyone feeling like they didn’t get a break from being on-call. Smaller rotations are bad because the engineers on the team will not get enough time between shifts to complete project work. Feature development will stall and your team basically becomes an ops team. Additionally, the bus factor is too low in a small rotation, if one person goes on vacation and the other has a power outage you might ended up with no one to respond to a page. 

Clear duties

On my team we have a clear list of things we do in response to a page. Anything else will be left for business hours. 

The things we do are;

1.  Scale up or down the fleet

2. Turn on or off a feature toggle

3. Rollback a deployment 

4*. Rollforward a fix

Note that rolling forward or patching production are the last resort. Making a code change is the slowest way to address an outage and the highest risk. Whenever possible you want to make code changes during normal office hours. 

Good Runbooks

Having good runbooks reduces the cognitive load when dealing with a service outage. They can also save significant amounts of time to fix common problems just by having the steps taken to fix last time recorded. In your regular on-call shift review meeting its best to add new entries to the runbook to cover pages during that week 

How to know your oncall team is in a good place?

  1. If it is easy for people on the team to get someone to cover for them when they go on vacation.
  2. People on the team volunteer to be oncall for peak events 
  3. People don’t complain about being oncall during 1 on 1 reviews

Software Leviathans and the weird dominance of good enough.

One day in Spring 1989, I was sitting out on the Lucid porch with some of the hackers, and someone asked me why I thought people believed C and Unix were better than Lisp. I jokingly answered, “because, well, worse is better.” We laughed over it for a while as I tried to make up an argument for why something clearly lousy could be good.

https://www.dreamsongs.com/WorseIsBetter.html

It has long been wondered why Java took the crown for the ‘enterprise’ language. I can’t really argue on that topic since I came onto the scene long after Java was all there was. This article is about why software leviathans are written in Java more than anything else. 

You have a huge software project to build. What language do you build it in? The prototype was written in ruby on rails by one guy and an Adderal prescription. Now they want you to scale this thing to a 1000+ engineers over 5 years of development. You might think “aha this is my chance, lets save an order of magnitude lines of code and use lisp”, except this story happened in the past and they chose Java. 

Why is it always Java? Sure it’s reasonably fast, but Facebook made PHP work, can’t we at least use Haskell? Since we have the benefit of hindsight, we know that most of the biggest software systems are built in Java. Google built so many leviathans in Java that they bankrolled a new language like Java but with less features. Amazon is based on Java. Netflix is java again. Facebook made their own language and Microsoft is old enough to have existed before Java, but still made their own version of Java, C#. 

The real question should be, “What is Java’s secret?”. 

Java requires a lot of boiler plate

Java just plays well with the major constraints in a software leviathan and at Leviathan scale that is all that matters. 

This one is the corollary of “Java doesn’t support meta-programming”. Creating your own DSL is great, 1000 engineers creating their own DSL is 999 nightmares. Software Leviathans are too big for any one engineering team to understand. Any DSL you create makes your code unintelligible to the rest of the people working in hell with you. I can understand boilerplate written by a monkey, but a DSL written by another software engineer could take me days to understand. When your team gets poached to go work on a startup where the code base isn’t humongous, it’s a lot easier to bring in Java programmers to replace you lot than it would be to get Haskell engineers to figure out your undocumented dialect. 

Google got to the point where they figured Java had too much meta-programming ability so they created Go which is basically Java without inheritance. That is what happens when you work in a leviathan project. You begin to resent the ability of your peers to do anything unusual, because you know it’s just going to be more work for you. 

Adding more onboarding time to understand 1) the functional language and 2) the DSL your team created might push our already long 6 month on-boarding period closer to the 1 year mark. I wrote an article about onboarding time and functional languages aimed at startups, but honestly I don’t think the hiring market is the real reason Java dominates the top end. FAANG is already willing to train new grads to work on their giant software projects. 

It boils down to comprehension honestly. Humans can only comprehend so many things and at leviathan scale the max is a tiny fraction of the entire system. 

In a software leviathan your team constantly works with other teams’ systems. How does this API work? There isn’t any documentation and one 30 minute office hours isn’t going to explain that hair ball. If you all use the same language and that language is Java there is a chance you can open up their code base and figure out what is going on. They probably didn’t do anything you wouldn’t expect like pre-allocating all of their memory and storing all objects into a ring buffer. But if they did do something crazy you can probably figure it out. Besides Java doesn’t have anything like Scalaz so you won’t be surprised by a functor where you weren’t expecting it. 

Lets take the opposite side, away team work. You have been given the glorious task of implementing a new feature. But it’s impossible to do it cleanly without an API change in another team’s system. That team fully supports the change and has contributed 2 paragraphs to your architecture document describing the change to make in their system. But the change isn’t on their team’s roadmap so you are going to have to do it. 

Getting their service to run and pass integration tests in your virtual development machine takes a week. Now you need to navigate their system where they have conveniently used dependency injection to ensure that you can’t know which of the 5 implementations of this interface is in play. Do you still wish the other team could use Clojure? You might never figure out the DSL. 

Have you ever looked at somebody else’s Lisp code and wondered what was inside the variables? Now imagine this is your job and you will spend the next month making a 200 line change to a 100,000 line of code API service you didn’t know existed until this week. Except this will happen every quarter for the rest of your career. 

People complain about how Java forces you to write the type of things everywhere, but for software leviathans this is a benefit. I can see helpful type signatures everywhere, whether I’m reading your code in my IDE, an email, an excerpt in an arch doc, or in a Slack message you sent me at 3 am. 

Java and Go are great in Software Leviathans. You don’t have to worry about stumbling upon a programming mystery created 10 years ago by a disgruntled new grad. You can expect a consistent syntax and language whichever microservice you are working on. The code has self-documenting types that are ‘easy’ to understand. Honestly, they are a lot of benefits which make a tough coding environment a little more manageable.

Software Leviathans

Dis-economies of scale, why FAANG pays high salaries, the dominance of Java

The top end of software engineering jobs are dominated by what I’ve started thinking of as ‘Software Leviathans’, large software systems that are staffed by thousands of engineers. A few that come to mind are Amazon Alexa, Amazon.com, Google Search, Salesforce, Facebook.com. These are not “monoliths’ or large services that do everything. Instead they are the result of combining 100s of smaller ‘micro-services’ into one massive software product. 

These leviathans do many many things, few people on the planet can claim to know all of the features of facebook.com. It is quite possible that there exists no single list that enumerates every feature in that product. 

Similarly, development on these systems happens in parallel across many teams. It it is essentially impossible for any one person to keep track of everything that is being added to the system. 

Leviathans are too big for anyone to understand. It doesn’t matter what architecture or runtime choices are made. It could be one massive JVM, a million lambda functions, a hundred thousand docker containers or thousands of micro-services. Even if you work on the leviathan, you won’t have any real understanding of the total state of the system. Each engineer will be aware of and communicate with a tiny fraction of the total number of people working inside the leviathan. 

Leviathans are heterogeneous systems. The do not do ‘one thing well’. Leviathans do everything you can think of. Google.com is a search engine, but it’s also a calculator, an advertising system, a web scraper, a hotel booking tool, a flight booking tool, and many more. Leviathans grow in parallel, across myriad tentacles of functionality. New features emerge all the time usually to the surprise of other engineers on the project. 

Leviathans are difficult to work in. Despite appearing to be a sea of constant change from the outside. Any change made inside the Leviathan is extremely expensive in engineering hours. There are thousands of potential interactions each engineering team has to consider when evaluating changes to their system. The architecture must be constrained heavily to support parallel development in environments where coordination between different teams is impossible due to scale. Engineers working on a software leviathan spend a relatively small fraction of their time actually writing code as compared to debugging issues, research, coordinating changes, and documenting. 

Leviathans are interesting because they are the ‘core’ services powering the digital world these days. Their scale is at top of the chart in the software engineering world and as a result they expose the limitations of software engineering. 

Software diseconomies of scale are at their most evident in these software leviathans. They are massive projects with huge numbers of the best engineers working on them. But development is slow per engineer and code quality is not clearly superior to industry best practices. 

Don’t use internal tooling, contribute to your tools.

There is a dichotomy in software engineering organizations, some only use public tooling, entirely avoiding building their owns tools. Other companies  follow the ‘Not invented here’ principle and try to only use software developed internally. 

There are two forces driving this split, first building software tooling is expensive, small companies often cannot afford to build and support internal tools. It is an expensive recurring cost that can easily get out of hand. Building 2 internal tools a year will set you on a path to supporting 20 tools 10 years from now eating huge chunks of your operations budget. 

The other force in play is that if you use public or off the shelf tooling you will encounter workflow discontinuities that are difficult to fix. Using off-the-shelf tool A with tool B might require an entire employee to bridge the gaps manually, while still be a pain for everyone involved. Decision makers at some companies think to themselves, “We can just build a tool B that works perfectly with our use case.” And that works great when you have one tool, but then when need C comes along your team  makes the same case again, “We already invested in custom tool B we can’t throw away that work, we need a new custom tool for need C”. And now your company is on the path towards building an alphabet’s worth of internal tools that aren’t useful outside of your business.

Luckily, there is a solution to this. Use Open Source tooling and when you run into a workflow problem, work with the maintainers to contribute a fix. Even in poorly managed projects extending the code to support your use case will be less costly than building an entirely new internal tool to solve the same problem. 

Software is a depreciating Good

“If customers are paying for the work, presumably by the hour, then you can’t expect them to pay for weeks (or more) of work every few years just to arrive back at the same product they started with.”
– From reddit user /u/PragmaticFinance

People have the idea that because the software has the same features as it did five years ago, it still is just as ‘good’ as it was five years ago. And if we can keep running it without modifications that is a reasonable practice. Unfortunately, while it feels intuitive that is not the reality with software. 

Sorry, a software product that is built on out of date tooling is strictly inferior to software built on up-to-date tooling. It is not ‘just as good’ it is significantly worse. The out of date software has increased vulnerability to security defects. Python 2.7 doesn’t get security patches anymore, if your software is built on it and a zero day is discovered. Your software could be unoperable for weeks while engineers upgrade it to function on Python 3. 

Software that can’t be patched to address security vulnerabilities in a reasonable time frame is strictly worse than up to date software. Anyone who was evaluating a purchase of these two programs would consider the expense of making changes. 

Aside from security issues, out of date software is harder to modify. What are the banks going to do when the last Cobol programmer dies? They will have to fund Cobol bootcamps which is not going to be cheap.

What are companies that had ‘feature complete’ software that didn’t need to change in years doing now that GDPR is a thing? That software is getting dusted off and updated or entirely replaced.

Roads and tractors require constant maintenance just to keep doing the same job. Software is the same. Maintenance costs should be estimated and included in the total cost of ownership for software. 

A lot of business software just encodes business processes. It can get out of date because the world shifted and the process changed, or the world shifted and the way the process is encoded is out of date.

Each time I write one of these technical debt posts it helps me understand why Software As A Service took over in such a big way. If you needed this year’s model of tractor every year you would lease it.