Software Leviathans

Dis-economies of scale, why FAANG pays high salaries, the dominance of Java

The top end of software engineering jobs are dominated by what I’ve started thinking of as ‘Software Leviathans’, large software systems that are staffed by thousands of engineers. A few that come to mind are Amazon Alexa, Amazon.com, Google Search, Salesforce, Facebook.com. These are not “monoliths’ or large services that do everything. Instead they are the result of combining 100s of smaller ‘micro-services’ into one massive software product. 

These leviathans do many many things, few people on the planet can claim to know all of the features of facebook.com. It is quite possible that there exists no single list that enumerates every feature in that product. 

Similarly, development on these systems happens in parallel across many teams. It it is essentially impossible for any one person to keep track of everything that is being added to the system. 

Leviathans are too big for anyone to understand. It doesn’t matter what architecture or runtime choices are made. It could be one massive JVM, a million lambda functions, a hundred thousand docker containers or thousands of micro-services. Even if you work on the leviathan, you won’t have any real understanding of the total state of the system. Each engineer will be aware of and communicate with a tiny fraction of the total number of people working inside the leviathan. 

Leviathans are heterogeneous systems. The do not do ‘one thing well’. Leviathans do everything you can think of. Google.com is a search engine, but it’s also a calculator, an advertising system, a web scraper, a hotel booking tool, a flight booking tool, and many more. Leviathans grow in parallel, across myriad tentacles of functionality. New features emerge all the time usually to the surprise of other engineers on the project. 

Leviathans are difficult to work in. Despite appearing to be a sea of constant change from the outside. Any change made inside the Leviathan is extremely expensive in engineering hours. There are thousands of potential interactions each engineering team has to consider when evaluating changes to their system. The architecture must be constrained heavily to support parallel development in environments where coordination between different teams is impossible due to scale. Engineers working on a software leviathan spend a relatively small fraction of their time actually writing code as compared to debugging issues, research, coordinating changes, and documenting. 

Leviathans are interesting because they are the ‘core’ services powering the digital world these days. Their scale is at top of the chart in the software engineering world and as a result they expose the limitations of software engineering. 

Software diseconomies of scale are at their most evident in these software leviathans. They are massive projects with huge numbers of the best engineers working on them. But development is slow per engineer and code quality is not clearly superior to industry best practices. 

Everyone uses (failing) software all the time.

Because you use it all the time at least one piece of software is broken for you at all times.

I stopped using Facebook after my freshman year of college, but recently got pulled back in by a Facebook group. As a result I now have the pleasure of enjoying a 10+ second loading phase every time I open the homepage. 

Recently, I tried to buy a CODE mechanical keyboard on the wasdkeyboards.com website. But every time I submitted my order it failed. I tried different browsers. I had to look into the console to find out that a http request was failing to find a paypal advertising domain that my PiHole blocks on the network. To buy my keyboard I had to tether wifi from my smartphone. A non-technical user wouldn’t have been able to find out why the order failed because there was no error message. There was a spinning symbol that just disappeared after a while without a message to the user. 

Everyone uses software all the time now. We have smartphones, smart TVs, smart refrigerators and smart homes. If you use 100 programs a day, 99% uptime means one program is down for every person. If every application manages 99.9% uptime, one out of a hundred people is experiencing software brokenness everyday. 

Then realize that billions of people have smartphones now. 

99.99% * 1,000,000,000 = 100,000. 

If your software has a billion users and works 99.99% of the time, its down for 100,000 people all the time.