Software Leviathans

Diseconomies of scale, why FAANG pays high salaries, the dominance of Java

The top end of software engineering jobs is dominated by what I’ve started thinking of as ‘Software Leviathans’: large software systems staffed by thousands of engineers. A few that come to mind are Amazon Alexa, Amazon.com, Google Search, Salesforce, and Facebook.com. These are not ‘monoliths’ or large services that do everything. Instead, they are the result of combining hundreds of smaller ‘micro-services’ into one massive software product.

These leviathans do many, many things; few people on the planet can claim to know all of the features of facebook.com. It is quite possible that no single list enumerating every feature in that product even exists.

Similarly, development on these systems happens in parallel across many teams. It is essentially impossible for any one person to keep track of everything being added to the system.

Leviathans are too big for anyone to understand. It doesn’t matter what architecture or runtime choices are made: one massive JVM, a million lambda functions, a hundred thousand Docker containers, or thousands of micro-services. Even if you work on the leviathan, you won’t have any real understanding of the total state of the system. Each engineer will be aware of, and communicate with, a tiny fraction of the total number of people working inside the leviathan.

Leviathans are heterogeneous systems. They do not do ‘one thing well’; leviathans do everything you can think of. Google.com is a search engine, but it’s also a calculator, an advertising system, a web scraper, a hotel booking tool, a flight booking tool, and much more. Leviathans grow in parallel, across myriad tentacles of functionality. New features emerge all the time, usually to the surprise of other engineers on the project.

Leviathans are difficult to work in. Despite appearing from the outside to be a sea of constant change, any change made inside the leviathan is extremely expensive in engineering hours. There are thousands of potential interactions each engineering team has to consider when evaluating changes to their system. The architecture must be heavily constrained to support parallel development in an environment where coordination between teams is impossible due to scale. Engineers working on a software leviathan spend a relatively small fraction of their time actually writing code, compared to debugging issues, doing research, coordinating changes, and writing documentation.

Leviathans are interesting because they are the ‘core’ services powering the digital world these days. Their scale is at the top of the chart in the software engineering world, and as a result they expose the limitations of software engineering.

Software diseconomies of scale are at their most evident in these software leviathans. They are massive projects with huge numbers of the best engineers working on them, yet development per engineer is slow and code quality is not clearly superior to industry best practices.

Don’t use internal tooling, contribute to your tools.

There is a dichotomy in software engineering organizations: some use only public tooling, entirely avoiding building their own tools, while others follow the ‘not invented here’ principle and try to use only software developed internally.

There are two forces driving this split. First, building software tooling is expensive; small companies often cannot afford to build and support internal tools. It is an expensive recurring cost that can easily get out of hand: building two internal tools a year sets you on a path to supporting twenty tools ten years from now, eating huge chunks of your operations budget.
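To make that compounding concrete, here is a back-of-the-envelope sketch in Python. The build cost and maintenance rate are made-up numbers for illustration, not figures from any real company:

```python
# Back-of-the-envelope sketch: the compounding cost of building two
# internal tools per year. All dollar figures are assumptions.
BUILD_COST = 200_000     # assumed one-time cost to build one tool ($)
UPKEEP_RATE = 0.20       # assumed yearly maintenance, as a fraction of build cost
TOOLS_PER_YEAR = 2

for year in (1, 5, 10):
    tools_supported = TOOLS_PER_YEAR * year
    upkeep = tools_supported * BUILD_COST * UPKEEP_RATE
    print(f"Year {year:2d}: supporting {tools_supported:2d} tools, "
          f"${upkeep:,.0f}/year in maintenance alone")
```

At these assumed rates, by year ten the maintenance bill alone is $800,000 a year, before you build anything new.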

The other force in play is that public or off-the-shelf tooling comes with workflow discontinuities that are difficult to fix. Getting off-the-shelf tool A to work with tool B might require an entire employee to bridge the gaps manually, while still being a pain for everyone involved. Decision makers at some companies think to themselves, “We can just build a tool B that works perfectly for our use case.” That works great when you have one tool, but then when need C comes along, your team makes the same case again: “We already invested in custom tool B; we can’t throw away that work, so we need a new custom tool for need C.” And now your company is on the path toward building an alphabet’s worth of internal tools that aren’t useful outside of your business.

Luckily, there is a solution to this: use open source tooling, and when you run into a workflow problem, work with the maintainers to contribute a fix. Even in poorly managed projects, extending the code to support your use case will be less costly than building an entirely new internal tool to solve the same problem.

Software is a depreciating good

“If customers are paying for the work, presumably by the hour, then you can’t expect them to pay for weeks (or more) of work every few years just to arrive back at the same product they started with.”
– Reddit user /u/PragmaticFinance

People have the idea that because software has the same features it had five years ago, it is still just as ‘good’ as it was five years ago, and that keeping it running without modification is therefore a reasonable practice. Unfortunately, however intuitive that feels, it is not the reality with software.

Sorry, but a software product built on out-of-date tooling is strictly inferior to software built on up-to-date tooling. It is not ‘just as good’; it is significantly worse. Out-of-date software has increased vulnerability to security defects. Python 2.7 no longer receives security patches; if your software is built on it and a zero-day is discovered, your product could be inoperable for weeks while engineers upgrade it to run on Python 3.
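To make that concrete, here is a minimal sketch of ordinary code that runs fine on Python 2.7 but breaks on Python 3. The snippets are illustrative, not from any particular codebase:

```python
# Each line below is valid, working Python 2.7 and fails (or silently
# changes behavior) on Python 3.

print "processing order"   # Python 3: SyntaxError, print is now a function

ratio = 1 / 2              # Python 2: 0 (integer division); Python 3: 0.5

raw = "caf\xc3\xa9"        # Python 2: a byte string
print raw.decode("utf-8")  # Python 3: str has no .decode() method
```

None of these are exotic; an upgrade means auditing every print, every division, and every string in the codebase, which is where those weeks of engineering time go.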

Software that can’t be patched to address security vulnerabilities in a reasonable time frame is strictly worse than up-to-date software. Anyone evaluating a purchase of these two programs would factor in the expense of making changes.

Aside from security issues, out-of-date software is harder to modify. What are the banks going to do when the last COBOL programmer dies? They will have to fund COBOL bootcamps, which is not going to be cheap.

What are companies that had ‘feature complete’ software that didn’t need to change for years doing now that GDPR is a thing? That software is getting dusted off and updated, or replaced entirely.

Roads and tractors require constant maintenance just to keep doing the same job. Software is the same. Maintenance costs should be estimated and included in the total cost of ownership for software. 

A lot of business software just encodes business processes. It can fall out of date because the world shifted and the process itself changed, or because the world shifted and the way the process is encoded became obsolete.

Each time I write one of these technical debt posts, it helps me understand why Software as a Service took over in such a big way. If you needed this year’s model of tractor every year, you would lease it.

2020: Age of the remote conference

At this point, in-person conferences with thousands of attendees are done for the year. Lockdowns are easing in general, but most people won’t be comfortable going to a massive conference with thousands of people from all over the world anytime soon.

We are likely to be disrupted by this pandemic until spring 2021, by which point travel will hopefully be back to normal. I’ve been working from home for months now and will officially be doing so until October.

The thing is, conferences bring a lot of value to engineering. They are a great way to keep up with what people are doing in industry and to share your own experience. I’ve attended some great conferences and enjoyed diving into the technical depth available at a conference devoted to Spark or Kubernetes.

For 2020, these kinds of tech talks and presentations have to move online to YouTube and Zoom, just like our work has. Fortunately, there are some benefits to running a tech conference remotely.

A remote conference can be a lot more affordable. You don’t need to rent a big conference venue to host the talks, and attendees save money by not flying to a different city or renting hotel rooms. 

You can also save time because you aren’t traveling. Just as you don’t need to commute to the office, you don’t need to travel to the conference. The conference is wherever you are.

The technology is also good for presentations and Q&A sessions. Teleconferencing shines in situations where you want to share screens and only a few people need to speak at once. Where it really breaks down is group discussion; it is much harder to interleave speakers. But if we focus on activities with a single presenter, or a Q&A where someone asks a question and then stops talking, teleconferencing is almost as good as being in person.

I think in 2020 we will see a lot of remote mini-conferences. They are really economical to host, and the timing couldn’t be better. I like the idea so much that I have decided to turn the TinyConf I was planning for this year into a remote conference.

Expecting end users to customize the experience is madness

Don’t do it to yourselves.

Don’t do it to customers.

Do the work to make a good product.

Enterprise software sucks. It’s not bought by the people using it, but by a guy wearing a suit on the 37th floor, the day after eating a fabulous steak dinner paid for by Oracle sales guys. By the time you start using it, it is bought and paid for. Suck it up and learn how this pile of code works.

Internal enterprise software is another beast. Constantly underfunded, built by interns who just learned object-oriented programming, and designed by the CEO’s cousin, it is not the greatest.

Know what will ensure that your internal software is never improved in a meaningful way? Make customizing it the default workflow. Just have every engineer at the company load up a GreaseMonkey script that adds the features the PaaS should have shipped by now.

The problem is fixed for the graybeards. Sure, every new employee will spend six months realizing that all the people who are getting anything done have customized the UI so extensively that it’s not recognizable as the same product.

When they said go use ‘deployment ladder’, they meant use ‘deployment ladder’ with 12 GreaseMonkey scripts installed. Where are those scripts, you might ask? The answer is always ‘in the wiki’. Searching the wiki for the name of the thing does not turn up the thing, the way a Google search would for an open source project.

Having everyone customize the software does not result in a good product. It papers over a shitty UI by fragmenting it even more. After a while, no one with any power in your organization realizes there is a problem, because they have 50 GreaseMonkey scripts installed and haven’t looked at the actual ‘base’ UI in five years.

Save yourself millions in onboarding. Invest in good tools. Put the work into offering a great default workflow. Don’t end up in a situation where the graybeards can’t even understand the workflows the new hires are dealing with.