Software Industry – Page 7 – Sledgeworx Software

Software advances are slower than you expect.

Most people think of technological advances using the eureka metaphor. But software doesn’t work like that. Take a clear technological advance like a self-driving tractor. They are on the market now but there was no eureka moment, no sudden breakthrough. Self-driving technology just advanced to the base level required for tractors on clearly mapped fields, then a team of software engineers built a working system over several years.

There’s no invention or breakthrough moment, just a slow build in none software capabilities than an investment in building out software to leverage those capabilities.

There were no software engineering advances that enabled the self-driving tractor. Instead machine vision improved until it was good enough to unblock the software solution.

What advances in software are in sight?

Reproducible builds are one of my favorite advances in software recently. Strictly speaking reproducible builds have been available for quite some time, but that doesn’t mean it isn’t a real advance. Having a trustworthy build environment where you can debug the inputs vs the outputs for different machines is a big benefit. Without reproducible builds software engineers end up spending extra time fixing build failures and figuring out dependency resolution overrides.

Golang is in my opinion an advancement in software engineering. Instead of focusing on productivity for a single or small number of engineers, Golang is focused on maximizing productivity for large software projects with thousands of engineers. It does this by simplifying the language as much as possible to ease readability and eliminate magical effects. Everything in Golang is very procedural and easy to follow.

Easy to use gradual typing is a more recent advancement in the Python and Ruby worlds. You can now build your MVP in a dynamic quick evaluating language and then after you reach product market fit you can add types to the code base. Typing is no longer and either/or thing but a sliding scale where you can chose the most opportune time to transition. Overall I think gradual typing has huge advantages compared to the old standard of rewriting the codebase in Java.

How to make oncall great on your team.

Being on-call for software can suck. It seems like everyone has a horror story of being woken up at 3am for an outage in their service, and then having to work until daylight to get things working again. I have been on an on-call rotation for about two years now and it has gone very well. We typically get paged during off hours once per week and resolving a page usually takes under three hours. Here are some thoughts on what makes an on-call experience great.

Page Frequency

The biggest thing in my opinion is that the page frequency is on the lower end at 1-2 off hours pages per week. Off hours just means outside of the typical business hours of 9-5. A wake up would be getting paged when you were asleep in the evening. Having a low rate of pages is important because it means your on-call can get adequate rest during the week. In the worst case scenario where your on-call gets woken up twice during a week long on-call shift, they will still be operating reasonably well by week end. Page frequency is especially important if you on-call rotation is smaller.

Rotation size

Oncall rotation size is important. You need people to spread the load around. I’ve worked in rotations ranging from 3-11 people and 5-10 is the sweet spot. At that size you have about a month off between on-call shifts. Beyond 10 people and you will start to get rusty since people are only on-call about every 2 months. Having more people also makes it easier to support vacations without anyone feeling like they didn’t get a break from being on-call. Smaller rotations are bad because the engineers on the team will not get enough time between shifts to complete project work. Feature development will stall and your team basically becomes an ops team. Additionally, the bus factor is too low in a small rotation, if one person goes on vacation and the other has a power outage you might ended up with no one to respond to a page.

Clear duties

On my team we have a clear list of things we do in response to a page. Anything else will be left for business hours.

The things we do are;

1. Scale up or down the fleet

2. Turn on or off a feature toggle

3. Rollback a deployment

4*. Rollforward a fix

Note that rolling forward or patching production are the last resort. Making a code change is the slowest way to address an outage and the highest risk. Whenever possible you want to make code changes during normal office hours.

Good Runbooks

Having good runbooks reduces the cognitive load when dealing with a service outage. They can also save significant amounts of time to fix common problems just by having the steps taken to fix last time recorded. In your regular on-call shift review meeting its best to add new entries to the runbook to cover pages during that week

How to know your oncall team is in a good place?

If it is easy for people on the team to get someone to cover for them when they go on vacation.
People on the team volunteer to be oncall for peak events
People don’t complain about being oncall during 1 on 1 reviews

Software Leviathans strain on the programming job market.

Why I’m not worried about H1B, outsourcing or remote work.

Software leviathans dominate the market due to diseconomies of scale. Leviathans are a bit of a self-fulfilling prophesy. You create a thing like Facebook and it starts to take off. Then you find a way to make money off of it. Then due to marginal costs you end up hiring 10,000 engineers to maximize the value of ads on facebook.

Software that is valuable gets bigger over time. Due to diseconomies of scale it gets even more expensive to maintain. But counteracting these diseconomies of scale are the natural monopolies like Facebook, which solve the problem by pouring more money into it. Hiring the absolute best programmers to fight the information problem back a little longer.

Leviathans drive demand for the best programmers. And importantly that demand is far above the number of engineers at the peak of skill. This demand has so far created a bifurcation in the job market with FAANG salaries and restricted stock options surging ahead of pay in the rest of the market. The bifurcation has persisted over the last decade despite FAANG opening offices in India and China, H1B visas and the surge in new Computer Science majors entering the market.

Now in 2020 the big shock is remote work. We just spent the last year working remotely. Lots of accountants are thinking to themselves, “Why are we paying San Fransisco salaries when we could be paying less than half that anywhere else on the planet?”

We are moving towards a ‘Remote first’ programming market. Where anyone in the correct timezones can fill any role at a top tier company. This should reduce compensation a bit since the cost of real estate in a few cities has been a major driver of FAANG salaries. But it won’t change the fundamental problem which is diseconomies of scale in software. FAANG and other big tech companies will still pay higher compensation than everyone else. The terms will just be a little different, instead of making 500k in Seattle senior engineers will make $200k anywhere they want to live.

You might think this is a bad thing for software engineers since we will be getting paid less overall. But that misses two important factors the first being land costs. Not everyone wants to live in San Fransisco, San Jose, Seattle and New York. I for one would never have moved to Seattle if I wasn’t promised nearly double what I was making in Denver at the time.

The second factor is that remote work is not going to be a software engineer only change. Most other white collar desk jobs can also be done remotely. Which means they will also see a drop in compensation. Lawyers don’t really need to do anything in person, they certainly managed to keep working through the pandemic. Why hire an expensive law firm in Atlanta when you can get the same remote lawyer based in Montana for one fifth of the price?

Remote work reduces the locality of labor. This will result in labor prices globalizing. Programmers salaries will become more consistent across the globe. At the same time diseconomies of scale and the sheer demand for software will act to keep programming demand high.

But other industries that do not have the same level of demand as software will also see their compensation globalize. This will most easily be seen in a reduction in the price of white collar services globally.

The future looks extremely deflationary to me. The prices of white collar labor will drop due to remote work and the price of manual labor will drop due to automation. The people who come out on top will be white collar workers who live in low cost countries and the owners of capital.

Software Leviathans

Dis-economies of scale, why FAANG pays high salaries, the dominance of Java

The top end of software engineering jobs are dominated by what I’ve started thinking of as ‘Software Leviathans’, large software systems that are staffed by thousands of engineers. A few that come to mind are Amazon Alexa, Amazon.com, Google Search, Salesforce, Facebook.com. These are not “monoliths’ or large services that do everything. Instead they are the result of combining 100s of smaller ‘micro-services’ into one massive software product.

These leviathans do many many things, few people on the planet can claim to know all of the features of facebook.com. It is quite possible that there exists no single list that enumerates every feature in that product.

Similarly, development on these systems happens in parallel across many teams. It it is essentially impossible for any one person to keep track of everything that is being added to the system.

Leviathans are too big for anyone to understand. It doesn’t matter what architecture or runtime choices are made. It could be one massive JVM, a million lambda functions, a hundred thousand docker containers or thousands of micro-services. Even if you work on the leviathan, you won’t have any real understanding of the total state of the system. Each engineer will be aware of and communicate with a tiny fraction of the total number of people working inside the leviathan.

Leviathans are heterogeneous systems. The do not do ‘one thing well’. Leviathans do everything you can think of. Google.com is a search engine, but it’s also a calculator, an advertising system, a web scraper, a hotel booking tool, a flight booking tool, and many more. Leviathans grow in parallel, across myriad tentacles of functionality. New features emerge all the time usually to the surprise of other engineers on the project.

Leviathans are difficult to work in. Despite appearing to be a sea of constant change from the outside. Any change made inside the Leviathan is extremely expensive in engineering hours. There are thousands of potential interactions each engineering team has to consider when evaluating changes to their system. The architecture must be constrained heavily to support parallel development in environments where coordination between different teams is impossible due to scale. Engineers working on a software leviathan spend a relatively small fraction of their time actually writing code as compared to debugging issues, research, coordinating changes, and documenting.

Leviathans are interesting because they are the ‘core’ services powering the digital world these days. Their scale is at top of the chart in the software engineering world and as a result they expose the limitations of software engineering.

Software diseconomies of scale are at their most evident in these software leviathans. They are massive projects with huge numbers of the best engineers working on them. But development is slow per engineer and code quality is not clearly superior to industry best practices.

Don’t use internal tooling, contribute to your tools.

There is a dichotomy in software engineering organizations, some only use public tooling, entirely avoiding building their owns tools. Other companies follow the ‘Not invented here’ principle and try to only use software developed internally.

There are two forces driving this split, first building software tooling is expensive, small companies often cannot afford to build and support internal tools. It is an expensive recurring cost that can easily get out of hand. Building 2 internal tools a year will set you on a path to supporting 20 tools 10 years from now eating huge chunks of your operations budget.

The other force in play is that if you use public or off the shelf tooling you will encounter workflow discontinuities that are difficult to fix. Using off-the-shelf tool A with tool B might require an entire employee to bridge the gaps manually, while still be a pain for everyone involved. Decision makers at some companies think to themselves, “We can just build a tool B that works perfectly with our use case.” And that works great when you have one tool, but then when need C comes along your team makes the same case again, “We already invested in custom tool B we can’t throw away that work, we need a new custom tool for need C”. And now your company is on the path towards building an alphabet’s worth of internal tools that aren’t useful outside of your business.

Luckily, there is a solution to this. Use Open Source tooling and when you run into a workflow problem, work with the maintainers to contribute a fix. Even in poorly managed projects extending the code to support your use case will be less costly than building an entirely new internal tool to solve the same problem.