A few things I learned after ~3 years at Amazon

“Two is better than zero” is definitely harmful.

The phrase “two is better than zero”, attributed to Jeff Bezos, drives a lot of the technical dysfunction inside Amazon. I typically explain it to people by saying “Amazon has three internal tools for every job, with one tool’s worth of documentation.” Which is basically true: for every internal problem we had anywhere from zero to three-plus tools to tackle it. When our team came upon an issue we would review the internal solutions; typically one would be deprecated, another had poor documentation, a third kind of fit our use case, and the public AWS service had too much latency.

I’ll describe a few cases to help convey the insanity. 

I worked in Alexa, which has the concept of a ‘prompt’: a piece of text that Alexa reads aloud to customers. There is a central tool that governs prompts, but it expects you to manually handle promoting prompts from dev to prod. So our team built a ‘prompt pipeline’ service that automated prompt promotions via the central service’s API. One of the engineers on our team got promoted off of that project. About six months later I got an email congratulating another engineer on his promotion, with the comment that he had built a prompt pipeline for his team. It isn’t like our teams were far apart from each other; the second team actually contributed code to the service my team operated. There was basically zero reason to build another prompt pipeline. We should have added the features to the central service for everyone.

The test accounts situation inside Amazon is another area of craziness. There are at least three different services that can create accounts; one of them actually works, so we depended on that one. Everything was great: we automatically created accounts and ran end-to-end tests. Months later we started to hit throttling limits on our account creation. We did some research, and it looked like our accounts were getting flagged for fraud and terminated. I reached out to a few fraud teams to find out what was happening. Well, our test accounts looked a lot like accounts created with stolen credit cards and were getting flagged for that reason. I asked if they could whitelist accounts made by the standard test account service we used. They said they didn’t have any support for that and recommended manually curating test accounts.

Amazon doesn’t have a tool like Splunk internally. You can’t search across all your services and get logs that way. The industry standard is to have great log search, to the point that free tools like Kibana support it. Inside Amazon we have ~5 different tools that provide various levels of log viewing and searching. About half of them exist to facilitate using grep to find errors. There is at least one service that supports log searching, but it is honestly rather difficult to figure out how, and it often doesn’t work.

Amazon has multiple infrastructure-as-code tools. We have CloudFormation and the CDK (Cloud Development Kit), which are AWS-sponsored public tools. But we also have a Ruby-based tool (no, not Chef) that does infrastructure as code. Then we have another tool based on YAML. By the end of my time at Amazon my team supported services with components defined across four different infrastructure-as-code tools.
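For readers who haven’t used these tools, here is roughly what infrastructure as code looks like in one of the public options; a minimal sketch using the CDK’s Java bindings (a generic example, not one of our actual service definitions):

```java
import software.amazon.awscdk.App;
import software.amazon.awscdk.Stack;
import software.amazon.awscdk.services.s3.Bucket;

// Minimal CDK (v2) sketch: infrastructure described as ordinary Java code.
// Generic illustration only, not a real Amazon-internal service definition.
public class ExampleInfraApp {
    public static void main(String[] args) {
        App app = new App();
        Stack stack = new Stack(app, "ExampleStack");

        // One versioned S3 bucket; `cdk deploy` turns this into a CloudFormation stack.
        Bucket.Builder.create(stack, "ArtifactBucket")
                .versioned(true)
                .build();

        app.synth();
    }
}
```

Now multiply that by four tools with four different models of the world, all describing pieces of the same set of services.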

The tooling situation inside Amazon isn’t great, mostly, I think, because of this “two is better than zero” philosophy. There are cases where it is awesome to just be able to build what you need, and my team took advantage of that. But over the years, when every tool you use sucks in multiple easily fixable ways, you start to realize the cost you are paying.

Document-driven meetings are great

Amazon takes document-driven meetings seriously. There is always a ‘doc’, hosted in Quip, which is a tool like Google Docs. Everyone on the team can read and comment on the document easily. The first 15-20 minutes of a meeting are devoted to reviewing the document. Then we review the comments and discuss the document during the rest of the meeting. It is great, and you never have to worry about getting people to read the document before the meeting. There are never any PowerPoints, or scenarios where you have to listen to someone describe something at one fifth the speed you could have read it. Another benefit is that you actually have some documentation for every architectural change or new project. If your organization relies on brainstorming and whiteboarding for architectural changes, you may run into a scenario where those never end up documented. At Amazon that isn’t a problem, because you have to have a document if you want to share your new approach.

While great overall, Amazon’s document-driven culture has some shortcomings. The first is that there is no standard way to capture meeting notes. If agreements were made during the meeting, the task of recording the decision usually fell to whoever wrote the document.

Another shortcoming is that no one figured out how to make ‘Agile’ meetings document-driven. So we end up doing sprint planning, retrospectives, and backlog grooming in the old, inefficient way.

A big benefit, in my opinion, is that all quarterly and yearly organizational goals are included in a document, which your director and project managers will share with you. So you actually get to see written goals for your organization, as opposed to hoping the CEO or VP tells you what their latest idea was.

How to know when you really need microservices.

You need microservices when you have too many developers to consistently fit their software through the deployment pipeline. If you can’t manage a weekly deployment, there are too many cooks in the kitchen. My team hit a couple of inflection points along the way from thirty engineers on the service to somewhere around 200. We had 300+ people in our support Slack channel, but there was no way to know whether they all contributed code every release.

One inflection point was when our on-call rotation also became the ‘release engineer’ rotation. At that point we had a minimum of one person assigned to ‘operations’ at all times. Next we reached a point where we had to split our end-to-end tests into separate packages, because our team couldn’t keep track of how all the different features in the service were supposed to work.

Then we started keeping two people on ‘operations’ at all times, because our ticket queue kept growing despite regular ‘bug bashes’. The last level we reached was the one where having two people assigned full time to operations was no longer enough. We were falling behind on deployments, mandatory migrations, and our ticket queue, so we started outsourcing release management to another team. Then we started losing all of our best and most experienced people, which is how I ended up here.

Bad architectural solutions can ‘curse’ your organization for years. 

Architecture is really important. Seemingly simple decisions can have long-term effects that aren’t obvious. In Alexa Shopping, where I worked at Amazon, we operated inside a dynamic workflow engine. To make that happen, every API in our department of 1,200 people shared the same schema. To handle the needs of various components, the schema included an ‘Envelope’ type, which was an array of arbitrary JSON objects.

There was a lot of tooling built around specifying what would be passed into a given node. But the long-run effect was that every API in our service shared the same schema in the form of a single Java type, and each API customized that Java type in arbitrary ways via the envelope. This converted a strictly typed interface into a dynamically typed one. The types were still strict in practice, but the constraints were enforced by another service rather than by Java.
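A minimal sketch of the shape this takes (the names below are hypothetical, not the actual Alexa Shopping types): every API accepts the same request object, and the interesting, per-API inputs travel inside an untyped envelope.

```java
import java.util.List;
import java.util.Map;

// Hypothetical reconstruction of a shared-schema request with an 'envelope'.
// Every API in the organization accepts this same type; the per-API inputs
// hide inside the envelope as arbitrary key/value blobs.
public class WorkflowRequest {
    private String workflowId;
    private String nodeName;

    // The 'envelope': a list of arbitrary JSON-like objects. Which entries are
    // allowed for which API is defined by configuration in another service, so
    // the Java compiler cannot check any of it.
    private List<Map<String, Object>> envelope;

    // Each API digs its own inputs back out at runtime.
    public Map<String, Object> findEntry(String entryType) {
        return envelope.stream()
                .filter(entry -> entryType.equals(entry.get("type")))
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("missing envelope entry: " + entryType));
    }
}
```

Every caller sees the same `WorkflowRequest`, but what actually has to be inside the envelope differs per API and per flow, which is where the trouble starts.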

The envelopes solved a lot of problems, but one thing they made harder was integration tests. Integration or functional tests in this context means tests against your service with the request mocked. We found that software engineers relied heavily on end-to-end tests, but almost no one bothered with, or trusted, the integration tests. The reason came down to the DSL (domain-specific language) used to write our end-to-end tests versus our integration tests.

The DSL for end-to-end tests used natural language to trigger Alexa functionality. You would just write what the customer would say to Alexa, then declare what responses you expected back. It took a lot of work to support, but our end-to-end tests were easy to write and caught a lot of bugs. They were hard to debug in a lot of cases, but still easier than our integration tests.
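To give a feel for that style, here is a hypothetical sketch; the builder below is invented for illustration and is not the real Alexa tooling, but the shape is similar: you state the utterances and the expected replies, and the harness does everything else.

```java
// Hypothetical sketch of a natural-language end-to-end test DSL. Both the
// scenario builder and the test case are invented for illustration.
public class ReorderEndToEndTest {

    public void customerCanReorderPaperTowels() {
        Scenario.named("reorder paper towels")
                .customerSays("Alexa, reorder paper towels")
                .alexaReplies("I found Bounty paper towels, 12 rolls. Should I place the order?")
                .customerSays("Yes")
                .alexaReplies("Okay, I've placed the order.")
                .run(); // the harness turns each line into real requests against real services
    }

    // Minimal stub so the sketch stands alone; the real harness did the heavy lifting.
    static final class Scenario {
        static Scenario named(String name) { return new Scenario(); }
        Scenario customerSays(String utterance) { return this; }
        Scenario alexaReplies(String expectedResponse) { return this; }
        void run() { /* send the utterances, assert on the responses */ }
    }
}
```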

The DSL for integration tests, on the other hand, was much harder to understand. The sticking point was all the work required to create the inputs and expected outputs for our APIs. As mentioned above, each API input included an arbitrary JSON array whose entry types were defined by configuration in another service. And since our APIs were very large, it was very hard to figure out what you needed in your input request for an integration test.

Our APIs also supported multiple flows through the workflow engine, each of which would receive a different subset of input types. So one functional test against your API would receive a certain subset, while the next test would need a different subset of types to trigger a different piece of functionality. For the average software engineer or partner, it was very difficult to figure out what exactly you needed to do differently in your integration test compared to a pre-existing one.

I addressed the problem by simply not writing integration tests at all. Of course, I still had to debug them when I was supporting the pipeline, which was quite difficult. Tests were documented by name only. The failure log typically contained only “Test y failed, expected 4 entities, but received only 2”. Which entities were missing? We didn’t know. We also didn’t know which entities were passed into the test. Figuring that out required reading through several thousand lines across multiple Java classes that set up partial mocking. Another tip: never run an integration test system that mocks only some of its dependencies.
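For contrast, here is a small sketch of the failure message I wish those tests had produced; the helper is entirely hypothetical, not code from the actual framework: assert on the entity names, not just the counts, so the log tells you what is actually missing.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical assertion helper: name the missing entities instead of only
// counting them, so a failure log is debuggable on its own.
public final class EntityAssertions {

    public static void assertEntitiesPresent(Set<String> expected, Set<String> actual) {
        Set<String> missing = new TreeSet<>(expected);
        missing.removeAll(actual);
        if (!missing.isEmpty()) {
            throw new AssertionError(
                "Expected " + expected.size() + " entities but received " + actual.size()
                + "; missing: " + missing + "; received: " + new TreeSet<>(actual));
        }
    }

    public static void main(String[] args) {
        // Fails with a message that names the gaps, not just the counts.
        assertEntitiesPresent(
            Set.of("ProductTitle", "Price", "DeliveryDate", "OfferId"),
            Set.of("ProductTitle", "Price"));
    }
}
```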

Five years after the original architectural solution was designed, we still didn’t have a good way to write integration tests against our APIs. I eventually realized you could just say that the API schema was underspecified, which hopefully makes sense.

It is hard to write a test against an API that doesn’t have a schema and does who knows how many things, many of which are legacy and conform to architectural ideas no one remembers.

The development environment is extremely important for developer happiness / retention 

How much of your life do you want to spend on development toil? Most of us prefer writing useful code over doing busywork that provides no long-term value. But unfortunately, it is easy to fall into traps where the development environment is bad, fixing it is hard, and people just put up with it until they can get another job.

Part of the problem is that Amazon has its own build system. It’s named after a South American country whose name starts with a B. You might be thinking of Bazel, but that is another build system, open-sourced by Google. B is a different one.

B has many issues, but the one that killed it for me was the integration between B and IntelliJ. Somehow we reached a point where the non-senior engineers on the team couldn’t run unit tests in IntelliJ anymore. I complained about this several times, and we never managed to fix it. Eventually the senior engineers who had set everything up all quit and joined other companies. I had given up on fixing it myself after a few attempts to figure out what the issue was. So there I am, making $200k a year, and I can’t even get unit tests to run in a debugger.

The moral of the story is: use Maven, Gradle, or Bazel. I’m not interested in spending significant amounts of my life investigating why a particular Java project will not build in IntelliJ. If you want to maintain your own build system, be my guest. I’m just not going to work on it.

The people who built that system were very good, highly compensated engineers. They just didn’t prioritize the developer environment, since there was a lot to do and the senior members of the team had already figured out the warts. The management team needs to focus on the development environment, because in the end problems there will manifest as turnover among new hires. If you have a lot of turnover at the one- and two-year marks, take a serious look at your development environment.

Two issues facing software in the 2020s

I have been thinking about the problems in software a bit lately. As I see it we have two core issues. The first is the diseconomies of scale in software: the more and bigger the software, the more it costs to develop. The second issue is software security and maintenance.

Diseconomies of scale are an issue because we keep writing more software, and the expectations of that software continue to rise. Smartphones started out with phone calls, texting, and surfing the web. Now your smartphone tracks your exercise, interacts with your smart home devices, handles payments, and more. As the smartphone becomes your main connection to the digital world its complexity increases, and at the same time the expectations for everything else increase.

I have four potted plants in my apartment. I could have smart water meters in each pot that track when I need to water those plants. In the past no one would have expected your plants to be smart. Today it is an obvious next step for people whose plants always die.

Since your smartphone is in many ways you, long term I expect things like door locks to transition from keys to phones. Cars have already made this transition. My car unlocks via a button press. Using the actual key requires me to first remove it from the dongle. 

Now a serious question is, “Why do you need a dongle to start your car?” Your phone has all the capabilities of the dongle. You don’t use the key anyway, what is the point of the dongle? Extend that question to everything in your life. 

Why not have a way to check on your phone if you left the oven on? 

Why not have your water heater hooked up to your phone so you know if your kids left you any hot water?

Why not have a GPS dog collar hooked up to your phone so you always know where your dog is?

Gradually everything moves to your phone, and as more things move to your phone you will begin to ask “Why isn’t this on my phone yet?” 

The phone ends up as an incredibly complex, absolutely essential part of your life. And as it does so it pushes everything else to connect to it. 

We are on a trajectory that yields an incredible amount of computerization. Who is going to write all that code? It is going to be incredibly expensive with today’s linear coders. 

Software maintenance is the second big issue. We keep building more and more software, but we aren’t getting any better at maintaining it. Software is a weird animal because it is never ‘done’. You can reach a point where software is feature complete and seems perfect. But not really: hackers could discover a vulnerability you must patch at any point going forward. If that happens, you need people who can update your software quickly. But if you reached feature complete ten years ago and laid off all the programmers, your project is now poisonous. No one can use it until you hire and train people to update that software.

Software seems like it can be completed, but that is a trap. Security is not the only risk. Operating System updates are another problem. Microsoft and Apple constantly update their systems and API requirements. Customer expectations change as the software marketplace changes. 

The final trap is what happens when you get off the treadmill. Say your software is complete, it works perfectly, and it has no vulnerabilities. So you let it sit with minimal updates and a skeleton crew to keep it running. Well, ten years from now no one will be willing or able to modify that software. Maybe it’s COBOL, or Java 8, or TypeScript, but the tech is now old. Any programmer who works on it will have to spend time learning technology that is worthless long term.

Software Leviathans and the weird dominance of good enough.

One day in Spring 1989, I was sitting out on the Lucid porch with some of the hackers, and someone asked me why I thought people believed C and Unix were better than Lisp. I jokingly answered, “because, well, worse is better.” We laughed over it for a while as I tried to make up an argument for why something clearly lousy could be good.

https://www.dreamsongs.com/WorseIsBetter.html

People have long wondered why Java took the crown as the ‘enterprise’ language. I can’t really weigh in on that topic, since I came onto the scene long after Java was all there was. This article is about why software leviathans are written in Java more than anything else.

You have a huge software project to build. What language do you build it in? The prototype was written in Ruby on Rails by one guy and an Adderall prescription. Now they want you to scale this thing to 1,000+ engineers over five years of development. You might think, “Aha, this is my chance, let’s save an order of magnitude in lines of code and use Lisp,” except this story happened in the past and they chose Java.

Why is it always Java? Sure, it’s reasonably fast, but Facebook made PHP work; can’t we at least use Haskell? Since we have the benefit of hindsight, we know that most of the biggest software systems are built in Java. Google built so many leviathans in Java that they bankrolled a new language like Java but with fewer features. Amazon is based on Java. Netflix is Java again. Facebook made their own language, and Microsoft, old enough to have existed before Java, still made their own version of it, C#.

The real question should be, “What is Java’s secret?”

Java requires a lot of boilerplate

Java just plays well with the major constraints in a software leviathan, and at leviathan scale that is all that matters.

This one is the corollary of “Java doesn’t support meta-programming”. Creating your own DSL is great; 1,000 engineers creating their own DSLs is 999 nightmares. Software leviathans are too big for any one engineering team to understand, and any DSL you create makes your code unintelligible to the rest of the people working in hell with you. I can understand boilerplate written by a monkey, but a DSL written by another software engineer could take me days to understand. And when your team gets poached to go work on a startup where the code base isn’t humongous, it’s a lot easier to bring in Java programmers to replace you lot than it would be to find Haskell engineers willing to figure out your undocumented dialect.
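To make the comparison concrete, here is a contrived contrast; both snippets are invented for illustration and are not anyone’s real code: the boilerplate version is tedious but a stranger can read it in one pass, while the DSL version is compact but meaningless until you learn the dialect.

```java
// Contrived example: the same validation written two ways.

// 1) Plain boilerplate: verbose, but readable by anyone who knows Java.
public final class OrderValidator {
    public static void validate(String productId, int quantity) {
        if (productId == null || productId.isEmpty()) {
            throw new IllegalArgumentException("productId is required");
        }
        if (quantity < 1 || quantity > 10) {
            throw new IllegalArgumentException("quantity must be between 1 and 10");
        }
    }
}

// 2) A home-grown DSL: compact, but now you have to learn what rule(), req(),
//    and range() mean, how they compose, and where failures surface at runtime.
//    Multiply this by a thousand teams, each with their own dialect:
//
//    Rules.of(req("productId"), range("quantity", 1, 10)).check(order);
```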

Google got to the point where they figured Java had too much meta-programming ability, so they created Go, which is basically Java without inheritance. That is what happens when you work on a leviathan project: you begin to resent your peers’ ability to do anything unusual, because you know it’s just going to be more work for you.

Adding more onboarding time to understand 1) the functional language and 2) the DSL your team created might push our already long six-month onboarding period closer to the one-year mark. I wrote an article about onboarding time and functional languages aimed at startups, but honestly I don’t think the hiring market is the real reason Java dominates the top end. FAANG companies are already willing to train new grads to work on their giant software projects.

Honestly, it boils down to comprehension. Humans can only comprehend so much, and at leviathan scale that maximum is a tiny fraction of the entire system.

In a software leviathan your team constantly works with other teams’ systems. How does this API work? There isn’t any documentation, and one 30-minute office-hours session isn’t going to explain that hairball. If you all use the same language, and that language is Java, there is a chance you can open up their code base and figure out what is going on. They probably didn’t do anything you wouldn’t expect, like pre-allocating all of their memory and storing every object in a ring buffer. But even if they did do something crazy, you can probably figure it out. Besides, Java doesn’t have anything like Scalaz, so you won’t be surprised by a functor where you weren’t expecting one.

Let’s take the opposite side: away-team work. You have been given the glorious task of implementing a new feature, but it’s impossible to do it cleanly without an API change in another team’s system. That team fully supports the change and has contributed two paragraphs to your architecture document describing the change to make in their system. But the change isn’t on their roadmap, so you are going to have to make it yourself.

Getting their service to run and pass integration tests on your virtual development machine takes a week. Now you need to navigate their system, where they have conveniently used dependency injection to ensure that you can’t tell which of the five implementations of a given interface is actually in play. Do you still wish the other team could use Clojure? You might never figure out the DSL.
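To see why that is painful, here is a contrived sketch of the pattern; the class names and the Guice-style module are invented for illustration: the call site names only the interface, and the binding that decides which implementation actually runs lives in a module somewhere else entirely.

```java
import com.google.inject.AbstractModule;

// Contrived sketch: callers depend only on the interface...
interface OfferResolver {
    String resolveOffer(String productId);
}

// ...but several implementations exist, and which one runs is decided by a
// binding in a DI module that may live in a different package, possibly
// selected by configuration you have never seen.
class LegacyOfferResolver implements OfferResolver {
    public String resolveOffer(String productId) { return "legacy-offer"; }
}

class ExperimentalOfferResolver implements OfferResolver {
    public String resolveOffer(String productId) { return "experimental-offer"; }
}

// (Imagine three more implementations here.)

class OfferModule extends AbstractModule {
    @Override
    protected void configure() {
        // Reading the call site tells you nothing; you have to find this line.
        bind(OfferResolver.class).to(ExperimentalOfferResolver.class);
    }
}
```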

Have you ever looked at somebody else’s Lisp code and wondered what was inside the variables? Now imagine that this is your job, and you will spend the next month making a 200-line change to a 100,000-line API service you didn’t know existed until this week. Except this will happen every quarter for the rest of your career.

People complain about how Java forces you to write out types everywhere, but for software leviathans this is a benefit. I can see helpful type signatures everywhere, whether I’m reading your code in my IDE, in an email, in an excerpt from an arch doc, or in a Slack message you sent me at 3 a.m.
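As a contrived illustration (the names below are invented, not real code): even pasted into a chat message with no IDE support, an explicitly typed signature tells you what goes in and what comes out.

```java
import java.util.List;
import java.util.Optional;

// Contrived example of a signature that documents itself even when pasted
// into an email or a Slack message.
public interface PromotionEligibilityService {

    // Without opening the implementation you can tell what goes in (a customer
    // and their cart items) and what comes out (the best promotion, if any).
    Optional<Promotion> findBestPromotion(CustomerId customer, List<CartItem> items);
}

// Minimal supporting types so the sketch stands on its own.
record CustomerId(String value) {}
record CartItem(String productId, int quantity) {}
record Promotion(String promotionId, int percentOff) {}
```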

Java and Go are great in software leviathans. You don’t have to worry about stumbling upon a programming mystery created ten years ago by a disgruntled new grad. You can expect consistent syntax and the same language in whichever microservice you are working on. The code has self-documenting types that are ‘easy’ to understand. Honestly, those are a lot of benefits, and they make a tough coding environment a little more manageable.

Software Leviathans

Diseconomies of scale, why FAANG pays high salaries, and the dominance of Java

The top end of software engineering jobs is dominated by what I’ve started thinking of as ‘software leviathans’: large software systems staffed by thousands of engineers. A few that come to mind are Amazon Alexa, Amazon.com, Google Search, Salesforce, and Facebook.com. These are not ‘monoliths’ or single large services that do everything. Instead they are the result of combining hundreds of smaller ‘microservices’ into one massive software product.

These leviathans do many, many things; few people on the planet can claim to know all of the features of Facebook.com. It is quite possible that no single list exists that enumerates every feature in that product.

Similarly, development on these systems happens in parallel across many teams. It is essentially impossible for any one person to keep track of everything that is being added to the system.

Leviathans are too big for anyone to understand. It doesn’t matter what architecture or runtime choices are made: it could be one massive JVM, a million Lambda functions, a hundred thousand Docker containers, or thousands of microservices. Even if you work on the leviathan, you won’t have any real understanding of the total state of the system. Each engineer will be aware of, and communicate with, a tiny fraction of the total number of people working inside the leviathan.

Leviathans are heterogeneous systems. They do not do ‘one thing well’; they do everything you can think of. Google.com is a search engine, but it’s also a calculator, an advertising system, a web scraper, a hotel booking tool, a flight booking tool, and much more. Leviathans grow in parallel, across myriad tentacles of functionality. New features emerge all the time, usually to the surprise of other engineers on the project.

Leviathans are difficult to work in. Despite appearing from the outside to be a sea of constant change, any change made inside the leviathan is extremely expensive in engineering hours. There are thousands of potential interactions each engineering team has to consider when evaluating changes to their system. The architecture must be heavily constrained to support parallel development in an environment where coordination between different teams is impossible due to scale. Engineers working on a software leviathan spend a relatively small fraction of their time actually writing code, compared to debugging issues, doing research, coordinating changes, and documenting.

Leviathans are interesting because they are the ‘core’ services powering the digital world these days. Their scale is at the top of the chart in the software engineering world, and as a result they expose the limitations of software engineering.

Software diseconomies of scale are at their most evident in these software leviathans. They are massive projects with huge numbers of the best engineers working on them, but per-engineer development is slow, and code quality is not clearly superior to industry best practices.