A few things I learned after ~3 years at Amazon

“Two is better than zero” is definitely harmful.

The phrase “two is better than zero”, attributed to Jeff Bezos, drives a lot of the technical dysfunction inside Amazon. I typically explain it to people by saying “Amazon has 3 internal tools for every job with one tool’s worth of documentation”, which is basically true: for every problem we have 0 to 3+ internal tools to tackle it. When our team ran into an issue we would review the internal solutions. Typically one would be deprecated, another would have poor documentation, a third would kind of fit our use case, and the AWS service would have too much latency.

I’ll describe a few cases to help convey the insanity. 

I worked in Alexa, which has the concept of a ‘Prompt’: some text that Alexa reads aloud to customers. There is a central tool which governs prompts, but it expects you to manually handle promoting prompts from dev to prod. So our team built a ‘prompt pipeline’ service which automates prompt promotions via the central service’s API. One of the engineers on our team got promoted off of that project. About six months later I got an email congratulating another engineer on his promotion, with the comment that he built a prompt pipeline for his team. It isn’t like our teams were super far apart from each other. The second team actually contributed code to the service my team operated. There was basically zero reason to build another prompt pipeline. We should have added the features to the central service for everyone.

The test accounts situation inside Amazon is another area of craziness. There are at least 3 different services that can create accounts. One of them actually works, so we depended on that one. Everything was great: we automatically created accounts and ran end to end tests. Months later we started to hit throttling limits on our account creation. We did some research, and it looked like our accounts were getting flagged for fraud and terminated. I reached out to a few fraud teams to find out what was happening. It turned out our test accounts looked a lot like accounts using stolen credit cards and were getting flagged for that reason. I asked if they could whitelist accounts made by the standard test account service we used. They said they didn’t have any support for that and recommended manually curating test accounts.

Amazon doesn’t have a tool like Splunk internally, so you can’t search across all your services and get logs that way. The industry standard is to have great log search, to the point that free tools like Kibana support it. Inside Amazon we have ~5 different tools that provide various levels of log viewing and searching. About half of them are tools to facilitate using grep to find errors. There is at least one service that supports log searching, but it is honestly rather difficult to figure out how to use it and it often doesn’t work.

Amazon has multiple infrastructure as code tools. We have CloudFormation and the CDK (Cloud Development Kit), which are AWS-sponsored public tools. But we also have a Ruby-based tool (no, not Chef) which does infrastructure as code. Then we have another tool that is based on YAML. By the end of my time at Amazon my team supported services with components defined across four different infrastructure as code tools.

The tooling situation inside Amazon isn’t great, I think mostly due to this “two is better than zero” philosophy. There are cases where it is awesome to just be able to build what you need, and my team took advantage of that. But over the years, when every tool you use sucks in multiple easily fixable ways, you start to realize the cost you are paying.

Document driven meetings are great 

Amazon takes document driven meetings seriously. There is always a ‘Doc’ hosted in Quip, which is a tool like Google Docs. Everyone on the team can read and comment on the document easily. The first 15-20 minutes of a meeting are devoted to reading the document. Then we review the comments and discuss the document during the rest of the meeting. It is great, and you never have to worry about getting people to read the document before the meeting. There are never any PowerPoints or scenarios where you have to listen to someone describe something at 1/5 the speed you could have read it. Another benefit is that you actually have some documentation for every architectural change or new project. If your organization relies on brainstorming and whiteboarding for architectural changes you may run into a scenario where those don’t end up documented. At Amazon that isn’t a problem, because you have to have a document if you want to share your new approach.

While great overall, Amazon’s document driven culture has some shortcomings. The first is that there is no standard way to capture meeting notes. If agreements were made during the meeting, the task of recording the decision was usually passed on to whoever wrote the document.

Another shortcoming is that no one figured out how to make ‘Agile’ meetings document driven. So we end up doing sprint planning, retrospectives and backlog grooming in the old inefficient way. 

A big benefit in my opinion is that all quarterly and yearly organization goals are included in a document, which your director and project managers will share with you. So you actually get to see written goals for your organization, as opposed to hoping the CEO/VP tells you what their latest idea was.

How to know when you really need Microservices.

You need microservices when you have too many developers to fit software through the deployment pipeline consistently. If you can’t manage a weekly deployment there are too many cooks in the kitchen. My team hit a couple of inflection points along the way from thirty engineers on the service to somewhere around 200. We had 300+ people in our support Slack channel, but there is no way to know if they all contributed code every release.

One inflection point was when our oncall rotation also became the ‘release’ engineer rotation. At that point we had a minimum of one person assigned to ‘operations’ at all times. Next we reached a point where we had to split out e2e tests into separate packages because our team couldn’t keep track of how all the different features were supposed to work in the service. 

Then we started keeping 2 people on ‘operations’ at all times because our ticket queue kept growing despite regular ‘bug bashes’. The last level we reached is the one where having 2 people assigned full time to operations was no longer enough. We were falling behind on deployments, mandatory migrations and our ticket queue. So we started outsourcing release management to another team. Then we started losing all of our best and most experienced people, which is how I ended up here.

Bad architectural solutions can ‘curse’ your organization for years. 

Architecture is really important. Seemingly simple decisions can have long term effects that aren’t obvious. In Alexa Shopping where I worked at Amazon, we operated inside a dynamic workflow engine. To make that happen every API in our department of 1200 people shared the same schema. To handle the needs of various components the schema included an ‘Envelope’ type which was an array of arbitrary JSON objects. 

There was a lot of tooling built around specifying what would be passed into a node. But the long run effect was that each API in our service shared the same schema in the form of a Java type, and each API customized that Java type in arbitrary ways via the envelope. This converted a strictly typed interface into a dynamically typed one. The types were still strict in practice, but the restriction was enforced in another service rather than by Java.
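To make the shape of the problem concrete, here is a minimal sketch. It is written in Python for brevity even though the real system was Java, and every name in it (`Request`, `ALLOWED_TYPES`, `validate`, the API and entity names) is made up for illustration, not the actual schema. The shared type only promises “a list of JSON objects”, while the real contract lives in a table that stands in for configuration held by another service.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical shared schema: every API accepts this one type. A type
# checker only sees "a list of arbitrary JSON objects" in the envelope.
@dataclass
class Request:
    request_id: str
    envelope: list[dict[str, Any]] = field(default_factory=list)

# Hypothetical stand-in for configuration that lives in ANOTHER service:
# the entity types each API is actually allowed to receive.
ALLOWED_TYPES = {
    "add_to_cart": {"Product", "Customer"},
    "reorder": {"Customer", "OrderHistory"},
}

def validate(api_name: str, request: Request) -> list[str]:
    """Return the envelope entity types that the given API does not accept."""
    allowed = ALLOWED_TYPES[api_name]
    return [
        entity.get("type", "<missing type>")
        for entity in request.envelope
        if entity.get("type") not in allowed
    ]

request = Request("r-1", [{"type": "Product", "id": "p-1"}, {"type": "OrderHistory"}])
print(validate("add_to_cart", request))  # ['OrderHistory']
print(validate("reorder", request))      # ['Product']
```

The same `Request` value is valid for neither API, yet nothing in the type itself says so. You only find out by consulting the external table, which is roughly the position our test authors were in.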

The envelopes solved a lot of problems, but one thing they made harder was integration tests. Integration or functional tests in this context means tests against your service with the request mocked. We found that software engineers relied heavily on end to end tests but almost no one bothered with or trusted the integration tests. The reason came down to the DSL (domain specific language) used to write our end to end tests vs our integration tests.

The DSL for end to end tests used natural language to trigger Alexa functionality. You would just write what the customer would say to Alexa then declare what responses you wanted back. It took a lot of work to support, but our end to end tests were easy to write and caught a lot of bugs. They were hard to debug in a lot of cases but still easier than our integration tests. 

The DSL for integration tests, on the other hand, was much harder to understand. The sticking point was all the work required to create the inputs and expected outputs for our APIs. As mentioned above, each API input included an arbitrary JSON array whose types were defined in configuration in another service. And as our APIs were very large, it was very hard to figure out what you needed in your input request for an integration test.

Our APIs also supported multiple flows through the workflow engine, each of which would receive a different subset of input types. So one functional test against your API would receive a certain subset, and the next test would need a different subset of types to trigger a different piece of functionality. For the average software engineer or partner it was very difficult to figure out what exactly you needed to do differently in your integration test compared to a pre-existing one.

I addressed the problem by simply not writing integration tests at all. Of course I still had to debug them when I was supporting the pipeline, which was quite difficult. Tests were documented via name only. The failure log typically contained only “Test y failed, expected 4 entities, but received only 2”. Which entities are missing? We don’t know. We also didn’t know which entities were passed into the test. Figuring that out required reading through several thousand lines across multiple Java classes that set up partial mocking. Another tip: never run an integration test system that only mocks some of its dependencies.
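For contrast, here is a minimal sketch of the kind of failure message that would have saved hours. It is Python rather than the Java we actually used, and the helper `describe_entity_mismatch` and the entity names are hypothetical. The idea is just to report what went in, what is missing and what is unexpected, instead of two counts.

```python
from collections import Counter

def describe_entity_mismatch(test_name, inputs, expected, actual):
    """Build a failure message naming the missing and unexpected entities."""
    # Counter subtraction keeps only positive counts, so duplicates are handled.
    missing = Counter(expected) - Counter(actual)
    unexpected = Counter(actual) - Counter(expected)
    return (
        f"Test {test_name} failed: expected {len(expected)} entities, "
        f"received {len(actual)}.\n"
        f"  inputs:     {sorted(inputs)}\n"
        f"  missing:    {sorted(missing.elements())}\n"
        f"  unexpected: {sorted(unexpected.elements())}"
    )

message = describe_entity_mismatch(
    "y",
    inputs=["Customer", "Product"],
    expected=["Address", "Customer", "Price", "Product"],
    actual=["Customer", "Product"],
)
print(message)
```

With a message like this, “expected 4, received 2” becomes “Address and Price never showed up, and here is what the test was given”, which at least tells you where to start looking.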

Five years after the original architectural solution was designed we still didn’t have a good way to write integration tests against our APIs. The simplest way I found to describe the problem is that the API schema was underspecified, which hopefully makes sense.

It is hard to write a test against an API that doesn’t have a schema and does who knows how many things, many of which are legacy and conform to architectural ideas no one remembers.

The development environment is extremely important for developer happiness / retention 

How much of your life do you want to spend working on development toil? Most of us prefer to write useful code over doing busy work that provides no long-term value. Unfortunately, it is easy to fall into traps where the development environment is bad but fixing it is hard, so people just put up with it until they can get another job.

Part of the problem is that Amazon has its own build system. It’s named after a South American country the name of which starts with a B. You might be thinking of Bazel but that is another build system open sourced by Google. B is a different build system. 

B has many issues, but the one that killed it for me is the integration between B and IntelliJ. Somehow we reached a point where the non-senior engineers on the team couldn’t run unit tests in IntelliJ anymore. I complained about this several times and we never managed to fix it. Eventually the senior engineers who set everything up all quit and joined other companies. I’d given up on fixing it myself after a few attempts to figure out what the issue was. So there I was, making 200k a year, and I couldn’t even get unit tests to run in a debugger.

The moral of the story is use Maven, Gradle or Bazel. I’m not interested in spending significant amounts of my life investigating why a particular Java project will not build in IntelliJ. If you want to maintain your own build system, be my guest. I’m just not going to work on it.

The people who built that system were very good, highly compensated engineers. They just didn’t prioritize the developer environment since there was a lot to do, and the senior members of the team had already figured out the warts. The management team needs to focus on the development environment because, in the end, problems there will manifest as turnover among new hires. If you have a lot of turnover at the one and two year mark, take a serious look at your development environment.
