The core problem of micro services – how many feature can you fit through the pipeline before it breaks down?

At Amazon I had the chance to watch a monolithic service reach the point where it had to be split up into microservices. When I started we had about thirty software engineers contributing to our service. We had the great idea of developing a framework to speed up feature delivery for everyone in Alexa Shopping. The framework ended up working and the number of people contributing code to our service shot up over the next two years. As of July 2021 we had around 200-300 people in our partner support slack channel.

What happened is that we gradually spent more and more time supporting tests in our CD pipeline. First we had an oncall who did operational support and pipeline release support. Then once that person got swamped and complaints about release frequency got louder we added a second person to operational support. Then we had one person doing releases and outages with a second doing partner pull requests and office hours. 

Growth continued during this time and we attempted a number of changes to federate features and split out responsibility. We split end to end tests into separate suites so it would be easier to find out who knew what a feature was supposed to be doing. This helped a lot, prior to federating the test suites our oncalls would spend a lot of time deep diving end to end tests to figure out what was going on. Afterwards it was a lot easier to find out who we could ask for help. 

One thing that was a huge failure was expecting other teams to debug and fix their end to end tests in the pipeline. Typically, team A has a launch deadline and it asking us to deploy now. We get their code to the staging environment and we see test failure from teams B, C, and D. Teams B, C and D do not have a launch deadline so they are not prioritizing fixing their end to end tests. 

Another big failure was splitting each teams rules into a different repository. We ended up with ten repositories with one relevant file in each. It could have just been a folder in the original project. Plus it was a lot harder to figure out where to put things with ten different projects. The nail in the coffin from my perspective was that the rules were still deployed together. 

One final issue was integration tests. We wanted people to write integration tests, but everybody (including myself) avoided it as much as possible. The reality was the DSL for our end to end tests was significantly better than for integration tests. It was hard to reason how you were testing your specific feature at the API level. But in the end all our feature changes were directly testable at the end to end test level. It was just a lot quicker and easier to write an end to end test than API level test.

Finally, we reached the point where we had to split up the monolith purely because the pipeline was blocking too many teams. It was a risk because in the short term we knew it would increase operational overhead and delay needed upgrades. Unfortunately, the team lost about 50% of its people especially the most experienced ones. And I was one of them so I don’t know how things ended up.