People have extended the goal of building micro service systems such that we aren’t locked into a language to a new goal of every repository being able to use any language.
It’s phrased as ‘shouldn’t we use the best language for the job.‘ But the implication is that we should be able to pick a language for this new project independent of the rest of our existing software stack.
The problem is that it’s not a free choice. Even if you’re in an ideal microservices environment where everything is API based, logging and metrics are handled by the platform and shared libraries are banned there are costs to having more languages in play.
The most basic cost is expertise splintering. If you only use one language and everyone knows it. Then everyone can read and extend all the code. But as you start throwing more languages and paradigms in that slowly becomes impossible.
Polyglot makes away team work more difficult. And it does it slowly so you won’t really notice until it’s too late.
Shared libraries are another area polyglot sucks. You can’t always make a library an API.
Whether it’s because you can’t externalize metrics and logging or you have high performance business logic you need to run locally. Often times you will need to duplicate shared libraries across languages.
Polyglot is nice in a lot of ways. I’d rather get to use Pandas for data science if I had the opportunity. We don’t really need to use only Java for everything. But when it comes to application code, I’d rather everyone used the same language, framework and RPC schemas. It doesn’t make any sense to write certain microservices in Java, others in Haskell, some in Python and a few in Typescript. If they are RPC or event based services you are just using five languages to do the same job instead of one language to do one job.
Then go ahead and write sidecars in Go or Rust, frontends in Typescript and data science stuff in Python. Those are all different use cases with different trade offs.
Away team work is a critical component of high performance software organizations. It is a way for high priority teams to work around other teams’ manpower constraints to deliver software. Without a well established culture of away team project work your organization will default to a standard of ‘shut up and wait while your item is in the backlog’.
What is away team work exactly? Away team work is when your team implements a feature or integration in another team’s codebase. Strictly speaking in away team work the host team does not review or approve pull requests. Your team’s senior engineers will get approval at an architectural level then handle pull request review themselves.
The Away team process allows your team to unblock itself when dependencies don’t have manpower to spare. The key component is first that your team not the dependencies team does the work. Both the implementation, testing and code review. Some effort obviously is required of the host team, but it should be minimized.
Away team work acts as a release valve on the conflict between the host team’s priorities and potential client’s goals of delivering features to customers. If the host doesn’t have manpower available your team can provide the engineers to make it happen.
Without away team work your organization will have to make more tradeoffs on the scheduling side. More often than not you’ll have to pick a migration over the new feature. But Away team work would have enabled your organization as a whole to deliver both.
What do you need to make away team work happen?
The key thing is that you have to make a formal process that specifies the maximum standards. You can’t allow teams to be squeamish about it. If the requirement is that two senior engineers approve that’s fine. Or maybe only senior engineers can do away team work in your company. Thats fine but you need a formal standard to avoid negotiations happening on a per project basis.
At Amazon I had the chance to watch a monolithic service reach the point where it had to be split up into microservices. When I started we had about thirty software engineers contributing to our service. We had the great idea of developing a framework to speed up feature delivery for everyone in Alexa Shopping. The framework ended up working and the number of people contributing code to our service shot up over the next two years. As of July 2021 we had around 200-300 people in our partner support slack channel.
What happened is that we gradually spent more and more time supporting tests in our CD pipeline. First we had an oncall who did operational support and pipeline release support. Then once that person got swamped and complaints about release frequency got louder we added a second person to operational support. Then we had one person doing releases and outages with a second doing partner pull requests and office hours.
Growth continued during this time and we attempted a number of changes to federate features and split out responsibility. We split end to end tests into separate suites so it would be easier to find out who knew what a feature was supposed to be doing. This helped a lot, prior to federating the test suites our oncalls would spend a lot of time deep diving end to end tests to figure out what was going on. Afterwards it was a lot easier to find out who we could ask for help.
One thing that was a huge failure was expecting other teams to debug and fix their end to end tests in the pipeline. Typically, team A has a launch deadline and it asking us to deploy now. We get their code to the staging environment and we see test failure from teams B, C, and D. Teams B, C and D do not have a launch deadline so they are not prioritizing fixing their end to end tests.
Another big failure was splitting each teams rules into a different repository. We ended up with ten repositories with one relevant file in each. It could have just been a folder in the original project. Plus it was a lot harder to figure out where to put things with ten different projects. The nail in the coffin from my perspective was that the rules were still deployed together.
One final issue was integration tests. We wanted people to write integration tests, but everybody (including myself) avoided it as much as possible. The reality was the DSL for our end to end tests was significantly better than for integration tests. It was hard to reason how you were testing your specific feature at the API level. But in the end all our feature changes were directly testable at the end to end test level. It was just a lot quicker and easier to write an end to end test than API level test.
Finally, we reached the point where we had to split up the monolith purely because the pipeline was blocking too many teams. It was a risk because in the short term we knew it would increase operational overhead and delay needed upgrades. Unfortunately, the team lost about 50% of its people especially the most experienced ones. And I was one of them so I don’t know how things ended up.