Pipelines should automatically reject change-sets that fail the integration tests.

CI/CD is magical when it works. Write your code, open a pull request, get approvals and merge. Then the pipeline takes care of everything from then on. Integration tests are run end to end and UI tests pass. Finally, your change is deployed via a canary release into production.

My team runs a full CI/CD pipeline, after a Pull Request is approved and merged it is deployed through several stages of integration tests and deployed to production. This pipeline gets a lot of use with around 10 engineers working on it at any given time. Unfortunately, because of how our infrastructure works, integration test failures often break the pipeline.

Changes do not proceed through the pipeline independently, instead commits build up in front of slower parts of the pipeline, particularly our integration and end-to-end tests. During busy times several pull requests will be working their way through the pipeline together.

Our integration tests can fail for many reasons other than code changes. A partner team could have a failed deployment that is upstream of us, a configuration file could have been changed, a test account could have expired, and more. When the pipeline does fail, we have to track down what caused the failure. If it was a code change we will revert the commit, otherwise we typically have to work with another team to get their systems working.

When a pull request breaks the integration or end-to-end test the pipeline will block at that stage. Then we will revert the changes and redeploy the master branch. In the worst cases several commits will have to be reverted and redeployed into the pipeline before it will start working again.

Fixing the pipeline will often take an entire day as the fix has to move through each stage in the pipeline until it reaches the one that failed. Then there will be a flood of changes as all the commits that built up at the failing stage move onward.

I would like to fix this, but doing so will likely make the pipeline slower on average or require changes to how our Pull Request workflow functions.