[This is a draft from my in-progress handbook on running software in production]
Load testing helps us estimate how much load our system can handle. That estimate helps us decide how much hardware we need, whether we need to switch to a more performant architecture, and how to prepare for a peak event like a big sale.
Natural vs Synthetic traffic
When we load test, the traffic has to come from somewhere. If we want to load test in preparation for a 10x peak of traffic during a sale, we need to get 10x our regular traffic from somewhere. An easy way to do that is to write a script that makes requests to our website. This is called synthetic traffic. It’s typically very predictable, or it targets a subset of the system’s functionality. Developing synthetic traffic is pretty expensive in engineering time, but it is also straightforward: you just write some code and prepare some test accounts.
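A minimal sketch of such a script, assuming Python and only the standard library. The names `http_get` and `run_load` are made up for illustration, and a real load tool adds pacing, ramp-up, and reporting on top of this.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_get(url: str) -> int:
    """Send one GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.status


def run_load(send, url: str, total_requests: int, workers: int) -> dict:
    """Call send(url) total_requests times from a pool of worker threads.

    Returns counts of successes and failures plus the elapsed wall time.
    """
    def one_request(_):
        try:
            return 200 <= send(url) < 400
        except Exception:
            # Timeouts, connection errors, and HTTP errors all count as failures.
            return False

    started = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one_request, range(total_requests)))
    elapsed = time.monotonic() - started

    ok = sum(results)
    return {"ok": ok, "failed": total_requests - ok, "seconds": elapsed}
```

You could call it as `run_load(http_get, "http://localhost:8080/", total_requests=100, workers=10)`, where the URL is a placeholder for an environment you are allowed to load. Passing `send` in as a parameter keeps the loop testable without a network.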
Natural traffic is real traffic generated by customers who want to use your product. This traffic is great for testing your service because it is a 100% real example of your load. The issue with natural traffic is that it cannot be scaled easily. If you have a non-peak load of 100 req/s (requests per second), a load test is typically preparing you for a situation where you get 1000 req/s. Do you really want to use customers to test whether your service works with 10x as many users? No, because that is expensive and you’d rather have those customers buy things on a working site.
Squeeze tests
A squeeze test is a way to use natural traffic to load test your services. You modify your load balancer to gradually route all of your traffic to a single host. This only works if you run a multi-host system, and it carries the risk of causing an outage for actual customers. The last limitation is that the load you can put onto that host is capped by your current natural traffic. If you have 100 req/s of natural traffic, a squeeze test will only allow you to test 100 req/s on a single host. That’s great if you need 3 hosts to serve 100 req/s, but if one host can handle 200 req/s, a squeeze test doesn’t help. At bigger scales, squeeze tests are very effective for evaluating how much a host can handle.
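The arithmetic behind a squeeze test can be sketched in a few lines. This assumes the load balancer starts from an even split and shifts traffic to one host in equal steps. The function names are hypothetical, and a real load balancer would express this as routing weights.

```python
def squeeze_schedule(total_rps: float, host_count: int, steps: int) -> list:
    """Return the load (req/s) on the squeezed host at each step.

    Step 0 is an even split across all hosts; the final step sends
    all natural traffic to the one host.
    """
    even_share = 1 / host_count
    schedule = []
    for step in range(steps + 1):
        share = even_share + (1 - even_share) * step / steps
        schedule.append(total_rps * share)
    return schedule


def can_reach_limit(total_rps: float, suspected_host_limit_rps: float) -> bool:
    """A squeeze test can only reach a host's limit if natural traffic is at least that high."""
    return total_rps >= suspected_host_limit_rps
```

With 100 req/s spread over 4 hosts and 3 steps, `squeeze_schedule(100, 4, 3)` walks the squeezed host through 25, 50, 75, and 100 req/s. `can_reach_limit(100, 200)` is false, which is the case above where one host can handle 200 req/s and the squeeze test doesn't help.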
Replay Traffic
Replay traffic is when you record actual customer requests and then replay them later to load test your services. You do need a way to flag replay traffic to ensure it doesn’t actually change customer accounts or place orders. This can be done with an extra field on each request that basically says “replay:true”, or you can run replay tests in a beta environment that is distinct from production.
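Here is a rough sketch of the flag approach, assuming requests are plain dictionaries. The field name `replay` and the toy `place_order` handler are illustrative only. In a real system the flag would more likely travel as a header, and every service with side effects would need to honor it.

```python
import copy

REPLAY_FIELD = "replay"  # hypothetical field name, following the "replay:true" idea


def mark_as_replay(recorded_request: dict) -> dict:
    """Return a copy of a recorded request with the replay flag set."""
    request = copy.deepcopy(recorded_request)
    request[REPLAY_FIELD] = True
    return request


def place_order(request: dict, orders: list) -> str:
    """Toy order handler that skips the side effect for replayed requests."""
    if request.get(REPLAY_FIELD) is True:
        return "skipped"
    orders.append(request["item"])
    return "placed"
```

Copying the recorded request before flagging it keeps the original recording untouched, so the same capture can be replayed more than once.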
Artificial Traffic
Artificial traffic is generated programmatically. It is easy to get started, but the hard part is typically covering all of your test cases: you quickly end up with lots of code to model all the possible uses of your system.
You would think that if you had integration or end-to-end tests, you could just run them really fast for load testing. If you are in that situation, take advantage of it, but load testing often has slightly different needs than integration tests that verify functionality. One problem that pops up when you try to repurpose your integration tests is that you need more user accounts to hit your req/s target.
Test Accounts
These are just artificially made accounts that are used as the identity of scripted artificial traffic. Many systems behave differently for customers who are logged in than for customers using the website anonymously. Depending on what you are preparing for, you might want to focus on one or the other. You may also want to configure your test accounts in various ways.
At my first job we ‘created’ test accounts by filling out an Excel spreadsheet with the accounts we needed and sending it to the IT department to actually create the accounts. Unfortunately, the IT department wasn’t able to set all of the properties we needed, so we still had to manually configure our test accounts once we got them.
Test accounts sometimes expire. At my current job, test accounts usually expire after 30 days. We use these accounts constantly, so account expiration meant our integration tests were failing every week on some new account that had expired. Keep your eye on expiration times: you don’t want to spend weeks preparing 10,000 test accounts for the big load test and then have them all expire because the date was pushed back a week.
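A quick check like the following can catch this before the test date moves. It is a sketch that assumes you know when each account was created and that accounts live for a fixed 30 days. The function name and the account data shape are made up.

```python
from datetime import date, timedelta

ACCOUNT_LIFETIME = timedelta(days=30)  # the 30 days mentioned above; adjust to your own policy


def expired_by(accounts: dict, test_day: date) -> list:
    """Return names of accounts that expire on or before test_day.

    accounts maps an account name to the date it was created.
    """
    return sorted(
        name
        for name, created in accounts.items()
        if created + ACCOUNT_LIFETIME <= test_day
    )
```

Run it once for the planned date and again whenever the date slips. Any account it returns needs to be recreated or renewed before the load test.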
Test account pools and generators
Having pools of test accounts is a common pattern. These are the accounts we prepared for feature A, those are the accounts for feature B, etc. It’s a good strategy, but it can get away from you as you start to have dozens of account pools for different purposes.
Test account generators are either scripts or, preferably, APIs that create test accounts on request. Ideally, you want the ability to create a test account with any set of attributes your website supports. A big trap for these systems is when they allow you to create accounts programmatically, except for xyz functionality, which has to be configured by hand. The goal is to be able to set up accounts automatically on demand. If you can do that, test accounts can become ephemeral and you don’t have to worry about them expiring.
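One way to picture ephemeral accounts is a helper that creates an account for the duration of a test and deletes it afterwards. Everything here is hypothetical: `InMemoryAccountApi` stands in for whatever account-creation API you actually have, and the attribute names are examples.

```python
import uuid
from contextlib import contextmanager


class InMemoryAccountApi:
    """Stand-in for a real account service; replace with your own client."""

    def __init__(self):
        self.accounts = {}

    def create(self, attributes: dict) -> str:
        account_id = f"test-{uuid.uuid4().hex[:8]}"
        self.accounts[account_id] = dict(attributes)
        return account_id

    def delete(self, account_id: str) -> None:
        self.accounts.pop(account_id, None)


@contextmanager
def ephemeral_account(api, **attributes):
    """Create a test account with the given attributes and delete it afterwards."""
    account_id = api.create(attributes)
    try:
        yield account_id
    finally:
        # Runs even if the test body raises, so accounts don't pile up.
        api.delete(account_id)
```

A test would then wrap its work in `with ephemeral_account(api, plan="premium") as account_id:` and never deal with a long-lived pool or an expiration date.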
When to load test
The best times to load test are in preparation for peak events and when launching new features.
Peak events are times that you know your system will experience high traffic. It could be a marketing campaign launching on March 14th. It could be that you have seasonal traffic leading up to the 4th of July. You might have a new enterprise client onboarding 10,000 employees next month.
Feature launches can affect the performance of your system. Doing a bit of load testing before launching new functionality is useful to avoid gotchas.
Running Load Tests
Depending on which type of load test you are running, you will want to prepare differently. For a lightweight test like a squeeze test, you might just configure it to happen in your deployment pipeline. If it’s a big test for a peak event, you might want to organize a Gameday.
Gamedays
For big load tests, you want to organize a bit ahead of time. It’s important to notify people who are on call or supporting the website so that they know what’s going on. If you don’t let them know, people may end up panicking while trying to find out where all this load is coming from. “Are we under a DDoS attack?!”
You want to avoid running your big load tests during busy times for your website. That might mean the load test happens at night. Try not to start too late or your people will be working after midnight.
Load tests like this will often find the weakest link in the system. If Service C is an essential part of the chain and can only handle 200 req/s, the load test can’t go past that. So you may end up hitting a failure point, ending the load test early, and going back to fix the weak point. Then you need to set up another Gameday.
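The weakest-link effect is just a minimum over the chain. This sketch assumes every request passes through each service once. The capacities for Service A and Service B are invented for the example, and only the 200 req/s for Service C comes from the text above.

```python
def bottleneck(capacities: dict) -> tuple:
    """Return (service, req/s) for the lowest-capacity service in a serial chain."""
    name = min(capacities, key=capacities.get)
    return name, capacities[name]


# Hypothetical per-service limits in req/s.
chain = {"Service A": 900, "Service B": 500, "Service C": 200}
weakest_service, max_rps = bottleneck(chain)
```

Here the whole chain tops out at Service C’s limit, no matter how much headroom the other services have.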