Argus
Stress tests for incident reporting API
Our production deployment tends to choke when storms of incidents are posted to it: the K8s load balancer returns a "502 Bad Gateway" once the backend is no longer processing the incoming requests.
In production this is usually solved by adding more replicas for the load balancer to route requests to, but it is unclear whether the problem lies in the load-balancing/scheduling mechanisms of K8s itself, or whether Argus has a real problem handling large numbers of requests.
I therefore propose that we add an integration test that stress tests the external API by submitting large numbers of test incidents and verifying that they are actually completely persisted to the database.
(There are also some indications that the incident endpoint may have transactional problems: some clients report that their incidents are saved, but the tags submitted in the same request payload are not.)
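To make "completely persisted" concrete, a check along the following lines could compare what was submitted with what was read back. This is only a sketch: the helper name `missing_tags` and the payload shape (a `tags` list of `{"tag": "key=value"}` objects) are assumptions for illustration, not a confirmed description of the Argus API.

```python
def missing_tags(submitted: dict, stored: dict) -> set:
    """Return the tags that were submitted but are absent from the stored incident.

    Both arguments are assumed (hypothetically) to carry a "tags" list
    of {"tag": "key=value"} objects.
    """
    sent = {t["tag"] for t in submitted.get("tags", [])}
    saved = {t["tag"] for t in stored.get("tags", [])}
    return sent - saved


submitted = {"description": "x", "tags": [{"tag": "a=1"}, {"tag": "b=2"}]}
stored = {"description": "x", "tags": [{"tag": "a=1"}]}
print(missing_tags(submitted, stored))
```

An empty result means every submitted tag was stored; a non-empty result would reproduce the partial-save symptom clients have reported.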
- [ ] Send fake incidents (tagged as stress tests) in a loop: one every x seconds, for y minutes, or until aborted.
- [ ] Verify that each stored incident is complete, including the tags submitted with it
- [ ] The stresstest command should report the number of incidents stored per second (averaged over the runtime)
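
The tasks above could be sketched roughly as follows. Everything here is an assumption for illustration: `run_stress_test`, `StressResult`, the `submit` callable (which would post one incident and return whether it was verified as stored), and the payload shape including the `purpose=stresstest` tag. The clock and sleep functions are injectable so the loop can be tested without waiting.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StressResult:
    sent: int
    stored: int
    elapsed: float

    @property
    def stored_per_second(self) -> float:
        # Average over the whole runtime.
        return self.stored / self.elapsed if self.elapsed > 0 else 0.0


def run_stress_test(
    submit: Callable[[dict], bool],
    duration: Optional[float] = None,
    interval: float = 0.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> StressResult:
    """Send fake incidents until `duration` seconds pass or Ctrl-C is hit.

    `submit` is a hypothetical callable that posts one incident and
    returns True if it was confirmed as completely stored.
    """
    start = clock()
    sent = stored = 0
    try:
        # duration=None means "run until aborted".
        while duration is None or clock() - start < duration:
            payload = {
                "description": f"Stress test incident {sent}",
                "tags": [{"tag": "purpose=stresstest"}],  # assumed tag format
            }
            sent += 1
            if submit(payload):
                stored += 1
            if interval > 0:
                sleep(interval)
    except KeyboardInterrupt:
        pass  # aborted by the user; still report what was measured
    return StressResult(sent, stored, clock() - start)


if __name__ == "__main__":
    # Stand-in submit function that always succeeds; a real one would call the API.
    result = run_stress_test(lambda payload: True, duration=0.1)
    print(f"{result.stored}/{result.sent} stored, "
          f"{result.stored_per_second:.1f} incidents/s")
```

Keeping the HTTP call behind `submit` separates the timing logic from the transport, so the same loop could drive a single client or be run from several processes to increase load.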