
Stress tests for incident reporting API

Open lunkwill42 opened this issue 2 years ago • 2 comments

Our production deployment tends to choke when storms of incidents are posted to it, i.e. the k8s load balancer returns a "502 Bad Gateway" when the backend is no longer processing the incoming requests.

This is usually solved in the production deployment by adding more replicas for the load balancer to route requests to, but it is unclear whether the problem lies in the load-balancing/scheduling mechanisms of K8s itself or whether Argus has a real problem handling large numbers of requests.

I therefore submit that we need to add an integration test that stress tests the external API by submitting large numbers of test incidents and verifying that they are actually completely persisted to the database.

(There are also some indications that we may have transactional problems in the incident endpoint, as some clients report that their incidents are saved but the tags submitted in the same request payload are not.)

lunkwill42, Jan 10 '23 12:01

  • [ ] Send fake incidents (tagged as stress test) in a loop, either every x seconds, for y minutes, or until aborted.
  • [ ] Test that the stored incidents are complete (e.g. that the tags submitted with them were saved too)
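
A minimal sketch of what these two checklist items could look like, not a proposal for the final command. It deliberately makes no assumptions about the Argus API: `post` and `fetch` are hypothetical callables the caller supplies (e.g. thin wrappers around an HTTP client), and the payload fields (`description`, `tags`) and the `STRESS_TAG` value are illustrative placeholders only.

```python
import itertools
import time
import uuid

# Hypothetical tag used to mark fake incidents; the real tag is undecided.
STRESS_TAG = "problem_type=stresstest"


def make_fake_incident(seq):
    """Build a fake incident payload. Field names are assumptions."""
    return {
        "description": f"Stresstest incident {seq} ({uuid.uuid4()})",
        "tags": [STRESS_TAG, f"seq={seq}"],
    }


def send_incidents(post, duration=None, interval=0.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Post fake incidents until `duration` seconds pass or Ctrl-C is hit.

    `post` is a caller-supplied callable that submits one payload.
    With duration=None the loop only stops when aborted.
    Returns the list of payloads that were submitted.
    """
    sent = []
    start = clock()
    try:
        for seq in itertools.count():
            if duration is not None and clock() - start >= duration:
                break
            payload = make_fake_incident(seq)
            post(payload)
            sent.append(payload)
            if interval:
                sleep(interval)
    except KeyboardInterrupt:
        pass  # "until aborted": stop cleanly and report what was sent
    return sent


def find_incomplete(sent, fetch):
    """Return descriptions of sent incidents that are missing or lack tags.

    `fetch` is a caller-supplied callable returning the stored incidents
    as dicts with the same (assumed) fields as the payloads.
    """
    stored = {i["description"]: set(i.get("tags", [])) for i in fetch()}
    problems = []
    for payload in sent:
        tags = stored.get(payload["description"])
        if tags is None or not set(payload["tags"]) <= tags:
            problems.append(payload["description"])
    return problems
```

Injecting `post`, `fetch`, `clock` and `sleep` keeps the loop testable without a running backend; the unique description is only there so that each submitted incident can be matched against what was stored.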

hmpf, Apr 17 '23 08:04

  • [ ] The stresstest command should report the number of incidents stored per second (averaged over the runtime)
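
As a rough illustration of that metric, assuming the command can count the stored stress-test incidents after the run: the average is simply the stored count divided by the elapsed wall-clock time. The names `run` and `count_stored` below are hypothetical callables, not existing Argus functions.

```python
import time


def measure_throughput(run, count_stored, clock=time.monotonic):
    """Return the average number of incidents stored per second.

    `run` executes the stress test; `count_stored` returns how many of
    the submitted incidents ended up stored. Both are caller-supplied.
    """
    start = clock()
    run()
    elapsed = clock() - start
    stored = count_stored()
    # Guard against a zero-length run to avoid dividing by zero.
    return stored / elapsed if elapsed > 0 else 0.0


def format_report(rate):
    """Render the rate the way a command might print it."""
    return f"{rate:.1f} incidents stored per second (average)"
```

Counting stored incidents rather than submitted ones matters here, since the concern in this issue is precisely that the two numbers may differ under load.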

lunkwill42, Sep 18 '23 07:09