Load test Single Cluster Reference Architecture
Summary
Scale test the production reference architecture so we can report how far it can scale.
Context
We are using our reference architectures as the way to communicate what Gitpod can and cannot do given a specific set of infrastructure. It makes sense to also talk about scale in this context.
Value
- Confidence in how much scale the production reference architecture can handle
Acceptance Criteria
- We have a measure of how far the reference architecture can scale that is relevant to our users. E.g. it could be the number of workspaces of size X (given N nodes)
Measurement
- We have a metric that tracks scale, and we can re-run these tests to see if something has changed.
Implementation Ideas
- You can use loadgen for this: https://github.com/gitpod-io/gitpod/tree/main/dev/loadgen. We use this scenario to create 55 workspaces running some stress tasks (see the sketch below).
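For anyone picking this up, roughly what that could look like in practice. This is a sketch only: the exact loadgen subcommand, flags, and config field names are assumptions from memory of the repo, so check the loadgen README before running.

```sh
# Clone the repo and switch to the load generator
git clone https://github.com/gitpod-io/gitpod.git
cd gitpod/dev/loadgen

# The benchmark scenarios live next to the code; prod-benchmark.yaml is the one
# mentioned above (workspace count, repos, and stress tasks are defined there)
less prod-benchmark.yaml

# Build and run the benchmark against the target cluster
# (subcommand/flags per the loadgen README; this invocation is an assumption)
go build -o loadgen .
./loadgen benchmark prod-benchmark.yaml
```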
Additional Context
- Internal slack message
This is likely not just a team self-hosted thing but a Gitpod thing :)
We should move this to scheduled once we have the Terraform scripts (#11027) and maybe a werft command to spin up a self-hosted environment.
👋🏼
Update:
This week I've worked on the EKS side of the single-cluster reference architecture. This required working on #12577, as we need to scale the cluster based on the number of workspaces. Once that change was merged, loadgen just worked as you would expect.
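For context, what this looks like on EKS in practice (not necessarily what #12577 changed): the workspace node group needs enough headroom for the cluster autoscaler to add nodes as loadgen starts workspaces. A minimal sketch with eksctl; cluster and node group names are placeholders.

```sh
# Give the workspace node group room to grow so the cluster autoscaler can
# add nodes as loadgen starts workspaces (names are placeholders)
eksctl scale nodegroup \
  --cluster gitpod-reference \
  --name workspaces \
  --nodes 1 \
  --nodes-min 1 \
  --nodes-max 20
```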
Because the images live in the gitpod-dev registry (which is private), we used the recently added #12174 to pass Docker credentials into the cluster. :)
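For reference, the hand-rolled equivalent of getting a private registry credential into the cluster is a standard image pull secret; whether this matches what #12174 does internally is an assumption. Server, credentials, and namespace below are placeholders.

```sh
# Create an image pull secret for the private registry so workspace images
# can be pulled (server, credentials, and namespace are placeholders)
kubectl create secret docker-registry gitpod-dev-registry \
  --docker-server=eu.gcr.io \
  --docker-username=_json_key \
  --docker-password="$(cat gcr-key.json)" \
  --namespace=default
```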
Result: We started with the config in prod-benchmark.yaml, which spins up 100 workspaces. We found a success rate of around 0.95; the remaining 5 workspaces were terminated spuriously (the nodes they ran on were marked as not ready). Because workspaces are plain pods, they are not re-created automatically and were therefore lost. For this scale, the autoscaler spun up 16 nodes in total. https://www.loom.com/share/b7fa5beef4134051984f4c157ae47552
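If you want to reproduce these numbers, a rough way to watch pod phases and node count during the run; the `component=workspace` label is an assumption about how Gitpod labels workspace pods, so adjust it for your install.

```sh
# Count workspace pods per status (Running, Terminating, ...)
kubectl get pods -l component=workspace --no-headers | awk '{print $3}' | sort | uniq -c

# How many nodes the autoscaler has brought up
kubectl get nodes --no-headers | wc -l
```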
@lucasvaltl suggested we try more, i.e. 500 workspaces. Will post an update here.
@Pothulapati could you share a link to the loadgen config that you are using? :pray:
@kylos101 I'm using the configs in the repo, specifically prod-benchmark.yaml.
For the 500-workspace test, I'm running the same config with a higher workspace count.
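Concretely, something like the following, assuming the workspace count in prod-benchmark.yaml is a plain `workspaces:` field; check the actual config before editing.

```sh
# Bump the workspace count from 100 to 500 and re-run the same benchmark
sed -i 's/^workspaces: 100$/workspaces: 500/' prod-benchmark.yaml
./loadgen benchmark prod-benchmark.yaml
```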
Update on the 500 workspaces load suggestion:
We started out well until 250 workspaces.

But once we reached the 60m mark, a bunch of workspaces timed out. As I can't find any errors on the components, this is probably a timeout issue. So even though loadgen applied 500, we ended up with only 350 running workspaces by the end.

So, results on EKS seem pretty good! 👍🏼
👋 Hey there, I am going to load test the autoscaling for GKE in https://github.com/gitpod-io/website/issues/2888. Closing this issue in favor of that one (a load test was also recently done as part of the September release for GKE and EKS).