
Load test the Single Cluster Reference Architecture

lucasvaltl opened this issue on Jun 23 '22

Summary

Load test the production reference architecture so that we can say how far it can scale.

Context

We are using our reference architectures as the way to communicate what Gitpod can and cannot do given a specific set of infrastructure. It makes sense to also talk about scale in this context.

Value

  • Confidence in the scale that the production reference architecture can handle

Acceptance Criteria

  • We have a measure of how far the reference architecture can scale that is relevant to our users, e.g. the number of workspaces of size X that can run on N nodes.

Measurement

  • We have a metric that tracks scale, and we can re-run these tests to see if something has changed (one possible way to record such a metric is sketched below).
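
One option for that metric, as a minimal sketch rather than something that exists today: assuming the cluster runs kube-state-metrics and the Prometheus Operator, and that workspace pods follow the usual ws-<id> naming convention, a recording rule could track the number of running workspaces over time.

```yaml
# Hypothetical PrometheusRule that records how many workspace pods are Running.
# Assumes kube-state-metrics is installed and workspace pods are named ws-<id>;
# the rule name, namespace, and selector are placeholders to adapt.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: loadtest-scale-metrics   # hypothetical name
  namespace: monitoring          # assumption: monitoring stack namespace
spec:
  groups:
    - name: loadtest.rules
      rules:
        - record: loadtest:running_workspaces:count
          expr: sum(kube_pod_status_phase{phase="Running", pod=~"ws-.*"})
```

Re-running the same loadgen scenario and comparing this series across runs would make regressions in achievable scale visible.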

Implementation Ideas

  • You can use loadgen for this: https://github.com/gitpod-io/gitpod/tree/main/dev/loadgen. We use this scenario to create 55 workspaces running some stress tasks (a rough sketch of what such a scenario looks like follows below).
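
For illustration, a benchmark scenario along those lines could look roughly like the sketch below. The field names here are assumptions rather than the real schema; the actual configs (e.g. prod-benchmark.yaml) live under dev/loadgen in the repo.

```yaml
# Illustrative sketch of a loadgen-style benchmark scenario.
# Field names are assumptions, not the authoritative schema -- see the YAML
# files under dev/loadgen in the gitpod repo for real examples.
workspaces: 55                       # total number of workspaces to start
repos:
  - cloneURL: https://github.com/gitpod-io/gitpod
    cloneTarget: main                # branch/ref to check out
    score: 100                       # relative weight when picking this repo
    workspaceImage: example.io/workspace-stress:latest   # hypothetical image that runs the stress tasks
```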

Additional Context

lucasvaltl avatar Jun 23 '22 13:06 lucasvaltl

This is likely not just a team self-hosted thing but a Gitpod thing :)

lucasvaltl avatar Jun 23 '22 13:06 lucasvaltl

We should move this to scheduled once we have the terraform scripts (#11027) and maybe a werft command to spin up a self-hosted environment.

lucasvaltl avatar Jul 06 '22 09:07 lucasvaltl

👋🏼

Update:

I've worked on the EKS side of the single-cluster reference architecture this week. This required working on #12577, as we need to scale the cluster based on the number of workspaces. Once that change was merged, loadgen just worked as you would expect.
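
For context, scaling the cluster with the number of workspaces on EKS essentially means giving the workspace node group enough headroom and letting cluster-autoscaler discover it. A minimal eksctl-style sketch (cluster name, region, instance type, sizes, and the node label are assumptions, not values taken from the reference architecture):

```yaml
# Minimal eksctl sketch: a workspace node group that cluster-autoscaler can grow.
# Cluster name, region, instance type, sizes, and labels are assumptions.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: gitpod-loadtest            # hypothetical cluster name
  region: eu-west-1                # hypothetical region
managedNodeGroups:
  - name: workspaces
    instanceType: m6i.2xlarge      # hypothetical instance type
    minSize: 1
    maxSize: 20                    # headroom for the load test
    desiredCapacity: 1
    labels:
      gitpod.io/workload_workspace_regular: "true"   # assumed to match Gitpod's workspace node selector
    tags:
      # auto-discovery tags read by cluster-autoscaler
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/gitpod-loadtest: "owned"
```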

Because the images were part of the gitpod-dev registry (which is a private registry), we used #12174, which was recently added, to pass Docker credentials into the cluster. :)
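
Whatever mechanism #12174 uses under the hood, the resulting Kubernetes object is an image pull secret along these lines (name and namespace are placeholders):

```yaml
# Image pull secret for a private registry; name and namespace are placeholders.
# .dockerconfigjson holds a base64-encoded Docker config.json with the registry
# credentials, and the secret is referenced as an imagePullSecret by the pods.
apiVersion: v1
kind: Secret
metadata:
  name: gitpod-dev-registry-auth   # hypothetical name
  namespace: default               # adjust to wherever workspace pods run
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded docker config.json>   # elided
```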

Result: We started with the config in prod-benchmark.yaml, which spins up 100 workspaces. We saw a success rate of around 0.95: the remaining 5 workspaces were terminated spuriously (the underlying nodes were marked as not ready). Because these are plain pods, they are not re-applied automatically and are therefore lost. For this scale, the autoscaler spun up 16 nodes in total. https://www.loom.com/share/b7fa5beef4134051984f4c157ae47552

@lucasvaltl suggested we try a larger run, i.e. 500 workspaces. I will post updates here.

Pothulapati avatar Sep 02 '22 11:09 Pothulapati

@Pothulapati could you share a link to the loadgen config that you are using? :pray:

kylos101 avatar Sep 02 '22 12:09 kylos101

@kylos101 I'm using the configs in the repo, specifically the prod-benchmark.yaml

For the 500-workspace test, I'm running the same config but with higher scores, i.e. a higher number of workspaces.

Pothulapati avatar Sep 02 '22 12:09 Pothulapati

Update on the 500 workspaces load suggestion:

We started out well, up to 250 workspaces.


But once we reached the 60m mark, a bunch of workspaces timed out. Since I can't find any errors in the components, this is probably a timeout issue. So even though loadgen applied 500, we ended up with only 350 running workspaces by the end.


So, results on EKS seem pretty good! 👍🏼

Pothulapati avatar Sep 02 '22 13:09 Pothulapati

👋 hey there, I am going to load test the autoscaling for GKE in https://github.com/gitpod-io/website/issues/2888. Closing this issue in favor of that one (a load test was also recently done as part of the September release for GKE and EKS).

kylos101 avatar Oct 17 '22 21:10 kylos101