renku-python
feat(service): horizontal scaling
/deploy renku-gateway=master renku-ui=update-renku-version-endpoint #persist
You can access the deployment of this PR at https://renku-ci-rp-3178.dev.renku.ch
We have to set resource requests for all containers in the core service so that the pod autoscaler works.
I propose the following resource requests (sketched as a pod spec right after this list):
- core: 4Gi
- core-datasets-workers: 2Gi
- core-management-workers: 100Mi
- core-scheduler: 100Mi
- traefik: 100Mi (no historical data for this specifically; this is based on data from the gateway traefik instance)
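For illustration, this is roughly how those requests could look in the core service pod spec. This is a sketch only; the container names and surrounding structure are assumptions, not copied from the Helm chart:

```yaml
# Sketch of the proposed memory requests; container names are assumed.
spec:
  containers:
    - name: core
      resources:
        requests:
          memory: 4Gi
    - name: core-datasets-workers
      resources:
        requests:
          memory: 2Gi
    - name: core-management-workers
      resources:
        requests:
          memory: 100Mi
    - name: core-scheduler
      resources:
        requests:
          memory: 100Mi
    - name: traefik
      resources:
        requests:
          memory: 100Mi
```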
I looked at the historical memory consumption in Gi over the last 90 days on renkulab.io:
(Memory consumption graphs for core, core-datasets-workers, core-management-workers, and core-scheduler.)
CPU usage is fairly low, and IMO we should not bother including it in the HPA.
What these changes mean:
- Routing is more complicated to maintain in order to get the sticky session cookies.
- The pod disruption budget prevents an admin from draining nodes (or doing similar things) and accidentally removing all instances of the core service. With this PDB, for example, an admin (or similar user) is prevented from draining a node if that would bring the number of replicas below 1.
- The horizontal pod autoscaler will aim to keep the memory utilization of the core service pod at 50%. The utilization is calculated as the sum of all container memory usage divided by the sum of all container memory requests in the pod. This scales up and down accordingly, but never below 2 replicas (see the manifest sketch after this list). The minimum of 2 replicas works together with the pod disruption budget: setting minimum replicas and the PDB both to 1 would mean that an admin cannot evict your service at all and needs to come talk to you before doing so.
- The update strategy is now "RollingUpdate" with maxUnavailable set to 0 and maxSurge set to 1. This means that during an update the number of available replicas is maintained and the update is done one replica at a time. So if you have 2 replicas with maxUnavailable 0 and maxSurge 1, the update process adds one extra new replica, waits for it to become available, kills an old replica, adds another new replica, waits for that to become available, and finally kills the last old replica. Note that during this time you will have replicas running different versions of the code.
- Added tini to the docker container; it properly handles the termination signal that k8s sends to the pod and its containers and forwards it to all processes. Otherwise k8s sends the signal and after 30 seconds forcefully kills everything. Tini just makes sure the signal reaches all processes running in the container.
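For illustration, here is a minimal sketch of the objects described above. Object names, labels, and the maxReplicas ceiling are assumptions, not values from the actual chart:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: renku-core                 # assumed name
spec:
  minAvailable: 1                  # voluntary disruptions can never drop below 1 replica
  selector:
    matchLabels:
      app: renku-core
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: renku-core
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: renku-core
  minReplicas: 2                   # never scale below 2, so the PDB can always be satisfied
  maxReplicas: 10                  # assumed ceiling
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 50   # sum(container usage) / sum(container requests) per pod
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: renku-core
spec:                              # selector and pod template omitted for brevity
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0            # keep all current replicas serving during the update
      maxSurge: 1                  # add one new replica at a time
```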
This is how the routing changes:
Current:
```mermaid
flowchart LR
    Browser
    subgraph Ingress [Ingress]
        IngressRenku[http://renkulab.io/ui-server/api/renku]
    end
    subgraph k8s[k8s cluster]
        UI[UI-server]
        subgraph Gateway
            GatewayTraefik[Gateway traefik]
            GatewayAuth[Gateway-auth]
        end
        subgraph CoreSvc[Core Service Pod]
            Core
        end
    end
    Browser -- 1 --> IngressRenku
    IngressRenku -- 2 --> UI
    UI -- 3 --> GatewayTraefik
    GatewayTraefik -- 4 --> GatewayAuth
    GatewayAuth -- 5 --> GatewayTraefik
    GatewayTraefik -- 6 --> Core
```
New:
```mermaid
flowchart LR
    Browser
    subgraph Ingress [Ingress]
        IngressRenku[http://renkulab.io/ui-server/api/renku]
        IngressCore[http://renkulab.io/api/renku]
    end
    subgraph k8s[k8s cluster]
        UI[UI-server]
        subgraph Gateway
            GatewayTraefik[Gateway traefik]
            GatewayAuth[Gateway-auth]
        end
        subgraph CoreSvc[Core Service Pod]
            Core
            Traefik
        end
    end
    Browser -- 1 --> IngressRenku
    IngressRenku -- 2 --> UI
    UI -- 3 --> GatewayTraefik
    GatewayTraefik -- 4 --> IngressCore
    IngressCore -- 5 --> Traefik
    Traefik -- 6 --> GatewayAuth
    GatewayAuth -- 7 --> Traefik
    Traefik -- 8 --> Core
```
The gateway uses traefik to do the routing, and traefik cannot assign sticky session cookies: it only sees the address of the k8s service, and the round-robin load balancing that the k8s service does is not visible to traefik. But the k8s ingress does know which actual replica the k8s service will use and can assign the sticky session cookie. That is why we now need to go through the ingress to get sticky sessions to work.
In the new version of the routing, the core service's traefik container has to go to the gateway to get authenticated and to exchange the JWT for any other token it needs.
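For example, assuming an nginx ingress controller, cookie-based session affinity is enabled with annotations on the ingress. The cookie and service names here are illustrative, not taken from the actual chart:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: renku-core
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "renku-core-session"  # assumed cookie name
spec:
  rules:
    - host: renkulab.io
      http:
        paths:
          - path: /api/renku
            pathType: Prefix
            backend:
              service:
                name: renku-core   # assumed service name
                port:
                  number: 80
```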
Results from load testing:
Migrations
- I did this with a 1-year-old project (that does not have a lot of commits).
- I could not find a better or more complicated project. I tried https://dev.renku.ch/gitlab/mohammad.alisafaee/old-datasets-v0.6.0-with-submodules but it said it requires manual migration.
- 30 concurrent migrations on my 1-year-old simple project ran with these response times, and all 30 migrations completed successfully:
  avg=1.35s min=4.59ms med=365.41ms max=20.69s p(90)=3.58s p(95)=5.33s
- Running the same thing on dev.renku.ch produced the following response times, and all migrations completed successfully (gitlab had some instability when forking, but the migrations had no issues):
  avg=3.05s min=4.48ms med=1.04s max=18.81s p(90)=9.17s p(95)=12.68s
File uploads
- 10 concurrent uploads, each uploading a 100MB file.
- Most often only half of the uploads succeed; sometimes, when you get lucky, all will succeed.
- This also hits the memory leak problem, but the leak was not what crashed the uploads; something else was.
- On dev this is how long the requests took:
  avg=2.18s min=0s med=2.34s max=59.99s p(90)=2.94s p(95)=3.14s
- On this CI deployment this is how long the requests took:
  avg=390.62ms min=3.73ms med=384.41ms max=2.79s p(90)=591.64ms p(95)=671.37ms
- The tests took considerably longer to finish on dev.renku.ch than on the CI deployment, and the response times confirm this.
This is great @olevski ! Really glad to see this as it will make updates much smoother!
Pull Request Test Coverage Report for Build 5313354459
- 33 of 36 (91.67%) changed or added relevant lines in 6 files are covered.
- 24 unchanged lines in 8 files lost coverage.
- Overall coverage decreased (-0.01%) to 85.941%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| renku/ui/service/views/versions_list.py | 10 | 11 | 90.91% |
| renku/ui/service/controllers/versions_list.py | 9 | 11 | 81.82% |
| Total: | 33 | 36 | 91.67% |
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| renku/core/dataset/providers/dataverse.py | 1 | 64.76% |
| renku/infrastructure/repository.py | 1 | 81.41% |
| renku/ui/cli/init.py | 1 | 96.9% |
| renku/ui/service/controllers/project_lock_status.py | 1 | 92.5% |
| renku/command/rollback.py | 2 | 78.7% |
| renku/core/util/git.py | 2 | 85.66% |
| renku/core/dataset/context.py | 3 | 91.43% |
| renku/ui/cli/service.py | 13 | 62.69% |
| Total: | 24 | |
| Totals | |
|---|---|
| Change from base Build 5292482422: | -0.01% |
| Covered Lines: | 25846 |
| Relevant Lines: | 30074 |
💛 - Coveralls
@Panaetius this is good to go. I cannot approve because I opened the PR in the first place.
Does this require refreshing https://github.com/SwissDataScienceCenter/renku-ui/pull/2134 ?
No. While the versions list is no longer served by nginx but by the individual core svc, the content/URL shouldn't have changed (`/api/renku/versions` will just go to/redirect to `/api/renku/v10/2.0/versions`).