renku-python
feat(service): horizontal scaling
/deploy renku-gateway=master renku-ui=update-renku-version-endpoint #persist
You can access the deployment of this PR at https://renku-ci-rp-3178.dev.renku.ch
We have to set resource requests for all containers in the core service so that the pod autoscaler works.
I propose the following resource requests (sketched as a pod spec right after this list):
- core: 4Gi
- core-datasets-workers: 2Gi
- core-management-workers: 100Mi
- core-scheduler: 100Mi
- traefik: 100Mi (no historical data for this specifically; this is based on data from the gateway traefik instance)
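For illustration, this is roughly how those requests could look in the core service pod spec. This is a sketch only; the container names and surrounding structure are assumptions, not copied from the Helm chart:

```yaml
# Sketch of the proposed memory requests; container names are assumed.
spec:
  containers:
    - name: core
      resources:
        requests:
          memory: 4Gi
    - name: core-datasets-workers
      resources:
        requests:
          memory: 2Gi
    - name: core-management-workers
      resources:
        requests:
          memory: 100Mi
    - name: core-scheduler
      resources:
        requests:
          memory: 100Mi
    - name: traefik
      resources:
        requests:
          memory: 100Mi
```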
I looked at the historical memory consumption in Gi over the last 90 days on renkulab.io:
(Memory consumption graphs for core, core-datasets-workers, core-management-workers, and core-scheduler.)
CPU usage is fairly low, and IMO we should not bother including it in the HPA.
What these changes mean:
- Routing is more complicated to maintain in order to get the sticky session cookies.
- The pod disruption budget prevents an admin from draining nodes (or doing similar things) and accidentally removing all instances of the core service. With this PDB, for example, an admin (or similar user) is prevented from draining a node if that would bring the number of replicas below 1.
- The horizontal pod autoscaler will aim to keep the memory utilization of the core service pod at 50%. The utilization is calculated as the sum of all container memory usage divided by the sum of all container memory requests in the pod. This scales up and down accordingly, but never below 2 replicas (see the manifest sketch after this list). The minimum of 2 replicas works together with the pod disruption budget: setting minimum replicas and the PDB both to 1 would mean that an admin cannot evict your service at all and needs to come talk to you before doing so.
- The update strategy is now "RollingUpdate" with maxUnavailable set to 0 and maxSurge set to 1. This means that during an update the number of available replicas is maintained and the update is done one replica at a time. So if you have 2 replicas with maxUnavailable 0 and maxSurge 1, the update process adds one extra new replica, waits for it to become available, kills an old replica, adds another new replica, waits for that to become available, and finally kills the last old replica. Note that during this time you will have replicas running different versions of the code.
- Added tini to the docker container; it properly handles the termination signal that k8s sends to the pod and its containers and forwards it to all processes. Otherwise k8s sends the signal and after 30 seconds forcefully kills everything. Tini just makes sure the signal reaches all processes running in the container.
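For illustration, here is a minimal sketch of the objects described above. Object names, labels, and the maxReplicas ceiling are assumptions, not values from the actual chart:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: renku-core                 # assumed name
spec:
  minAvailable: 1                  # voluntary disruptions can never drop below 1 replica
  selector:
    matchLabels:
      app: renku-core
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: renku-core
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: renku-core
  minReplicas: 2                   # never scale below 2, so the PDB can always be satisfied
  maxReplicas: 10                  # assumed ceiling
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 50   # sum(container usage) / sum(container requests) per pod
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: renku-core
spec:                              # selector and pod template omitted for brevity
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0            # keep all current replicas serving during the update
      maxSurge: 1                  # add one new replica at a time
```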
This is how the routing changes:
Current:
```mermaid
flowchart LR
    Browser
    subgraph Ingress [Ingress]
        IngressRenku[http://renkulab.io/ui-server/api/renku]
    end
    subgraph k8s[k8s cluster]
        UI[UI-server]
        subgraph Gateway
            GatewayTraefik[Gateway traefik]
            GatewayAuth[Gateway-auth]
        end
        subgraph CoreSvc[Core Service Pod]
            Core
        end
    end
    Browser -- 1 --> IngressRenku
    IngressRenku -- 2 --> UI
    UI -- 3 --> GatewayTraefik
    GatewayTraefik -- 4 --> GatewayAuth
    GatewayAuth -- 5 --> GatewayTraefik
    GatewayTraefik -- 6 --> Core
```
New:
```mermaid
flowchart LR
    Browser
    subgraph Ingress [Ingress]
        IngressRenku[http://renkulab.io/ui-server/api/renku]
        IngressCore[http://renkulab.io/api/renku]
    end
    subgraph k8s[k8s cluster]
        UI[UI-server]
        subgraph Gateway
            GatewayTraefik[Gateway traefik]
            GatewayAuth[Gateway-auth]
        end
        subgraph CoreSvc[Core Service Pod]
            Core
            Traefik
        end
    end
    Browser -- 1 --> IngressRenku
    IngressRenku -- 2 --> UI
    UI -- 3 --> GatewayTraefik
    GatewayTraefik -- 4 --> IngressCore
    IngressCore -- 5 --> Traefik
    Traefik -- 6 --> GatewayAuth
    GatewayAuth -- 7 --> Traefik
    Traefik -- 8 --> Core
```
The gateway uses traefik to do the routing, and traefik cannot assign sticky session cookies: it only sees the address of the k8s service, and the round-robin load balancing that the k8s service does is not visible to traefik. But the k8s ingress does know which actual replica the k8s service will use and can assign the sticky session cookie. That is why we now need to go through the ingress to get sticky sessions to work.
In the new version of the routing, the core service's traefik container has to go to the gateway to get authenticated and to exchange the JWT for any other token it needs.
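For example, assuming an nginx ingress controller, cookie-based session affinity is enabled with annotations on the ingress. The cookie and service names here are illustrative, not taken from the actual chart:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: renku-core
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "renku-core-session"  # assumed cookie name
spec:
  rules:
    - host: renkulab.io
      http:
        paths:
          - path: /api/renku
            pathType: Prefix
            backend:
              service:
                name: renku-core   # assumed service name
                port:
                  number: 80
```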
Results from load testing:
Migrations
- I did this with a 1-year-old project (that does not have a lot of commits).
- I could not find a better or more complicated project. I tried https://dev.renku.ch/gitlab/mohammad.alisafaee/old-datasets-v0.6.0-with-submodules but it said it requires manual migration.
- 30 concurrent migrations on my 1-year-old simple project ran with these response times, and all 30 migrations completed successfully:
  avg=1.35s min=4.59ms med=365.41ms max=20.69s p(90)=3.58s p(95)=5.33s
- Running the same thing on dev.renku.ch produced the following response times, and all migrations completed successfully (gitlab had some instability when forking, but the migrations had no issues):
  avg=3.05s min=4.48ms med=1.04s max=18.81s p(90)=9.17s p(95)=12.68s
File uploads
- 10 concurrent uploads, each uploading a 100MB file.
- Most often only half of the uploads succeed; sometimes, when you get lucky, all will succeed.
- This also hits the memory leak problem, but the leak was not what crashed the uploads; something else was.
- On dev this is how long the requests took:
  avg=2.18s min=0s med=2.34s max=59.99s p(90)=2.94s p(95)=3.14s
- On this CI deployment this is how long the requests took:
  avg=390.62ms min=3.73ms med=384.41ms max=2.79s p(90)=591.64ms p(95)=671.37ms
- The tests took considerably longer to finish on dev.renku.ch than on the CI deployment, and the response times confirm this.
This is great @olevski ! Really glad to see this as it will make updates much smoother!
Pull Request Test Coverage Report for Build 5313354459
- 33 of 36 (91.67%) changed or added relevant lines in 6 files are covered.
- 24 unchanged lines in 8 files lost coverage.
- Overall coverage decreased (-0.01%) to 85.941%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| renku/ui/service/views/versions_list.py | 10 | 11 | 90.91% |
| renku/ui/service/controllers/versions_list.py | 9 | 11 | 81.82% |
| Total: | 33 | 36 | 91.67% |
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| renku/core/dataset/providers/dataverse.py | 1 | 64.76% |
| renku/infrastructure/repository.py | 1 | 81.41% |
| renku/ui/cli/init.py | 1 | 96.9% |
| renku/ui/service/controllers/project_lock_status.py | 1 | 92.5% |
| renku/command/rollback.py | 2 | 78.7% |
| renku/core/util/git.py | 2 | 85.66% |
| renku/core/dataset/context.py | 3 | 91.43% |
| renku/ui/cli/service.py | 13 | 62.69% |
| Total: | 24 | |
| Totals | |
|---|---|
| Change from base Build 5292482422: | -0.01% |
| Covered Lines: | 25846 |
| Relevant Lines: | 30074 |
💛 - Coveralls
@Panaetius this is good to go. I cannot approve because I opened the PR in the first place.
Does this require refreshing https://github.com/SwissDataScienceCenter/renku-ui/pull/2134 ?
No. While the versions list is no longer served by nginx but by the individual core svc, the content/URL shouldn't have changed (`/api/renku/versions` will just go to/redirect to `/api/renku/v10/2.0/versions`).