Frequent pod restarts for verdaccio
We are seeing frequent pod restarts when building our software project.
❱ kubectl get pods -n verdaccio
NAME                                 READY   STATUS    RESTARTS   AGE
verdaccio-7855f98ff7-2drjp           1/1     Running   1          20h
verdaccio-7855f98ff7-l4lj8           1/1     Running   2          20h
verdaccio-7855f98ff7-wgtg2           1/1     Running   0          20h
verdaccio-sandbox-5d6959b689-z48zq   1/1     Running   0          20h
❱
There aren't any logs associated with the pod restarts and in the pod events we see that the pods are restarted due to liveness probe failures:
Events:
Type     Reason     Age                  From     Message
----     ------     ----                 ----     -------
Warning  Unhealthy  52m (x16 over 20h)   kubelet  Liveness probe failed: Get "http://172.31.10.158:4873/-/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  52m (x16 over 20h)   kubelet  Readiness probe failed: Get "http://172.31.10.158:4873/-/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
The only significant customization in our values file is how we handled the packages section in the config:
packages:
  '@kpn/*':
    access: $authenticated
    publish: <username>
    unpublish: <username>
  '**':
    access: $all
    proxy: npmjs
We also tried vertically and horizontally scaling up the pods to no avail:
{{- if eq (requiredEnv "ENVIRONMENT") "prod" }}
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 1Gi
replicaCount: 3
{{- end }}
These restarts wouldn't be a hassle if they didn't interrupt the build process:
npm ERR! code ETIMEDOUT
npm ERR! syscall connect
npm ERR! errno ETIMEDOUT
npm ERR! network request to http://172.31.46.209:32734/@types%2fnode/-/node-12.20.43.tgz failed, reason: connect ETIMEDOUT 172.31.46.209:32734
npm ERR! network This is a problem related to network connectivity.
npm ERR! network In most cases you are behind a proxy or have bad network settings.
I should also say that we switched the service type to NodePort
per https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1695#issuecomment-735974898
So I have a deployment that is suffering from pod restarts as well. The issue seems to be caused by sporadic high CPU spikes, which cause the health checks to fail, which in turn causes the kubelet to restart the container. We've narrowed down the cause of the CPU spikes to certain requests coming from a specific type of client (JFrog Artifactory). However, we have not been able to figure out what is unique about these requests that causes the spikes. The number of incoming requests is low, so it is not an issue related to request volume.
Can I ask if you have metrics around CPU and if you are seeing spikes in CPU around the same time the health checks fail and the pod restarts occur? Can you share what client(s) are making requests to your Verdaccio instance and if you see a correlation between the client type (user-agent) and these pod restarts?
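In case it helps, here is a rough sketch of how request latency could be grouped by user-agent from newline-delimited JSON access logs; the file name and the userAgent / durationMs field names are assumptions, so adjust them to whatever your logger or load balancer actually emits:

// Rough sketch: tally request count and average latency per user-agent from
// newline-delimited JSON access logs. The "access.log" path and the
// "userAgent" / "durationMs" field names are assumptions.
const fs = require('fs');
const readline = require('readline');

async function main() {
  const rl = readline.createInterface({
    input: fs.createReadStream('access.log'),
    crlfDelay: Infinity,
  });

  const stats = new Map(); // user-agent -> { count, totalMs }
  for await (const line of rl) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip lines that are not JSON
    }
    const ua = entry.userAgent || 'unknown';
    const ms = Number(entry.durationMs) || 0;
    const current = stats.get(ua) || { count: 0, totalMs: 0 };
    current.count += 1;
    current.totalMs += ms;
    stats.set(ua, current);
  }

  for (const [ua, { count, totalMs }] of stats) {
    console.log(`${ua}: ${count} requests, avg ${(totalMs / count).toFixed(1)} ms`);
  }
}

main();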
@edclement we should probably follow up here and mention that we've solved this and tracked it down to an issue with bcrypt. We plan to open an issue against Verdaccio core and work on a fix for it as well.
As @Splaktar mentioned, we managed to figure out the root cause of this issue for our cluster. The majority of our users authenticate using a JWT token. However, users that use JFrog Artifactory to mirror packages from our registry must use a username/password to authenticate, since Artifactory does not support the use of a JWT token.
Looking at metrics for our cluster, I noticed that the requests coming from most (but not all) of our Artifactory users had high latency (a high response time). The response latency could reach the maximum configured 60 seconds on all incoming requests when one or more Artifactory instances were sending our cluster a higher than normal volume of requests, and CPU would spike to 100% at the same time. Once the 60 second mark was hit and Verdaccio had not yet responded to the request, the load balancer would automatically close the client connection to Verdaccio and respond with a 502. It's also worth mentioning that Artifactory has a habit of sending batches of requests concurrently, sometimes milliseconds apart.
It turns out that the root of this issue was the number of salt rounds used to hash the password. We had initially been using the htpasswd command from the apache2-utils package, which (after looking at the docs) defaults to using just 5 salt rounds. Later we switched to a different tool that used a higher number of salt rounds. The users whose passwords were hashed with just 5 salt rounds did not cause any CPU spikes in our Verdaccio cluster, while the users whose passwords were hashed with the higher number of salt rounds did induce high CPU load. This makes sense to some extent, given that the cost to verify a password is directly related to the number of salt rounds used to hash it.
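To put numbers on that relationship: bcrypt's work factor is exponential, so each additional salt round roughly doubles the cost of verification (cost 10 is about 32x the work of cost 5). A quick standalone sketch to measure this locally, assuming the bcryptjs package is installed:

// Measure how bcrypt verification time scales with the cost factor
// (salt rounds). Assumes the bcryptjs package is installed locally.
const bcrypt = require('bcryptjs');

for (const rounds of [5, 10, 12]) {
  const hash = bcrypt.hashSync('s3cret', rounds);
  const start = process.hrtime.bigint();
  bcrypt.compareSync('s3cret', hash);
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`cost ${rounds}: compareSync took ${elapsedMs.toFixed(1)} ms`);
}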
There is more to the story here though. During the testing that uncovered this issue, it became apparent that when CPU for a Verdaccio pod spiked (a minimal sketch of the effect follows this list):
- Verdaccio would begin responding to requests in an almost sequential manner (FIFO)
- Verdaccio would not respond to any new incoming requests from other clients (e.g. npm, yarn, etc.). This included the health check requests made by k8s.
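The following standalone sketch is not Verdaccio code; it only assumes Node's built-in http module and the bcryptjs package, and it reproduces that second behavior: while a synchronous bcrypt comparison runs, even a trivial /-/ping handler cannot answer.

// Standalone sketch (not Verdaccio code): a synchronous bcrypt comparison
// blocks the Node.js event loop, so a concurrent /-/ping request is not
// answered until the comparison finishes. Assumes bcryptjs is installed.
const http = require('http');
const bcrypt = require('bcryptjs');

const hash = bcrypt.hashSync('hunter2', 12); // a relatively high cost factor

const server = http.createServer((req, res) => {
  if (req.url === '/-/ping') {
    res.end('pong\n'); // does no real work, should respond instantly
  } else {
    bcrypt.compareSync('hunter2', hash); // blocks the event loop while it runs
    res.end('authenticated\n');
  }
});

server.listen(4873, () => {
  const start = Date.now();
  // Simulate a basic-auth style request (the expensive path)...
  http.get('http://localhost:4873/auth', (res) => res.resume());
  // ...followed shortly by a health check. Its response is delayed until
  // compareSync returns, which is what the kubelet probes were observing.
  setTimeout(() => {
    http.get('http://localhost:4873/-/ping', (res) => {
      res.resume();
      console.log(`/-/ping answered after ${Date.now() - start} ms`);
      process.exit(0);
    });
  }, 25);
});

With a pure-JS bcrypt implementation a single high-cost comparison can take hundreds of milliseconds, so a small burst of concurrent basic-auth requests is enough to hold the event loop past a probe timeout.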
I did a little digging in the Verdaccio htpasswd auth plugin and noticed this line: return bcrypt.compareSync(passwd, hash);
This operation is not performed asynchronously and blocks the Node.js event loop. There is an async version of this function that is not being used by the htpasswd plugin. The docs for the bcryptjs package state:
Note: Under the hood, asynchronisation splits a crypto operation into small chunks. After the completion of a chunk, the execution of the next chunk is placed on the back of JS event loop queue, thus efficiently sharing the computational resources with the other operations in the queue.
We plan on opening an issue to switch the bcrypt.compareSync(passwd, hash) call to bcrypt.compare(passwd, hash) in order to mitigate the current blocking behavior when passwords are being verified.
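As a rough sketch with hypothetical wrapper names (not the actual plugin code), the difference looks like this; bcryptjs returns a promise when compare() is called without a callback:

// Hypothetical wrappers illustrating the proposed change; not the actual
// htpasswd plugin code. Assumes the bcryptjs package.
const bcrypt = require('bcryptjs');

// Current, blocking: ties up the event loop for the whole comparison.
function verifyPasswordSync(passwd, hash) {
  return bcrypt.compareSync(passwd, hash);
}

// Proposed, non-blocking: the comparison is chunked onto the event loop.
async function verifyPassword(passwd, hash) {
  return bcrypt.compare(passwd, hash);
}

Callers would then need to await the result (or attach a callback) instead of using the return value directly.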
@juanpicado - you will likely be interested in these findings.
@edclement thanks for the detailed description. To be honest, I was not aware there was a sync function. I'll enable the sync warnings from Node.js from now on to detect this kind of case, and I'll check it out :)
I meant this: https://nodejs.org/api/cli.html#--trace-sync-io just in case you can also use it to find more sync places around Verdaccio, but it is very noisy.
Running via Helm with image 5.8.0 and getting the same issue with 3 replicas, because of failed liveness probes (which will be configurable soon). I suspect the npm runtime on the Docker image itself may need tweaking, or there is more synchronous code.