issuer-kit issuer-kit-api - Repeated health check failures on OCP

The health check for the issuer-kit API checks its connection with the database. We're seeing a few instances with repeated health check failures (losing the connection to their database), causing repeated pod restarts.

Affected instances:

a99fd4-dev
- api-openvp
a99fd4-test
- api-openvp
- api-bcvcpilot
a99fd4-prod
- api-openvp

Jun 20 '23 19:06 WadeBarnes

Could be related to https://github.com/feathersjs/feathers/issues/3159

Jun 29 '23 12:06 esune

The recent instances of this issue identified by SysDig are for agent instances that are not being moved, nor is the database. The agent starts up just fine and at some point loses its connection to the database. In other cases it seems it may be an issue with the database pod restarting (but not in all cases). More investigation is required.

Jun 29 '23 13:06 WadeBarnes

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Aug 28 '23 18:08 stale[bot]

I've looked into this a bit more. Some of the issues were caused by the associated MongoDB instances getting OOM killed. I adjusted the resources here, and that helped a bit. I've also upgraded the issuer-kit components here. The latest versions are in dev and seem to have reduced the number of pod restarts.

Oct 12 '23 16:10 WadeBarnes

The issues are related to this code:

https://github.com/bcgov/issuer-kit/blob/9a0e22948faa19f0bf090383fcd9acfef51f1a36/api/src/app.ts#L36-L40
https://github.com/bcgov/issuer-kit/blob/9a0e22948faa19f0bf090383fcd9acfef51f1a36/api/src/mongodb.ts#L13-L27

The pod seems to lose its connection to the database after a while. Once that happens it never reconnects, which causes the health checks to fail until the pod is restarted.
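
One possible mitigation for the "never reconnects" behaviour is to have the health check discard a dead connection and try a fresh one before reporting failure. The sketch below is not the issuer-kit code and does not use the MongoDB driver API; `DbHandle`, `Connect` and `ReconnectingDb` are hypothetical names introduced only to illustrate the idea, and the real client would need to be adapted to this shape.

```typescript
// Hypothetical minimal view of a database client (an assumption for this
// sketch, not the MongoDB driver's interface).
interface DbHandle {
  ping(): Promise<void>; // rejects when the connection is unusable
  close(): Promise<void>;
}

// Hypothetical factory that opens a new connection.
type Connect = () => Promise<DbHandle>;

class ReconnectingDb {
  private handle: DbHandle | null = null;
  private readonly connect: Connect;

  constructor(connect: Connect) {
    this.connect = connect;
  }

  // Returns true if the database answers a ping. On failure the current
  // handle is dropped and one fresh connection is attempted before giving up.
  async healthy(): Promise<boolean> {
    for (let attempt = 0; attempt < 2; attempt++) {
      try {
        const handle = this.handle ?? (await this.connect());
        this.handle = handle;
        await handle.ping();
        return true;
      } catch {
        // Discard the stale handle so the next attempt reconnects.
        const stale = this.handle;
        this.handle = null;
        if (stale !== null) {
          await stale.close().catch(() => undefined);
        }
      }
    }
    return false;
  }
}
```

With something like this, a health endpoint could return the result of `healthy()`, so a dropped connection would only fail the probe if reconnecting also fails.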

Oct 12 '23 16:10 WadeBarnes

The easiest way to find the affected pods is to look at the list of pods and sort by Restarts, then pick out the ones at the top of the list that are using the issuer-kit-api image.

https://console.apps.silver.devops.gov.bc.ca/k8s/ns/a99fd4-dev/core~v1~Pod?orderBy=desc&sortBy=Restarts
https://console.apps.silver.devops.gov.bc.ca/k8s/ns/a99fd4-test/core~v1~Pod?orderBy=desc&sortBy=Restarts
https://console.apps.silver.devops.gov.bc.ca/k8s/ns/a99fd4-prod/core~v1~Pod?orderBy=desc&sortBy=Restarts

Pods:

api-openvp
api-openvp-candy
api-bcvcpilot
api-a2a
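
The same sort-by-restarts lookup can be scripted against pod JSON instead of the console. This is only a sketch: it assumes the input is the `items` array from something like `oc get pods -n a99fd4-dev -o json`, and `podsByRestarts` is a hypothetical helper name. The field names (`metadata.name`, `spec.containers[].image`, `status.containerStatuses[].restartCount`) are the standard Kubernetes Pod fields.

```typescript
// Subset of the Kubernetes Pod object used by this sketch.
interface Pod {
  metadata: { name: string };
  spec: { containers: { image: string }[] };
  status: { containerStatuses?: { restartCount: number }[] };
}

interface PodRestarts {
  name: string;
  restarts: number;
}

// Keeps pods that run an image containing `imageFragment`, sums the restart
// counts of their containers, and sorts with the most restarts first.
function podsByRestarts(pods: Pod[], imageFragment: string): PodRestarts[] {
  return pods
    .filter((pod) =>
      pod.spec.containers.some((c) => c.image.includes(imageFragment))
    )
    .map((pod) => ({
      name: pod.metadata.name,
      restarts: (pod.status.containerStatuses ?? []).reduce(
        (sum, s) => sum + s.restartCount,
        0
      ),
    }))
    .sort((a, b) => b.restarts - a.restarts);
}
```

Calling `podsByRestarts(items, "issuer-kit-api")` would then list the issuer-kit-api pods with the most restarts at the top.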

Oct 12 '23 16:10 WadeBarnes

Note the agent-a2a and api-openvp pods in dev have been running fine (without restarts) since being upgraded yesterday, while the other issuer-kit-api based pods have continued to encounter the restart issues.

Oct 12 '23 16:10 WadeBarnes

Thanks @WadeBarnes. Let's see how it goes after the upgrades and re-assess. If it is MongoClient that loses the connection and never recreates it, we might have to look at filing an issue in the library repo, potentially...

Oct 13 '23 17:10 esune

Since deploying the updated version of the issuer-kit-api image to dev on October 11th, the api-openvp-candy pod has restarted 140 times and the api-a2a pod has restarted only once. They are both connected to the same issuer-kit-db instance. Both containers are configured the same with respect to resource allocation and health check configuration.

Pod related events for api-openvp-candy:

There are none for api-a2a

The database logs are full of the api-openvp-candy pod connecting and then almost immediately disconnecting. Once disconnected, the pod restarts when its health check cycle fails.

Oct 16 '23 17:10 WadeBarnes
