issuer-kit
issuer-kit-api - Repeated health check failures on OCP
The health check for the issuer-kit API verifies its connection to the database. A few instances are repeatedly failing the health check (losing the connection to their database), which causes repeated pod restarts.
Affected instances:
- a99fd4-dev
  - api-openvp
- a99fd4-test
  - api-openvp
  - api-bcvcpilot
- a99fd4-prod
  - api-openvp
Could be related to https://github.com/feathersjs/feathers/issues/3159
The recent instances of this issue identified by SysDig are for agent instances that are not being moved, and neither is the database. The agent starts up just fine and at some point loses its connection to the database. In other cases the cause may be the database pod restarting (but not in all cases). More investigation is required.
I've looked into this a bit more. Some of the issues were caused by the associated MongoDB instances getting OOM killed. I adjusted the resources here, and that helped a bit. I've also upgraded the issuer-kit components here. The latest versions are in dev and seem to have reduced the number of pod restarts.
The issues are related to this code:
- https://github.com/bcgov/issuer-kit/blob/9a0e22948faa19f0bf090383fcd9acfef51f1a36/api/src/app.ts#L36-L40
- https://github.com/bcgov/issuer-kit/blob/9a0e22948faa19f0bf090383fcd9acfef51f1a36/api/src/mongodb.ts#L13-L27
The pod seems to lose its connection to the database after a while. Once that happens it never reconnects, so the health checks keep failing until the pod is restarted.
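One way to picture a mitigation is a health check that drops a dead connection and tries to open a new one before reporting failure. The following is only a minimal sketch under stated assumptions: `Connection`, `ReconnectingHealthCheck`, and the injected `connect` factory are hypothetical names for illustration, not part of issuer-kit, Feathers, or the MongoDB driver.

```typescript
// Hypothetical sketch: these names are illustrative and not taken from issuer-kit.
// A connection is anything that can be pinged and closed.
interface Connection {
  ping(): Promise<void>;
  close(): Promise<void>;
}

class ReconnectingHealthCheck {
  private conn: Connection | null = null;
  private readonly connect: () => Promise<Connection>;

  // `connect` is an injected factory, e.g. a thin wrapper around the real driver.
  constructor(connect: () => Promise<Connection>) {
    this.connect = connect;
  }

  // Returns true when the database answers a ping.
  // On failure, discards the connection and retries once with a fresh one.
  async isHealthy(): Promise<boolean> {
    for (let attempt = 0; attempt < 2; attempt++) {
      try {
        if (this.conn === null) {
          this.conn = await this.connect();
        }
        await this.conn.ping();
        return true;
      } catch {
        // Drop the stale connection so the next attempt opens a new one.
        const stale = this.conn;
        this.conn = null;
        if (stale !== null) {
          await stale.close().catch(() => undefined);
        }
      }
    }
    return false;
  }
}
```

With this shape, a single dropped connection would be repaired inside the health check instead of failing every probe until the pod is restarted. Whether that is appropriate here depends on why the connection is lost in the first place.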
The easiest way to find the affected pods is to look at the list of pods and sort by Restarts, then pick out the ones at the top of the list that are using the issuer-kit-api image.
- https://console.apps.silver.devops.gov.bc.ca/k8s/ns/a99fd4-dev/core~v1~Pod?orderBy=desc&sortBy=Restarts
- https://console.apps.silver.devops.gov.bc.ca/k8s/ns/a99fd4-test/core~v1~Pod?orderBy=desc&sortBy=Restarts
- https://console.apps.silver.devops.gov.bc.ca/k8s/ns/a99fd4-prod/core~v1~Pod?orderBy=desc&sortBy=Restarts
Pods:
- api-openvp
- api-openvp-candy
- api-bcvcpilot
- api-a2a
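The same "sort by Restarts, then filter by image" lookup could also be scripted against pod JSON. This is a rough sketch: the `Pod` interface is a trimmed-down subset of the Kubernetes core/v1 Pod fields, and the helper names `restartCount` and `mostRestarted` are made up for illustration.

```typescript
// Minimal subset of the Kubernetes core/v1 Pod shape needed for this lookup.
interface Pod {
  metadata: { name: string };
  spec: { containers: { image: string }[] };
  status: { containerStatuses?: { restartCount: number }[] };
}

// Total restarts across all containers in a pod (0 if no statuses are reported).
function restartCount(pod: Pod): number {
  return (pod.status.containerStatuses ?? []).reduce(
    (sum, c) => sum + c.restartCount,
    0
  );
}

// Pods with a container image reference containing `imageName`, most-restarted first.
// `filter` returns a new array, so the input list is not reordered.
function mostRestarted(pods: Pod[], imageName: string): Pod[] {
  return pods
    .filter((p) => p.spec.containers.some((c) => c.image.includes(imageName)))
    .sort((a, b) => restartCount(b) - restartCount(a));
}
```

For example, the `items` array from `oc get pods -o json` in one of the namespaces could be parsed and passed to `mostRestarted(items, "issuer-kit-api")`.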
Note the agent-a2a and api-openvp pods in dev have been running fine (without restarts) since being upgraded yesterday, while the other issuer-kit-api based pods have continued to encounter the restart issues.
Thanks @WadeBarnes. Let's see how it goes after the upgrades and re-assess. If it is MongoClient that loses the connection and never recreates it, we might have to look at filing an issue in the library repo.
Since deploying the updated version of the issuer-kit-api image to dev on October 11th, the api-openvp-candy pod has restarted 140 times and the api-a2a pod has only restarted once. They are both connected to the same issuer-kit-db instance. Both containers are configured the same with respect to resource allocation and health check configuration.
Pod-related events for api-openvp-candy:
There are none for api-a2a.
The database logs are full of the api-openvp-candy pod connecting and then almost immediately disconnecting. Once disconnected, the pod restarts as soon as its health check cycle fails.