issuer-kit-api - Repeated health check failures on OCP

Open WadeBarnes opened this issue 1 year ago • 9 comments

The health check for the issuer-kit API checks its connection to the database. A few instances are failing the health check repeatedly (losing the connection to their database), which causes repeated pod restarts.

Affected instances:

  • a99fd4-dev
    • api-openvp
  • a99fd4-test
    • api-openvp
    • api-bcvcpilot
  • a99fd4-prod
    • api-openvp

WadeBarnes avatar Jun 20 '23 19:06 WadeBarnes

Could be related to https://github.com/feathersjs/feathers/issues/3159

esune avatar Jun 29 '23 12:06 esune

The recent instances of this issue identified by SysDig are for agent instances that are not being moved, and neither is the database. The agent starts up just fine and at some point loses its connection to the database. In some cases, but not all, the cause may be the database pod restarting. More investigation is required.

WadeBarnes avatar Jun 29 '23 13:06 WadeBarnes

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 28 '23 18:08 stale[bot]

I've looked into this a bit more. Some of the issues were caused by the associated MongoDB instances getting OOM killed. I adjusted the resources here, and that helped a bit. I've also upgraded the issuer-kit components here. The latest versions are in dev and seem to have reduced the number of pod restarts.

WadeBarnes avatar Oct 12 '23 16:10 WadeBarnes

The issues are related to this code:

  • https://github.com/bcgov/issuer-kit/blob/9a0e22948faa19f0bf090383fcd9acfef51f1a36/api/src/app.ts#L36-L40
  • https://github.com/bcgov/issuer-kit/blob/9a0e22948faa19f0bf090383fcd9acfef51f1a36/api/src/mongodb.ts#L13-L27

The pod seems to lose its connection to the database after a while. Once that happens it never reconnects, so the health checks keep failing until the pod is restarted.
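
One possible mitigation would be a health check that tries to reconnect before reporting failure. The sketch below is not the code linked above. `DbConnection`, `HealthResult`, and `checkHealth` are hypothetical names, and the interface stands in for whatever database client the API actually holds.

```typescript
// Hypothetical minimal interface; it stands in for the real database client.
interface DbConnection {
  ping(): Promise<void>;    // rejects when the connection is unusable
  connect(): Promise<void>; // (re)establishes the connection
}

interface HealthResult {
  healthy: boolean;     // true if the database answered a ping
  reconnected: boolean; // true if a reconnect was needed to get there
}

// Ping the database. If the ping fails, try to reconnect up to
// maxAttempts times before reporting the instance as unhealthy.
async function checkHealth(
  db: DbConnection,
  maxAttempts = 3
): Promise<HealthResult> {
  try {
    await db.ping();
    return { healthy: true, reconnected: false };
  } catch {
    // Connection is down; fall through to the reconnect loop.
  }

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await db.connect();
      await db.ping();
      return { healthy: true, reconnected: true };
    } catch {
      // Reconnect failed; try again until the attempts run out.
    }
  }
  return { healthy: false, reconnected: false };
}
```

With something like this in place, the probe would only fail (and trigger a restart) when the database is genuinely unreachable, not merely because a stale connection was dropped.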

WadeBarnes avatar Oct 12 '23 16:10 WadeBarnes

The easiest way to find the affected pods is to look at the list of pods and sort by Restarts, then pick out the ones at the top of the list that are using the issuer-kit-api image.

  • https://console.apps.silver.devops.gov.bc.ca/k8s/ns/a99fd4-dev/core~v1~Pod?orderBy=desc&sortBy=Restarts
  • https://console.apps.silver.devops.gov.bc.ca/k8s/ns/a99fd4-test/core~v1~Pod?orderBy=desc&sortBy=Restarts
  • https://console.apps.silver.devops.gov.bc.ca/k8s/ns/a99fd4-prod/core~v1~Pod?orderBy=desc&sortBy=Restarts

Pods:

  • api-openvp
  • api-openvp-candy
  • api-bcvcpilot
  • api-a2a
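
The manual steps above (sort by restarts, then pick out the issuer-kit-api pods) could also be scripted against a pod list. This is only a sketch: `PodSummary` is a hand-trimmed subset of the Kubernetes Pod object, and `totalRestarts` and `mostRestarted` are hypothetical helper names.

```typescript
// Hand-trimmed subset of the Kubernetes Pod object (assumption: only
// these fields are needed for this purpose).
interface PodSummary {
  metadata: { name: string };
  spec: { containers: { image: string }[] };
  status: { containerStatuses?: { restartCount: number }[] };
}

// Sum the restart counts across all containers in a pod.
function totalRestarts(pod: PodSummary): number {
  return (pod.status.containerStatuses || []).reduce(
    (sum, c) => sum + c.restartCount,
    0
  );
}

// Keep pods running an image whose name contains imageFragment,
// ordered from most to fewest restarts. The input array is not mutated.
function mostRestarted(pods: PodSummary[], imageFragment: string): PodSummary[] {
  return pods
    .filter((p) =>
      p.spec.containers.some((c) => c.image.indexOf(imageFragment) !== -1)
    )
    .sort((a, b) => totalRestarts(b) - totalRestarts(a));
}
```

Calling `mostRestarted(pods, "issuer-kit-api")` on the items of a pod list would put the worst offenders first.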

WadeBarnes avatar Oct 12 '23 16:10 WadeBarnes

Note the agent-a2a and api-openvp pods in dev have been running fine (without restarts) since being upgraded yesterday, while the other issuer-kit-api based pods have continued to encounter the restart issues.

WadeBarnes avatar Oct 12 '23 16:10 WadeBarnes

Thanks @WadeBarnes. Let's see how it goes after the upgrades and re-assess. If it is MongoClient that loses the connection and never recreates it, we might have to look at filing an issue in the library repo, potentially...

esune avatar Oct 13 '23 17:10 esune

Since deploying the updated version of the issuer-kit-api image to dev on October 11th, the api-openvp-candy pod has restarted 140 times and the api-a2a pod has restarted only once. They are both connected to the same issuer-kit-db instance. Both containers are configured the same with respect to resource allocation and health check configuration.

Pod-related events for api-openvp-candy: [image]

There are none for api-a2a

The database logs are full of the api-openvp-candy pod connecting and then almost immediately disconnecting. Once disconnected, the pod restarts as soon as its health check cycle fails.

WadeBarnes avatar Oct 16 '23 17:10 WadeBarnes