terminus icon indicating copy to clipboard operation
terminus copied to clipboard

Graceful shutdown isn't working like expected

Open Lp-Francois opened this issue 11 months ago • 7 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Current behavior

This implementation is incorrect: https://github.com/nestjs/terminus/issues/2421 (to read first) & https://github.com/nestjs/terminus/pull/2422

  1. on receiving SIGTERM signal, set readiness probe to fail with 503, to tell the orchestrator to stop sending requests
  2. Wait X seconds to be sure traffic stops being forwarded to the app by Kubernetes (should match the interval of the readiness probe + few seconds, to be sure the orchestrator is aware the pod should stop receive traffic),
  3. proceed to close the webserver (process last requests if there are still some long ones running)
  4. proceed to close database connections and others connections & shutdown the app

Image

Minimum reproduction code

Load test your NestJS app running in a Kubernetes environment, and trigger a new deployment during this load test. You should notice a few failed requests.

Here is a simple example of load test you can run with k6:

cat << 'EOF' | k6 run -
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  scenarios: {
    constant_request_rate: {
      executor: 'constant-arrival-rate',
      rate: 5,                // 5 iterations per second
      timeUnit: '1s',         // 1 second
      duration: '2m',         // 2 minutes
      preAllocatedVUs: 5,     // Number of VUs to pre-allocate
      maxVUs: 10,            // Maximum number of VUs to allow if needed
    },
  },
};

export default function () {
  http.get('https://your-endpoint.com/livez');
  sleep(1);
}
EOF

Steps to reproduce

No response

Expected behavior

The expected graceful shutdown behaviour from a production-ready NestJs app should be:

  1. on receiving SIGTERM signal, ~set readiness probe to fail with 503, to tell the orchestrator to stop sending requests~
  2. Wait X seconds to be sure traffic stops being forwarded to the app by Kubernetes
  3. set readiness probe to fail with 503, to tell the orchestrator to stop sending requests
  4. proceed to close the webserver (process last requests if there are still some long ones running)
  5. proceed to close database connections and others connections & shutdown the app

Image

Therefore, if the loadbalancer is still sending a request before being aware the endpoint is removed, the requests won't we seen as failed with 502, but instead will still be processed and not lead to downtime during a rolling update.

Package version

latest

NestJS version

latest

Node.js version

latest

In which operating systems have you tested?

  • [x] macOS
  • [ ] Windows
  • [ ] Linux

Other

Resources that explains why the few seconds sleep is necessary: https://learnk8s.io/graceful-shutdown

In the meantime, simply setting a sleep to 0s in Terminus, and adding a lifecycle preStop hook to sleep X sec is enough to fix the behaviour.

Lp-Francois avatar Jan 17 '25 10:01 Lp-Francois

Would you like to create a PR? :)

BrunnerLivio avatar Jan 19 '25 08:01 BrunnerLivio

I can give it a try :) !

How can I test the updated package with one of the sample examples? Do I need to publish locally first? @BrunnerLivio

Lp-Francois avatar Jan 20 '25 10:01 Lp-Francois

up @BrunnerLivio :)

I couldn't find an answer in the README or CONTRIBUTING.md. Also seems like there are dead links to https://github.com/nestjs/terminus/blob/master/docs/DEVELOPER.md

Lp-Francois avatar Jan 21 '25 16:01 Lp-Francois

@Lp-Francois You’re right I should update the CONTRIBUTING & README. Basically, you can npm build and then npm link to make it linkable to any node project. So you can to any project and just run npm link @nestjs/terminus.

Using npm run build:all you can build all the samples, if you wanna work with the samples folder to test things. You just need to do that once. After npm build in the root of the project should suffice (it should re-link all the samples with the newly built files)

BrunnerLivio avatar Jan 21 '25 22:01 BrunnerLivio

Update: I didn't find any time to work on a PR for this issue. It might come in the future, but I cannot tell when

Lp-Francois avatar Apr 22 '25 15:04 Lp-Francois

@Lp-Francois @BrunnerLivio I created PR to address this issue

mag123c avatar Aug 07 '25 12:08 mag123c

@Lp-Francois @BrunnerLivio I created PR to address this issue

Thanks, I didn't find the time to work on it :)

Lp-Francois avatar Aug 07 '25 14:08 Lp-Francois