nest icon indicating copy to clipboard operation
nest copied to clipboard

Disgraceful shutdown, request before signal interrupted with ECONNREFUSED

Open soryy708 opened this issue 1 year ago • 25 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current behavior

When the application is shutdown (e.g. with SIGTERM shutdown hook), requests that started before shutdown fail with ECONNREFUSED.

Minimum reproduction code

https://github.com/soryy708/nestjs-disgraceful-shutdown-mcve

Steps to reproduce

  1. Install dependencies by running yarn install
  2. Run the tests by running yarn test

Expected behavior

Requests that started before shutdown should succeed, only new requests after shutdown should fail.

Package

  • [ ] I don't know. Or some 3rd-party package
  • [X] @nestjs/common
  • [X] @nestjs/core
  • [ ] @nestjs/microservices
  • [X] @nestjs/platform-express
  • [ ] @nestjs/platform-fastify
  • [ ] @nestjs/platform-socket.io
  • [ ] @nestjs/platform-ws
  • [X] @nestjs/testing
  • [ ] @nestjs/websockets
  • [ ] Other (see below)

Other package

No response

NestJS version

9.3.12

Packages versions

{
"@nestjs/axios": "^2.0.0",
"@nestjs/common": "^9.3.12",
"@nestjs/core": "^9.3.12",
"@nestjs/platform-express": "^9.3.12",
"@nestjs/terminus": "^9.2.2",
"@nestjs/testing": "^9.3.12",
}

Node.js version

16.19.0

In which operating systems have you tested?

  • [X] macOS
  • [ ] Windows
  • [ ] Linux

Other

No response

soryy708 avatar Apr 02 '23 13:04 soryy708

I believe the error lies somewhere here or in one these methods. Somewhere in one of these, the request is being stopped

Screenshot from 2023-04-05 19-49-50

josephaw1022 avatar Apr 05 '23 23:04 josephaw1022

Screenshot from 2023-04-06 15-18-11

So, I could be wrong, but what I have found is that the callDestroyHook method is iteratively calling callModuleDestroyHook and that is what is removing the modules via the shift method on the list of providers. So, what my intuition is telling me, is that the provider is being removed while the controller is utilizing it and thus the request is just being stopped. Again, I could be wrong, but I think this is a potential root cause.

josephaw1022 avatar Apr 06 '23 19:04 josephaw1022

Hello, what's the update on this issue??

Ekott2006 avatar Jun 08 '23 07:06 Ekott2006

I think that the problem lies on the express adapter itself, on the close method.

  public close() {
    this.closeOpenConnections();

    if (!this.httpServer) {
      return undefined;
    }
    return new Promise(resolve => this.httpServer.close(resolve));
  }

As you can see, this close method closes all the openConnections before calling this.httpServer.close.

As you can see in the express documentation (https://expressjs.com/en/advanced/healthcheck-graceful-shutdown.html), the close method on the httpServer is supposed to wait for all connections to be terminated. So technically I think that if we remove this.closeOpenConnections(), it would just work as expected.

https://github.com/nestjs/nest/blob/master/packages/platform-express/adapters/express-adapter.ts#L159-L166

coderhammer avatar Jun 17 '23 09:06 coderhammer

cc @tolgap 👀

@coderhammer this method call won't have any impact if forceCloseConnections is false (which is the default)

kamilmysliwiec avatar Jun 17 '23 09:06 kamilmysliwiec

Honestly I'm not sure, I just tried on a new project and it closes open connections with the default. Here is the current implementation of the closeOpenConnections method:

  private closeOpenConnections() {
    for (const socket of this.openConnections) {
      socket.destroy();
      this.openConnections.delete(socket);
    }
  }

https://github.com/nestjs/nest/blob/f676276979e651f6258b5bc42666e9aa7fe3dbf4/packages/platform-express/adapters/express-adapter.ts#LL429C1-L434C4

A quick fix for me for now is to implement my own OnModuleDestroy function to wait for all the connections to be closed. This will wait before calling nest's implementation on the http adapter.

import { Injectable, Logger, OnModuleDestroy } from '@nestjs/common';
import { HttpAdapterHost } from '@nestjs/core';

@Injectable()
export class AppService implements OnModuleDestroy {
  constructor(private readonly httpAdapter: HttpAdapterHost) {}

  async onModuleDestroy() {
    Logger.debug('Module destroyed called!');
    const httpServer = this.httpAdapter.httpAdapter.getHttpServer();
    await new Promise((resolve) => httpServer.close(resolve));
    Logger.debug('Waited for all connections to be closed!');
  }
}

EDIT: The behaviour is happening when activating app.enableShutdownHooks() I don't know if this is relevant

coderhammer avatar Jun 17 '23 09:06 coderhammer

Honestly I'm not sure, I just tried on a new project and it closes open connections with the default. Here is the current implementation of the closeOpenConnections method

this.openConnections will always be empty unless trackOpenConnections (https://github.dev/nestjs/nest/blob/da27730cd4ef37fdf7f29aafee53ed8ac2a8b4b8/packages/platform-express/adapters/express-adapter.ts#L421) is called and it's executed only if forceCloseConnections is true https://github.dev/nestjs/nest/blob/da27730cd4ef37fdf7f29aafee53ed8ac2a8b4b8/packages/platform-express/adapters/express-adapter.ts#L234

kamilmysliwiec avatar Jun 17 '23 09:06 kamilmysliwiec

I'm so sorry, it works exactly as expected when trying again on a fresh project. I was trying something with a special configuration. Have a nice day! This is not an issue for me anymore.

coderhammer avatar Jun 17 '23 10:06 coderhammer

@coderhammer what did you change? Is there a workaround?

soryy708 avatar Jun 17 '23 21:06 soryy708

Honestly, everything works fine on my side when testing manually. I tried your reproduction repo but I think that it's more a test configuration that is failing :/

coderhammer avatar Jun 18 '23 09:06 coderhammer

@coderhammer If you think the issue is with test configuration, where do you think the issue is?

I thought it could be that Test.createTestingModule from @nestjs/testing is bugged, so I added a test suite that doesn't use it and instead uses NestFactory from @nestjs/core, but the bug still happens.

I don't think it's feasible to reproduce this manually (without an automated test), because the timeframe in which this bug occurs is very small, but is enough to be experienced by users in a production system under load.

soryy708 avatar Jun 18 '23 17:06 soryy708

Also I don't think that app.enableShutdownHooks is related, because in the MCVE tests I linked if I comment out enableShutdownHooks it still fails with the same error. (clearly that means my MCVE wasn't minimal. Whoops!)

soryy708 avatar Jun 18 '23 17:06 soryy708

After investigating a little bit on your test repo, it seems to me that the problem comes from supertest itself. I'll try to make a reproduction later this week

coderhammer avatar Jun 19 '23 18:06 coderhammer

@coderhammer Looks like Supertest may be related, but is not the only issue. I added to the MCVE a suite that does the same tests but without Supertest.

The same test case (shutting down while a request is in progress) fails on ECONNREFUSED.

However so does making a request after the server shut down (about which I don't care if it's 503 or ECONNREFUSED, the server finished shutting down).

soryy708 avatar Jun 19 '23 20:06 soryy708

@soryy708 could you spin up a pure express application (non-NestJS-driven) just to see what's the original behavior for your test?

kamilmysliwiec avatar Jun 20 '23 06:06 kamilmysliwiec

With the simplest possible Express application (without even Terminus) it acts the same as with Nest not using Supertest.

const express = require('express');

function createServer() {
  const app = express();
  app.get('/health', async (_req, res) => {
    res.status(200).send();
  })
  return new Promise((resolve, reject) => {
    const server = app.listen(0, error => {
      if (error) {
        reject(error);
        return;
      }
      resolve([server.address().port, () => server.close()]);
    });
  })
}

describe("API main (no Supertest)", () => {
  describe("Lifecycle", () => {
    let port;
    let close;

    beforeEach(async () => {
      [port, close] = await createServer();
    });

    afterEach(async () => {
      await close();
    });

    const probeReadiness = async () => {
      const request = new Request(`http://localhost:${port}/health`, { method: 'GET' });
      const response = await fetch(request);
      return { statusCode: response.status };
    }

    describe("When bootstrapped", () => {
      describe("When running", () => {
        it("Should respond successfully", async () => {
          const response = await probeReadiness();
          expect(response.statusCode).toBe(200);
        });
      });

      describe("When shutting down", () => {
        describe("When a request is already in progress", () => {
          it("Should respond successfully", async () => {
            const responsePromise = probeReadiness();
            const closePromise = close();
            const [response] = await Promise.all([
              responsePromise,
              closePromise,
            ]);
            expect(response.statusCode).toBe(200);
          });
        });

        describe("When a new request is made", () => {
          it("Should respond with 503 Service Unavailable", async () => {
            await close();
            const response = await probeReadiness();
            expect(response.statusCode).toBe(503);
          });
        });
      });
    });
  });
});

Result:

 FAIL  ./index.test.js
  Lifecycle
    When bootstrapped
      When running
        √ Should respond successfully (49 ms)
      When shutting down
        When a request is already in progress
          × Should respond successfully (8 ms)
        When a new request is made
          × Should respond with 503 Service Unavailable (3 ms)

Both on ECONNREFUSED.

I expect that if I use Terminus, it will postpone Express'es shutdown long enough to flush the in-progress requests.

soryy708 avatar Jun 20 '23 18:06 soryy708

If this behavior can be reproduced with the pure express application, then I don't think there's anything we can/should do given we don't really want to alter the default driver's behavior. Perhaps reporting this issue in the terminus package would make more sense then?

kamilmysliwiec avatar Jun 21 '23 08:06 kamilmysliwiec

Are we sure that it's an issue with the Terminus package?

I'm opening an issue in https://github.com/nestjs/terminus.

soryy708 avatar Jun 21 '23 10:06 soryy708

Given that this is the default behavior of HTTP drivers we rely on (Nest is built upon), I wouldn't even call this an "issue" but rather a "feature request". Whether it should be implemented & exposed from the terminus package, or an entirely new package that extends the default driver and adds some extra functionality on top, is a different question.

kamilmysliwiec avatar Jun 21 '23 11:06 kamilmysliwiec

Whatever you want to call it.

It's the default behavior of Express without anything, because Express is not a framework like Nest is.

Express docs tell you that if you want graceful shutdown, you need to add another thing to Express (e.g. @godaddy/terminus, which @nestjs/terminus shares a name with but doesn't depend on).

https://expressjs.com/en/advanced/healthcheck-graceful-shutdown.html

The idea is to observe for SIGINT signals, and when it happens, to have the health check respond with failures, and delay Express'es server.close() until after all pending requests were handled.

soryy708 avatar Jun 21 '23 11:06 soryy708

It's the default behavior of Express without anything, because Express is not a framework like Nest is.

This doesn't imply we should be changing the driver's default behavior, especially by default.

Express docs tell you that if you want graceful shutdown, you need to add another thing to Express (e.g. @godaddy/terminus, which @nestjs/terminus shares a name with but doesn't depend on).

So I'd say exposing this feature from the @nestjs/terminus package would make sense.

kamilmysliwiec avatar Jun 21 '23 11:06 kamilmysliwiec

@kamilmysliwiec I have done some investigation on this.

It seems Fastify has an "option" for handling things like this. They introduce an option called return503OnClose: https://github.com/fastify/fastify/blob/main/lib/route.js#L437-L449

Then before Fastify performs the close logic, they return a 503 on any open request in their route handler.

I can see how this issue pops up for regular health checks. But this can also be useful for SSE! I think we should consider keeping this logic in @nestjs/platform-express, but gate it behind an option. Just like Fastify does. We can then introduce this option on the application level. But if you're of the opinion that we should handle this in @nestjs/terminus, we can keep it that way!

tolgap avatar Jun 21 '23 11:06 tolgap

I can see how this issue pops up for regular health checks. But this can also be useful for SSE! I think we should consider keeping this logic in @nestjs/platform-express, but gate it behind an option. Just like Fastify does. We can then introduce this option on the application level. But if you're of the opinion that we should handle this in @nestjs/terminus, we can keep it that way!

Either way works for me. Since it can be useful for SSE too (and we expose abstractions for SSE from the core) then having it in the platform-specific packages would make sense too!

kamilmysliwiec avatar Jun 21 '23 11:06 kamilmysliwiec

Hey @kamilmysliwiec Do you have an ETA for this feature request?

cmdlucas avatar Jan 26 '24 16:01 cmdlucas

Have you tried using this package https://www.npmjs.com/package/nestjs-graceful-shutdown?

kamilmysliwiec avatar Jan 29 '24 08:01 kamilmysliwiec