pm2 reload downtime in cluster mode

Open brendonboshell opened this issue 8 years ago • 17 comments

Please see the issue I reported previously, "pm2 reload causing connection timeout and downtime": using pm2 reload causes connection errors and downtime when running a process in cluster mode. Since I have been stuck running 1.0.1 in production for some time, I was motivated to track down the cause. The issue still appears in the latest version, 2.6.1.

Using git bisect I have been able to track this issue down to commit d0a3f49 "(god)(stopProcessId) refactor: now it only kill process without disconnecting in cluster mode"

To reproduce

  1. Create a basic server, server.js:
var http = require("http"),
    app = require("express")();

// Respond 404 to every request, on every path.
app.use("/", function (req, res) {
  return res.send(404);
});

var server = http.createServer(app);
server.listen(4000, function () {
});
  2. Run ./bin/pm2 --no-daemon on master/2.6.1.

  3. Run ../../pm2/bin/pm2 start server.js -i 2 --name api

  4. Run ab -n 100000 -c 1 http://127.0.0.1:4000/v1/

  5. While ab is running, reload the app 20 times, 2 seconds apart:

for i in $(seq 1 20); do ../../pm2/bin/pm2 reload api && sleep 2; done

  6. Observe the following output:

Benchmarking 127.0.0.1 (be patient)
apr_pollset_poll: The timeout specified has expired (70007)
Total of 1407 requests completed

  7. Run git revert d0a3f49 and observe that ab completes without error. (A consolidated repro script is sketched below.)
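The steps above can also be bundled into a single script. A minimal sketch, assuming pm2 and ab are on the PATH and server.js is in the current directory:

#!/bin/bash
# Reproduction sketch: hammer the server with ab while reloading the cluster.
pm2 start server.js -i 2 --name api

# Run the benchmark in the background.
ab -n 100000 -c 1 http://127.0.0.1:4000/v1/ &
AB_PID=$!

# Reload the cluster 20 times, 2 seconds apart.
for i in $(seq 1 20); do
  pm2 reload api
  sleep 2
done

# Wait for ab; it aborts with an error (as in the output above) when a connection times out.
wait $AB_PID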

I have also reproduced this issue with a small monitoring script:

// Poll the server every 5 ms and exit as soon as a request fails or
// returns anything other than the expected 404.
var Promise = require("bluebird"),
    request = require("request");

setInterval(function () {
  var startDate = new Date();
  Promise.promisify(request)({
    uri: "http://127.0.0.1:4000/v1/",
    timeout: 10000,
    forever: false // open a fresh connection for every request
  }).then(function (res) {
    console.log(new Date(), res.statusCode, (new Date() - startDate));

    if (res.statusCode !== 404) {
      process.exit(0);
    }
  }).catch(function (err) {
    console.log(new Date(), err, (new Date() - startDate));
    process.exit(0);
  });
}, 5);

Solution

It appears that reverting d0a3f49 solves the issue, but I am not sure what the motivation for that change was. I have been running 2.6.1 in production for about a week and, since I regularly use pm2 reload, I have noticed a number of connection errors in my nginx logs. I suspect this issue is related to the above.

Update (10 Sep)

I have downgraded to PM2 2.1.1 overnight on my production machine. See this chart from my status server for the past 24 hours: I run pm2 reload every hour, and you can clearly see many periods of downtime while running PM2 2.6.1 and no downtime while running PM2 2.1.1.

[chart: PM2 2.1.1 vs PM2 2.6.1]

brendonboshell avatar Sep 10 '17 00:09 brendonboshell

Could you use this snippet (from the docs):

var http = require("http"),
    app = require("express")();

app.use("/", function (req, res) {
  return res.send(404);
});

var server = http.createServer(app);
server.listen(4000, function () {
   process.send('ready') // tell pm2 this worker is ready for traffic
});

process.on('SIGINT', function() {
   server.close(function(err) { // stop accepting new connections; in-flight ones finish first
     process.exit(err ? 1 : 0);
   });
});

And then start it with pm2 start server.js -i 2 --name api --wait-ready
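With --wait-ready, pm2 waits for the process.send('ready') message before treating a freshly started worker as online during a reload, and the SIGINT handler gives the outgoing worker a chance to close its listener and finish in-flight requests before exiting.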

vmarchaud avatar Sep 10 '17 14:09 vmarchaud

I have tried the above snippet and still experience the same issue.

brendonboshell avatar Sep 10 '17 15:09 brendonboshell

Hello, I have the same bug. It used to work, but now I am on Node v8.5.0 and pm2 2.7.2.

Graceful reload is not doing its job; my processes are reloaded at the same time.

// Simulate a slow startup: signal readiness only after 5 seconds.
setTimeout(function() {
	console.log('ready ' + new Date());
	process.send('ready');
}, 5000);

// Log the shutdown signal but never exit, so pm2 must force-kill this worker.
process.on('SIGINT', function() {
	console.log('i quit ' + new Date());
});

pm2 start test.js -i 2 --wait-ready --listen-timeout 15000

I downgraded to [email protected]. It works if i > 2; with i = 2, both instances are rebooted at the same time.
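For what it's worth, the SIGINT handler in the snippet above never exits, so pm2 has to wait out its kill timeout before replacing each worker. A variant that shuts down promptly might look like this (a sketch, not from the original comment):

process.on('SIGINT', function() {
	console.log('i quit ' + new Date());
	// Finish or abort any in-flight work here, then exit so pm2 can
	// replace this worker without waiting for the kill timeout.
	process.exit(0);
});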

laurentdebricon avatar Nov 06 '17 09:11 laurentdebricon

I was looking at some of the code in the commit that @brendonboshell referenced, and I noticed that the cb(...) call in this snippet:

if (!proc.process.pid) {
  console.error('app=%s id=%d does not have a pid', proc.pm2_env.name, proc.pm2_env.pm_id);
  proc.pm2_env.status = cst.STOPPED_STATUS;
  return cb(null, { error : true, message : 'could not kill process w/o pid'});
}

passes what looks like an error object as the second parameter. Is that correct? We usually pass the error object as the first parameter of a callback, and that pattern is followed in the other parts of the code in that file. But it might be that this isn't a "true" error condition, which is why it's being signaled as a property and passed in the second parameter.
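For comparison, the conventional error-first shape would look something like this (a hypothetical rewrite for illustration, not pm2's actual code):

if (!proc.process.pid) {
  console.error('app=%s id=%d does not have a pid', proc.pm2_env.name, proc.pm2_env.pm_id);
  proc.pm2_env.status = cst.STOPPED_STATUS;
  // Error-first convention: the failure travels in the first argument.
  return cb(new Error('could not kill process w/o pid'));
}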

guyellis avatar Nov 11 '17 03:11 guyellis

I can't get version 2.1.1 to reload with zero downtime in my environment using the same approach that @brendonboshell has used with ab.

guyellis avatar Nov 13 '17 22:11 guyellis

@laurentdebricon on version 2.1.1 I just added an extra process with pm2 scale myapp +1 and then ran a gracefulReload, and it looks like the first two instances are restarted at the same time and the third later. I still can't get Apache's ab to hammer it without lost connections. You can see below that the first two started 53s ago and the third 31s ago.

$ pm2 gracefulReload myapp
Restarts are now immutable, to update environment or conf use --update-env
[PM2] Applying action softReloadProcessId on app [myapp](ids: 0,1,2)
[PM2] [myapp](0) ✓
[PM2] [myapp](1) ✓
[PM2] [myapp](2) ✓
┌──────────┬────┬─────────┬───────┬────────┬─────────┬────────┬─────┬────────────┬──────────┐
│ App name │ id │ mode    │ pid   │ status │ restart │ uptime │ cpu │ mem        │ watching │
├──────────┼────┼─────────┼───────┼────────┼─────────┼────────┼─────┼────────────┼──────────┤
│ myapp    │ 0  │ cluster │ 27844 │ online │ 3       │ 53s    │ 0%  │ 145.0 MB   │ disabled │
│ myapp    │ 1  │ cluster │ 27850 │ online │ 3       │ 53s    │ 0%  │ 143.3 MB   │ disabled │
│ myapp    │ 2  │ cluster │ 27973 │ online │ 2       │ 31s    │ 0%  │ 144.2 MB   │ disabled │
└──────────┴────┴─────────┴───────┴────────┴─────────┴────────┴─────┴────────────┴──────────┘

guyellis avatar Nov 21 '17 20:11 guyellis

It just seems to be those first two. If I scale out to 10 instances and reload, the first two are reloaded at the same time and the rest are staggered.

┌──────────┬────┬─────────┬───────┬────────┬─────────┬────────┬─────┬────────────┬──────────┐
│ App name │ id │ mode    │ pid   │ status │ restart │ uptime │ cpu │ mem        │ watching │
├──────────┼────┼─────────┼───────┼────────┼─────────┼────────┼─────┼────────────┼──────────┤
│ myapp    │ 0  │ cluster │ 14228 │ online │ 5       │ 103s   │ 0%  │ 101.2 MB   │ disabled │
│ myapp    │ 1  │ cluster │ 14234 │ online │ 5       │ 103s   │ 0%  │ 99.2 MB    │ disabled │
│ myapp    │ 2  │ cluster │ 14315 │ online │ 3       │ 95s    │ 0%  │ 97.8 MB    │ disabled │
│ myapp    │ 3  │ cluster │ 14284 │ online │ 1       │ 99s    │ 0%  │ 99.6 MB    │ disabled │
│ myapp    │ 4  │ cluster │ 14343 │ online │ 2       │ 94s    │ 0%  │ 98.9 MB    │ disabled │
│ myapp    │ 5  │ cluster │ 14552 │ online │ 2       │ 75s    │ 0%  │ 99.8 MB    │ disabled │
│ myapp    │ 6  │ cluster │ 14598 │ online │ 2       │ 72s    │ 0%  │ 106.3 MB   │ disabled │
│ myapp    │ 7  │ cluster │ 14679 │ online │ 1       │ 61s    │ 0%  │ 142.2 MB   │ disabled │
│ myapp    │ 8  │ cluster │ 14716 │ online │ 1       │ 59s    │ 0%  │ 142.4 MB   │ disabled │
│ myapp    │ 9  │ cluster │ 14929 │ online │ 1       │ 41s    │ 0%  │ 141.9 MB   │ disabled │
└──────────┴────┴─────────┴───────┴────────┴─────────┴────────┴─────┴────────────┴──────────┘

guyellis avatar Nov 21 '17 20:11 guyellis

Bump

djbobbydrake avatar Mar 14 '18 11:03 djbobbydrake

Can confirm: 3.1.2 loses requests on reload; switching to 2.1.1 fixes the issue.

wirtsi avatar Sep 19 '18 12:09 wirtsi

@brendonboshell @laurentdebricon

I am running on 3.2.2 and I have fixed this problem for myself by using pm2 reload appname --parallel 1. The issue is that pm2 performs the reload concurrently in pairs, which is bad for anyone running only 2 instances. I created a suggestion issue to adjust how that works: https://github.com/Unitech/pm2/issues/4047. I hope this may solve your problems as well!

curtisbelt avatar Nov 29 '18 18:11 curtisbelt

When using an ecosystem.json file, I found that --parallel no longer works, and the setting is not configurable in the JSON file either. I was able to resolve this by using the environment variable PM2_CONCURRENT_ACTIONS (which appears to default to 2, matching the pairwise reloads described above), like so:

#!/bin/bash
cd /path/to/appname
# Reload one process at a time instead of the default two.
PM2_CONCURRENT_ACTIONS=1 pm2 reload ecosystem.json

Source:

https://github.com/Unitech/pm2/blob/91786108d71b3fc6c182750c09b494619e28b28a/constants.js#L88-L93

curtisbelt avatar Dec 01 '18 06:12 curtisbelt

I'm still experiencing downtime when using pm2 reload {processname}. I started the server using an ecosystem file, and pm2 is sitting behind an NGINX reverse proxy.

Most of the time when I pm2 reload myproc, a hit on the API returns 502. Only some of the time does it reload gracefully and without any downtime. What should I change?

harshmandan avatar Oct 16 '20 10:10 harshmandan

> I'm still experiencing downtime when using pm2 reload {processname}. I started the server using an ecosystem file, and pm2 is sitting behind an NGINX reverse proxy.
>
> Most of the time when I pm2 reload myproc, a hit on the API returns 502. Only some of the time does it reload gracefully and without any downtime. What should I change?

I am facing the same issue you faced. Do you have any solution?

jeffreytkj avatar May 25 '21 12:05 jeffreytkj

> I'm still experiencing downtime when using pm2 reload {processname}. I started the server using an ecosystem file, and pm2 is sitting behind an NGINX reverse proxy.
>
> Most of the time when I pm2 reload myproc, a hit on the API returns 502. Only some of the time does it reload gracefully and without any downtime. What should I change?

I am facing the same issue. Any solution?

ibraah88 avatar Aug 28 '21 10:08 ibraah88

> I'm still experiencing downtime when using pm2 reload {processname}. I started the server using an ecosystem file, and pm2 is sitting behind an NGINX reverse proxy.
>
> Most of the time when I pm2 reload myproc, a hit on the API returns 502. Only some of the time does it reload gracefully and without any downtime. What should I change?

+1, same issue here, using pm2 (5.1.2) in cluster mode, a Node.js app, and a JSON config in the app root.

nursultan156 avatar Feb 10 '22 10:02 nursultan156

Can confirm: even the "graceful restart" that is supposed to happen via max_memory_restart doesn't work well in v5.3.0. Downgrading to v2.1 removes over 80% of the "connection closed before message completed" errors I otherwise see.

strokirk avatar Apr 19 '23 14:04 strokirk

+1, same issue with pm2 (5.3.0), Next.js, and NGINX.

hiepnsx avatar Feb 28 '24 10:02 hiepnsx