pm2 reload downtime in cluster mode

Open brendonboshell opened this issue 8 years ago • 17 comments

Please see the issue I reported previously, "pm2 reload causing connection timeout and downtime": using pm2 reload causes connection errors and downtime when running a process in cluster mode. Since I have been stuck running 1.0.1 in production for some time, I was motivated to track down the cause. The issue still appears in the latest version, 2.6.1.

Using git bisect I have been able to track this issue down to commit d0a3f49 "(god)(stopProcessId) refactor: now it only kill process without disconnecting in cluster mode"

To reproduce

  1. Create a basic server, server.js:
var http = require("http"),
    app = require("express")();

// Respond 404 to every request, on every path.
app.use("/", function (req, res) {
  return res.send(404);
});

var server = http.createServer(app);
server.listen(4000, function () {
});
  2. Run ./bin/pm2 --no-daemon on master/2.6.1.

  3. Run ../../pm2/bin/pm2 start server.js -i 2 --name api

  4. Run ab -n 100000 -c 1 http://127.0.0.1:4000/v1/

  5. While ab is running, reload the app 20 times, 2 seconds apart:

for i in $(seq 1 20); do ../../pm2/bin/pm2 reload api && sleep 2; done

  6. Observe the following output:

Benchmarking 127.0.0.1 (be patient)
apr_pollset_poll: The timeout specified has expired (70007)
Total of 1407 requests completed

  7. Run git revert d0a3f49 and observe that ab completes without error. (A consolidated repro script is sketched below.)
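The steps above can also be bundled into a single script. A minimal sketch, assuming pm2 and ab are on the PATH and server.js is in the current directory:

#!/bin/bash
# Reproduction sketch: hammer the server with ab while reloading the cluster.
pm2 start server.js -i 2 --name api

# Run the benchmark in the background.
ab -n 100000 -c 1 http://127.0.0.1:4000/v1/ &
AB_PID=$!

# Reload the cluster 20 times, 2 seconds apart.
for i in $(seq 1 20); do
  pm2 reload api
  sleep 2
done

# Wait for ab; it aborts with an error (as in the output above) when a connection times out.
wait $AB_PID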

I have also reproduced this issue with a small monitoring script:

// Poll the server every 5 ms and exit as soon as a request fails or
// returns anything other than the expected 404.
var Promise = require("bluebird"),
    request = require("request");

setInterval(function () {
  var startDate = new Date();
  Promise.promisify(request)({
    uri: "http://127.0.0.1:4000/v1/",
    timeout: 10000,
    forever: false // open a fresh connection for every request
  }).then(function (res) {
    console.log(new Date(), res.statusCode, (new Date() - startDate));

    if (res.statusCode !== 404) {
      process.exit(0);
    }
  }).catch(function (err) {
    console.log(new Date(), err, (new Date() - startDate));
    process.exit(0);
  });
}, 5);

Solution

It appears that reverting d0a3f49 solves the issue, but I am not sure what the motivation for that change was. I have been running 2.6.1 in production for about a week and, since I regularly use pm2 reload, I have noticed a number of connection errors in my nginx logs. I suspect this issue is related to the above.

Update (10 Sep)

I have downgraded to PM2 2.1.1 overnight on my production machine. See this chart from my status server for the past 24 hours: I run pm2 reload every hour, and you can clearly see many periods of downtime while running PM2 2.6.1 and no downtime while running PM2 2.1.1.

[chart: PM2 2.1.1 vs PM2 2.6.1]

brendonboshell avatar Sep 10 '17 00:09 brendonboshell

Could you use this snippet (from the docs):

var http = require("http"),
    app = require("express")();

app.use("/", function (req, res) {
  return res.send(404);
});

var server = http.createServer(app);
server.listen(4000, function () {
   process.send('ready') // tell pm2 this worker is ready for traffic
});

process.on('SIGINT', function() {
   server.close(function(err) { // stop accepting new connections; in-flight ones finish first
     process.exit(err ? 1 : 0);
   });
});

And then start it with pm2 start server.js -i 2 --name api --wait-ready
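With --wait-ready, pm2 waits for the process.send('ready') message before treating a freshly started worker as online during a reload, and the SIGINT handler gives the outgoing worker a chance to close its listener and finish in-flight requests before exiting.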

vmarchaud avatar Sep 10 '17 14:09 vmarchaud

I have tried the above snippet and still experience the same issue.

brendonboshell avatar Sep 10 '17 15:09 brendonboshell

Hello, I have the same bug. It used to work, but now I am on Node v8.5.0 and pm2 2.7.2.

Graceful reload is not doing its job; my processes are reloaded at the same time.

// Simulate a slow startup: signal readiness only after 5 seconds.
setTimeout(function() {
	console.log('ready ' + new Date());
	process.send('ready');
}, 5000);

// Log the shutdown signal but never exit, so pm2 must force-kill this worker.
process.on('SIGINT', function() {
	console.log('i quit ' + new Date());
});

pm2 start test.js -i 2 --wait-ready --listen-timeout 15000

I downgraded to [email protected]. It works if i > 2; with i = 2, both instances are rebooted at the same time.
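For what it's worth, the SIGINT handler in the snippet above never exits, so pm2 has to wait out its kill timeout before replacing each worker. A variant that shuts down promptly might look like this (a sketch, not from the original comment):

process.on('SIGINT', function() {
	console.log('i quit ' + new Date());
	// Finish or abort any in-flight work here, then exit so pm2 can
	// replace this worker without waiting for the kill timeout.
	process.exit(0);
});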

laurentdebricon avatar Nov 06 '17 09:11 laurentdebricon

I was looking at some of the code in the commit that @brendonboshell referenced, and I noticed that the cb(...) call in this snippet:

if (!proc.process.pid) {
  console.error('app=%s id=%d does not have a pid', proc.pm2_env.name, proc.pm2_env.pm_id);
  proc.pm2_env.status = cst.STOPPED_STATUS;
  return cb(null, { error : true, message : 'could not kill process w/o pid'});
}

passes what looks like an error object as the second parameter. Is that correct? We usually pass the error object as the first parameter of a callback, and that pattern is followed in the other parts of the code in that file. But it might be that this isn't a "true" error condition, which is why it's being signaled as a property and passed in the second parameter.
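For comparison, the conventional error-first shape would look something like this (a hypothetical rewrite for illustration, not pm2's actual code):

if (!proc.process.pid) {
  console.error('app=%s id=%d does not have a pid', proc.pm2_env.name, proc.pm2_env.pm_id);
  proc.pm2_env.status = cst.STOPPED_STATUS;
  // Error-first convention: the failure travels in the first argument.
  return cb(new Error('could not kill process w/o pid'));
}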

guyellis avatar Nov 11 '17 03:11 guyellis

I can't get version 2.1.1 to reload with zero downtime in my environment using the same approach that @brendonboshell has used with ab.

guyellis avatar Nov 13 '17 22:11 guyellis

@laurentdebricon on version 2.1.1 I just added an extra process with pm2 scale myapp +1 and then ran a gracefulReload, and it looks like the first two instances are restarted at the same time and the third later. I still can't get Apache's ab to hammer it without lost connections. You can see below that the first two started 53s ago and the third 31s ago.

$ pm2 gracefulReload myapp
Restarts are now immutable, to update environment or conf use --update-env
[PM2] Applying action softReloadProcessId on app [myapp](ids: 0,1,2)
[PM2] [myapp](0) ✓
[PM2] [myapp](1) ✓
[PM2] [myapp](2) ✓
┌──────────┬────┬─────────┬───────┬────────┬─────────┬────────┬─────┬────────────┬──────────┐
│ App name │ id │ mode    │ pid   │ status │ restart │ uptime │ cpu │ mem        │ watching │
├──────────┼────┼─────────┼───────┼────────┼─────────┼────────┼─────┼────────────┼──────────┤
│ myapp    │ 0  │ cluster │ 27844 │ online │ 3       │ 53s    │ 0%  │ 145.0 MB   │ disabled │
│ myapp    │ 1  │ cluster │ 27850 │ online │ 3       │ 53s    │ 0%  │ 143.3 MB   │ disabled │
│ myapp    │ 2  │ cluster │ 27973 │ online │ 2       │ 31s    │ 0%  │ 144.2 MB   │ disabled │
└──────────┴────┴─────────┴───────┴────────┴─────────┴────────┴─────┴────────────┴──────────┘

guyellis avatar Nov 21 '17 20:11 guyellis

It just seems to be those first two. If I scale out to 10 instances and reload, the first two are reloaded at the same time and the rest are staggered.

┌──────────┬────┬─────────┬───────┬────────┬─────────┬────────┬─────┬────────────┬──────────┐
│ App name │ id │ mode    │ pid   │ status │ restart │ uptime │ cpu │ mem        │ watching │
├──────────┼────┼─────────┼───────┼────────┼─────────┼────────┼─────┼────────────┼──────────┤
│ myapp    │ 0  │ cluster │ 14228 │ online │ 5       │ 103s   │ 0%  │ 101.2 MB   │ disabled │
│ myapp    │ 1  │ cluster │ 14234 │ online │ 5       │ 103s   │ 0%  │ 99.2 MB    │ disabled │
│ myapp    │ 2  │ cluster │ 14315 │ online │ 3       │ 95s    │ 0%  │ 97.8 MB    │ disabled │
│ myapp    │ 3  │ cluster │ 14284 │ online │ 1       │ 99s    │ 0%  │ 99.6 MB    │ disabled │
│ myapp    │ 4  │ cluster │ 14343 │ online │ 2       │ 94s    │ 0%  │ 98.9 MB    │ disabled │
│ myapp    │ 5  │ cluster │ 14552 │ online │ 2       │ 75s    │ 0%  │ 99.8 MB    │ disabled │
│ myapp    │ 6  │ cluster │ 14598 │ online │ 2       │ 72s    │ 0%  │ 106.3 MB   │ disabled │
│ myapp    │ 7  │ cluster │ 14679 │ online │ 1       │ 61s    │ 0%  │ 142.2 MB   │ disabled │
│ myapp    │ 8  │ cluster │ 14716 │ online │ 1       │ 59s    │ 0%  │ 142.4 MB   │ disabled │
│ myapp    │ 9  │ cluster │ 14929 │ online │ 1       │ 41s    │ 0%  │ 141.9 MB   │ disabled │
└──────────┴────┴─────────┴───────┴────────┴─────────┴────────┴─────┴────────────┴──────────┘

guyellis avatar Nov 21 '17 20:11 guyellis

Bump

djbobbydrake avatar Mar 14 '18 11:03 djbobbydrake

Can confirm: 3.1.2 loses requests on reload; switching to 2.1.1 fixes the issue.

wirtsi avatar Sep 19 '18 12:09 wirtsi

@brendonboshell @laurentdebricon

I am running on 3.2.2 and I have fixed this problem for myself by using pm2 reload appname --parallel 1. The issue is that pm2 performs the reload concurrently in pairs, which is bad for anyone running only 2 instances. I created a suggestion issue to adjust how that works: https://github.com/Unitech/pm2/issues/4047. I hope this may solve your problems as well!

curtisbelt avatar Nov 29 '18 18:11 curtisbelt

When using an ecosystem.json file, I found that --parallel no longer works, and the setting is not configurable in the JSON file either. I was able to resolve this by using the environment variable PM2_CONCURRENT_ACTIONS (which appears to default to 2, matching the pairwise reloads described above), like so:

#!/bin/bash
cd /path/to/appname
# Reload one process at a time instead of the default two.
PM2_CONCURRENT_ACTIONS=1 pm2 reload ecosystem.json

Source:

https://github.com/Unitech/pm2/blob/91786108d71b3fc6c182750c09b494619e28b28a/constants.js#L88-L93

curtisbelt avatar Dec 01 '18 06:12 curtisbelt

I'm still experiencing downtime when using pm2 reload {processname}. I started the server using an ecosystem file, and pm2 is sitting behind an NGINX reverse proxy.

Most of the time when I pm2 reload myproc, a hit on the API returns 502. Only some of the time does it reload gracefully and without any downtime. What should I change?

harshmandan avatar Oct 16 '20 10:10 harshmandan

> I'm still experiencing downtime when using pm2 reload {processname}. I started the server using an ecosystem file, and pm2 is sitting behind an NGINX reverse proxy.
>
> Most of the time when I pm2 reload myproc, a hit on the API returns 502. Only some of the time does it reload gracefully and without any downtime. What should I change?

I am facing the same issue you faced. Do you have any solution?

jeffreytkj avatar May 25 '21 12:05 jeffreytkj

> I'm still experiencing downtime when using pm2 reload {processname}. I started the server using an ecosystem file, and pm2 is sitting behind an NGINX reverse proxy.
>
> Most of the time when I pm2 reload myproc, a hit on the API returns 502. Only some of the time does it reload gracefully and without any downtime. What should I change?

I am facing the same issue. Any solution?

ibraah88 avatar Aug 28 '21 10:08 ibraah88

> I'm still experiencing downtime when using pm2 reload {processname}. I started the server using an ecosystem file, and pm2 is sitting behind an NGINX reverse proxy.
>
> Most of the time when I pm2 reload myproc, a hit on the API returns 502. Only some of the time does it reload gracefully and without any downtime. What should I change?

+1, same issue here, using pm2 (5.1.2) in cluster mode, a Node.js app, and a JSON config in the app root.

nursultan156 avatar Feb 10 '22 10:02 nursultan156

Can confirm: even the "graceful restart" that is supposed to happen via max_memory_restart doesn't work well in v5.3.0. Downgrading to v2.1 removes over 80% of the "connection closed before message completed" errors I otherwise see.

strokirk avatar Apr 19 '23 14:04 strokirk

+1, same issue with pm2 (5.3.0), Next.js, and NGINX.

hiepnsx avatar Feb 28 '24 10:02 hiepnsx