knox-mpu

ECONNRESET when uploading a large file.

Open drob opened this issue 10 years ago • 9 comments

I'm getting ECONNRESET errors when uploading a 350mb file with knox-mpu.

In particular:

{
    "part": 1,
    "message": {
        "code": "ECONNRESET",
        "errno": "ECONNRESET",
        "syscall": "read"
    }
}

The part specified is different each time but is always between 1 and 4. (I am using the default batchSize of 4.)

Is there any other info that would be helpful in debugging this?

drob avatar May 29 '14 19:05 drob

@drob hey man, just a guess, but have you tried:

// Max # of milliseconds client sockets (i.e. for our purposes: **requests**) should be allowed to stay connected to this particular route.
// 0 = infinite
res.setTimeout(0);

(see https://github.com/andrewrk/node-multiparty/issues/49#issuecomment-42763406 for details)

mikermcneil avatar May 31 '14 00:05 mikermcneil

I'm getting these errors as well.

Could it be caused by S3 rate limiting? See http://blog.blitline.com/post/29157492002/things-to-know-about-s3 or https://github.com/LearnBoost/knox/issues/199

luccastera avatar Jun 03 '14 22:06 luccastera

OK- so what I posted before is really the solution for a different issue, involving aborted requests (although you'll want to consider it as well). As for the issue at hand, here's the best of my understanding atm:

ECONNRESET started showing up in Node 0.10; it was stifled before that. It seems to be improved in 0.11, but it will still sometimes fire. It seems that the situation will be greatly improved in Node 0.12, but that doesn't help us now.

Anyways, ECONNRESET originates when a TCP client receives an unexpected RST signal -or- potentially (not sure on this) even if it receives a FIN before an expected ACK from an earlier SYN. Furthermore this seems to be an unavoidable result of dealing with S3, at least for the moment. This very well may be b/c of what @dambalah just pointed out:

Could it be caused by S3 rate limiting?

So nonetheless, the question becomes "how do we address it?" @sgress454 put together a workaround, for which we're going to send another PR to knox-mpu soon (hopefully by Monday at the latest). We saw promising results in a test of a 160MB file upload, and just need to take it out for a few more spins. Essentially, the reason knox is crashing on ECONNRESET is two-fold:

  1. The res stream here needs an .on('error', ...) handler.

  2. The existing .on('error', ...) handler for the knox client itself (here) needs a condition variable to make sure the callback to batch (or if you're using @dustMason's fork, async) is called only once.
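
For anyone following along, here is a minimal sketch of the two points above. It is not the actual knox-mpu patch. `watchPart` and `once` are hypothetical names, and plain EventEmitters stand in for the knox request and the S3 response so the snippet runs on its own.

```javascript
var EventEmitter = require('events').EventEmitter;

// Wrap a callback so that only the first invocation goes through.
function once(callback) {
  var called = false; // the "condition variable" from point 2
  return function () {
    if (called) return;
    called = true;
    return callback.apply(this, arguments);
  };
}

// Hypothetical helper: watches one part upload.
// `req` stands in for the knox request, which emits 'response' and 'error'.
function watchPart(req, callback) {
  var done = once(callback);
  req.on('error', done);
  req.on('response', function (res) {
    res.on('error', done); // point 1: the response stream gets its own handler
    if (res.statusCode === 200) {
      done(null, res.headers.etag);
    } else {
      done(new Error('Unexpected status ' + res.statusCode));
    }
  });
}

// Simulated run: the part succeeds, then a late ECONNRESET arrives.
var req = new EventEmitter();
var calls = [];
watchPart(req, function (err, etag) {
  calls.push(err ? (err.code || err.message) : etag);
});

var res = new EventEmitter();
res.statusCode = 200;
res.headers = { etag: '"abc123"' }; // made-up ETag for illustration
req.emit('response', res);

var late = new Error('read ECONNRESET');
late.code = 'ECONNRESET';
req.emit('error', late); // swallowed by the guard, callback is not called again

console.log(calls);
```

In this simulation the callback fires for the successful response only; the late error reaches the guard and is dropped instead of being reported to batch/async a second time.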

Btw, here's some additional background on ECONNRESET in case anyone smarter than me comes along and knows more about what's going on here :)

From http://stackoverflow.com/questions/17245881/node-js-econnreset:

"ECONNRESET" means the other side of the TCP conversation abruptly closed its end of the connection. This is most probably due to one or more application protocol errors. You could look at the API server logs to see if it complains about something.

Sources:

  • http://blog.gluwer.com/2014/03/story-of-eaddrinuse-and-econnreset-errors/
  • https://groups.google.com/forum/#!topic/nodejs/Sc-_U-aoMsU
  • https://github.com/nodejitsu/node-http-proxy/issues/579
  • https://github.com/joyent/node/issues/5542
  • https://github.com/LearnBoost/knox/issues/199#issuecomment-26233842

mikermcneil avatar Jun 07 '14 02:06 mikermcneil

  1. The res stream here needs an .on('error', ...) handler.

  2. The existing .on('error', ...) handler for the knox client itself (here) needs a condition variable to make sure the callback to batch (or if you're using @dustMason's fork, async) is called only once.

@nathanoehlman are you cool w/ merging fixes to those two things?

mikermcneil avatar Jun 07 '14 02:06 mikermcneil

@mikermcneil Thanks for looking deeply into this one! I think your 2 suggestions are spot on.

dustMason avatar Jun 07 '14 23:06 dustMason

Update: it doesn't appear that adding the .on('error') handler for the response stream prevents the ECONNRESET errors from occurring. However, our workaround involving checking that the callback is only called once was successful in handling the issue.

sgress454 avatar Jul 07 '14 23:07 sgress454

Fwiw, adding a maxRetries setting to my uploads fixed this issue for me. (That option wasn't documented when I first started using knox-mpu.)

I'm not sure how to fix the underlying issue, though, or if there even is one. (If I'm uploading a 350mb file, it's reasonable for one of the chunks to fail at some point, right?)

Is there a philosophical reason a default maxRetries of 3, e.g., might not be preferable?
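
To make the retry idea concrete, here is a rough sketch of what retrying a failed part could look like. This is not how knox-mpu implements maxRetries internally (I haven't checked how it counts attempts); `withRetries` and `flakyUpload` are hypothetical names, and the upload is simulated so the snippet is self-contained.

```javascript
// Hypothetical helper: calls task(attempt, cb) and retries on failure,
// allowing up to maxRetries additional attempts after the first one.
function withRetries(task, maxRetries, callback) {
  var attempt = 0;
  (function run() {
    task(attempt, function (err, result) {
      if (!err) return callback(null, result);
      if (attempt >= maxRetries) return callback(err); // out of retries
      attempt += 1;
      run();
    });
  })();
}

// Stand-in for a part upload whose connection is reset twice before it succeeds.
var attempts = 0;
function flakyUpload(attempt, cb) {
  attempts += 1;
  if (attempt < 2) {
    var err = new Error('read ECONNRESET');
    err.code = 'ECONNRESET';
    return cb(err);
  }
  cb(null, { part: 1, etag: '"abc123"' }); // made-up result for illustration
}

var outcome = null;
withRetries(flakyUpload, 3, function (err, result) {
  outcome = { err: err, result: result };
  console.log(err
    ? 'failed: ' + err.code
    : 'uploaded part ' + result.part + ' after ' + attempts + ' attempts');
});
```

With a budget of 3 retries the simulated part goes through on the third attempt; with a budget of 0 the first ECONNRESET would fail the whole upload, which matches what I was seeing before setting the option.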

drob avatar Jul 25 '14 01:07 drob

The underlying issue is that sometimes a chunk will upload successfully, but later send an ECONNRESET error anyway. The knox-mpu code handles this by declaring that the chunk was invalid and retrying it, or by failing altogether, when really the event should just be ignored.

sgress454 avatar Jul 25 '14 01:07 sgress454

To add to that, Node <= 0.8 didn't even use to announce these sorts of TCP errors. It has to do with unexpected packets being received after sending the FIN, e.g. if an ACK is late but still arrives, or S3 tries to give us more data than we wanted and shoots over an extra SYN or whatever.

mikermcneil avatar Jul 26 '14 12:07 mikermcneil