knox-mpu
ECONNRESET when uploading a large file.
I'm getting ECONNRESET errors when uploading a 350mb file with knox-mpu.
In particular:
{
"part": 1,
"message": {
"code": "ECONNRESET",
"errno": "ECONNRESET",
"syscall": "read"
}
}
The part specified is different each time but is always between 1 and 4. (I am using the default batchSize of 4.)
Is there any other info that would be helpful in debugging this?
@drob hey man, just a guess, but have you tried:
// Max # of milliseconds client sockets (i.e. for our purposes: **requests**) should be allowed to stay connected to this particular route.
// 0 = infinite
res.setTimeout(0);
(see https://github.com/andrewrk/node-multiparty/issues/49#issuecomment-42763406 for details)
I'm getting these errors as well.
Could it be caused by S3 rate limiting? See http://blog.blitline.com/post/29157492002/things-to-know-about-s3 or https://github.com/LearnBoost/knox/issues/199
OK- so what I posted before is really the solution for a different issue, involving aborted requests (although you'll want to consider it as well). As for the issue at hand, here's the best of my understanding atm:
ECONNRESET started showing up in Node 0.10; it was stifled before that. It seems to be improved in 0.11, but it will still sometimes fire. It seems that the situation will be greatly improved in Node v0.12, but that doesn't help us now.
Anyways, ECONNRESET originates when a TCP client receives an unexpected RST signal -or- potentially (not sure on this) even if it receives a FIN before an expected ACK from an earlier SYN. Furthermore this seems to be an unavoidable result of dealing with S3, at least for the moment. This very well may be b/c of what @dambalah just pointed out:
Could it be caused by S3 rate limiting?
So nonetheless, the question becomes "how do we address it?" @sgress454 put together a workaround, for which we're going to send another PR to knox-mpu soon (hopefully by Monday at the latest). We saw promising results in a test of a 160MB file upload, and just need to take it out for a few more spins. Essentially, the reason knox is crashing on ECONNRESET is two-fold:
- The res stream here needs an .on('error', ...) handler.
- The existing .on('error', ...) handler for the knox client itself (here) needs a condition variable to make sure the callback to batch (or, if you're using @dustMason's fork, async) is called only once.
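To make the two points above concrete, here is a minimal sketch of the "call the callback only once" idea. It is not the actual knox-mpu patch: callOnce and uploadPart are hypothetical names, and plain EventEmitters stand in for the real request and response streams.

```javascript
var EventEmitter = require('events').EventEmitter;

// Wrap a Node-style callback so that only the first invocation goes through.
// Later calls (e.g. a late ECONNRESET after a part already finished) are dropped.
function callOnce(callback) {
  var called = false;
  return function () {
    if (called) return;
    called = true;
    callback.apply(null, arguments);
  };
}

// Hypothetical stand-in for uploading one part: `req` plays the role of the
// client request and `res` the response stream. Both get an 'error' handler,
// and every path funnels into the same guarded callback.
function uploadPart(req, res, callback) {
  var done = callOnce(callback);
  req.on('error', done);
  res.on('error', done);
  res.on('end', function () { done(null, { part: 1 }); });
}

// Simulate a part that completes and then receives a late ECONNRESET.
var req = new EventEmitter();
var res = new EventEmitter();
var calls = [];
uploadPart(req, res, function (err, result) {
  calls.push(err ? err.code : 'ok');
});
res.emit('end');
var late = new Error('socket hang up');
late.code = 'ECONNRESET';
req.emit('error', late);
```

With the guard in place, the late error is swallowed instead of invoking the batch (or async) callback a second time.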
Btw, here's some additional background on ECONNRESET in case anyone smarter than me comes along and knows more about what's going on here :)
From http://stackoverflow.com/questions/17245881/node-js-econnreset:
"ECONNRESET" means the other side of the TCP conversation abruptly closed its end of the connection. This is most probably due to one or more application protocol errors. You could look at the API server logs to see if it complains about something.
Sources:
- http://blog.gluwer.com/2014/03/story-of-eaddrinuse-and-econnreset-errors/
- https://groups.google.com/forum/#!topic/nodejs/Sc-_U-aoMsU
- https://github.com/nodejitsu/node-http-proxy/issues/579
- https://github.com/joyent/node/issues/5542
- https://github.com/LearnBoost/knox/issues/199#issuecomment-26233842
The res stream here needs an .on('error', ...) handler.
The existing .on('error', ...) handler for the knox client itself (here) needs a condition variable to make sure the callback to batch (or if you're using @dustMason's fork, async) is called only once.
@nathanoehlman are you cool w/ merging fixes to those two things?
@mikermcneil Thanks for looking deeply into this one! I think your 2 suggestions are spot on.
Update: it doesn't appear that adding the .on('error') handler for the response stream prevents the ECONNRESET errors from occurring. However, our workaround involving checking that the callback is only called once was successful in handling the issue.
Fwiw, adding a maxRetries setting to my uploads fixed this issue for me. (That option wasn't documented when I first started using knox-mpu.)
I'm not sure how to fix the underlying issue, though, or if there even is one. (If I'm uploading a 350mb file, it's reasonable for one of the chunks to fail at some point, right?)
Is there a philosophical reason a default maxRetries of 3, e.g., might not be preferable?
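For anyone unsure what a maxRetries-style setting buys you, the general idea is a bounded retry around each part. The sketch below is only an illustration of that idea, not knox-mpu's implementation; retryTask and flakyUpload are hypothetical names, and the simulated failure count is made up.

```javascript
// Run `task` and retry it up to `maxRetries` additional times on failure.
// `task` takes a Node-style callback: task(function (err, result) { ... }).
function retryTask(maxRetries, task, callback) {
  var attempt = 0;
  function run() {
    attempt += 1;
    task(function (err, result) {
      // Retry only while we still have retries left; otherwise report back.
      if (err && attempt <= maxRetries) return run();
      callback(err, result, attempt);
    });
  }
  run();
}

// Hypothetical part upload that fails twice with ECONNRESET, then succeeds.
var failuresLeft = 2;
function flakyUpload(cb) {
  if (failuresLeft > 0) {
    failuresLeft -= 1;
    var err = new Error('read ECONNRESET');
    err.code = 'ECONNRESET';
    return cb(err);
  }
  cb(null, { part: 1 });
}

var outcome;
retryTask(3, flakyUpload, function (err, result, attempts) {
  outcome = { err: err, result: result, attempts: attempts };
});
```

Note that a retry like this masks a failed part but does nothing about an error that arrives after a part has already succeeded.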
The underlying issue is that sometimes a chunk will upload successfully, but later send an ECONNRESET error anyway. The knox-mpu code handles this by declaring that the chunk was invalid and retrying it, or by failing altogether, when really the event should just be ignored.
To add to that, node <= 0.8 didn't even announce these sorts of TCP errors. It has to do with unexpected packets being received after sending the FIN, e.g. if an ACK is late but still arrives, or S3 tries to give us more data than we wanted and shoots over an extra SYN or whatever.
On Jul 24, 2014, at 20:40, sgress454 [email protected] wrote:
The underlying issue is that sometimes a chunk will upload successfully, but later send an ECONNRESET error anyway. The knox-mpu code handles this by declaring that the chunk was invalid and retrying it, or by failing altogether, when really the event should just be ignored.