S3Reader current behavior is unreliable
We're hitting a bug with S3Reader that happens when files are big-ish and the destination is a bit slow.
While it's uploading the file, it gets a connection reset (in this suspicious function, which has a retry but depends on reading state), and then it "recovers" by starting to read from the next key file, causing a gap in the output.
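Here's a minimal, hypothetical sketch of the failure mode; the function and record names are illustrative, not the actual exporters code:

```python
def stream_records(key, fail_at=None):
    """Yield fake records for a key, simulating a mid-stream connection reset."""
    for i in range(5):
        if fail_at is not None and i == fail_at:
            raise ConnectionResetError("connection reset by peer")
        yield f"{key}:record-{i}"

def read_all_keys(keys):
    for n, key in enumerate(keys):
        try:
            # Simulate the first key failing partway through its records.
            for record in stream_records(key, fail_at=3 if n == 0 else None):
                yield record
        except ConnectionResetError:
            # BUG: "recovering" by moving on to the next key silently drops
            # the records of the current key that were never read.
            continue

print(list(read_all_keys(["key-a", "key-b"])))
# key-a:record-3 and key-a:record-4 are missing from the output.
```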
This was introduced by the changes made to have it work in a streaming fashion when less memory and disk are available, which makes retrying and resuming much harder.
Apart from reverting those changes, I'm having trouble thinking of another solution for this.
Perhaps we can borrow some ideas from https://github.com/GoogleCloudPlatform/gsutil/tree/master/gslib on how that tool handles file transfers?
Just merged PR https://github.com/scrapinghub/exporters/pull/326, which should fix most of the issues.
It introduces a retry generator decorator, which retries reading from a stream while keeping track of an offset for the records already read.
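For illustration, here is a minimal sketch of how such a decorator might work; the name `retry_generator` and its parameters are assumptions, not the actual code from PR #326. On failure it re-creates the generator and skips the records already yielded, so the consumer sees a gapless, duplicate-free stream:

```python
import functools

def retry_generator(max_retries=3, exceptions=(ConnectionResetError,)):
    """Retry a generator function, resuming from the last record yielded."""
    def decorator(gen_func):
        @functools.wraps(gen_func)
        def wrapper(*args, **kwargs):
            offset = 0  # number of records already handed to the consumer
            for attempt in range(max_retries + 1):
                try:
                    for i, record in enumerate(gen_func(*args, **kwargs)):
                        if i < offset:
                            continue  # skip records replayed after a retry
                        offset += 1
                        yield record
                    return  # stream finished cleanly
                except exceptions:
                    if attempt == max_retries:
                        raise
        return wrapper
    return decorator

# Hypothetical usage: the decorated generator must be re-creatable with
# the same arguments, since a retry restarts it from the beginning.
@retry_generator(max_retries=2)
def records_from(key):
    # yield records streamed from `key`; may raise ConnectionResetError
    ...
```

The trade-off of this approach is that each retry re-reads the stream from the start and discards the first `offset` records, which costs bandwidth but avoids both gaps and duplicates.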
Why is this still open?
No idea, but it looks like it can be closed now.
@eliasdorneles are you ok with closing this, now that the stream code is in place?
Sure, I haven't checked the behavior since the latest changes, but if you fellows are happy, I'm happy. :+1: