
S3Reader current behavior is unreliable

Open eliasdorneles opened this issue 8 years ago • 6 comments

We're hitting a bug with S3Reader that happens when files are big-ish and the destination is a bit slow.

While it's uploading the file, it gets a connection reset (this suspicious function has a retry, but it depends on the reading state), and then it "recovers" by starting to read from the next key -- causing a gap in the output.

This was introduced by the changes made to support streaming for environments with limited memory and disk space available -- which makes retrying and resuming much harder.

Apart from reverting those changes, I'm having trouble thinking of another solution for this.

eliasdorneles avatar Jul 21 '16 16:07 eliasdorneles

Perhaps we can gain some ideas from https://github.com/GoogleCloudPlatform/gsutil/tree/master/gslib on how that tool handles transferring files?

tsrdatatech avatar Jul 21 '16 16:07 tsrdatatech

Just merged PR https://github.com/scrapinghub/exporters/pull/326, which should fix most of the issues.

It introduces a retry generator decorator, which is used to retry reading from a stream, keeping track of an offset for the records it already read.
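As a rough illustration of the idea (a hypothetical sketch, not the actual code from that PR -- the decorator name, parameters, and the exceptions it retries on are all assumptions), such a decorator can restart the underlying stream on failure and skip the records that were already yielded, so the consumer sees neither gaps nor duplicates:

```python
import functools
import itertools
import time


def retry_generator(max_retries=3, delay=0.0):
    """Hypothetical sketch: retry a generator function on I/O errors,
    resuming at the offset of the last record already yielded."""
    def decorator(gen_func):
        @functools.wraps(gen_func)
        def wrapper(*args, **kwargs):
            offset = 0   # records successfully yielded so far
            retries = 0
            while True:
                try:
                    # Restart the stream from scratch and skip the
                    # records we already emitted to the consumer.
                    stream = gen_func(*args, **kwargs)
                    for record in itertools.islice(stream, offset, None):
                        yield record
                        offset += 1
                    return
                except IOError:  # e.g. a connection reset mid-stream
                    retries += 1
                    if retries > max_retries:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
```

With this approach, a connection reset partway through a key restarts the read of that same key rather than silently skipping ahead to the next one; the trade-off is that skipping to the offset still re-reads (and discards) the earlier records from the source.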

eliasdorneles avatar Aug 15 '16 13:08 eliasdorneles

Why is this still open?

josericardo avatar Nov 16 '16 16:11 josericardo

No idea but looks like it can be closed now.

tsrdatatech avatar Nov 16 '16 16:11 tsrdatatech

@eliasdorneles are you ok with this being closed now with the stream code now in place?

tsrdatatech avatar Nov 29 '16 14:11 tsrdatatech

Sure, I haven't checked the behavior since the latest changes, but if you fellows are happy, I'm happy. :+1:

eliasdorneles avatar Nov 29 '16 16:11 eliasdorneles