pyvips: read from S3?

Can we read directly from S3 without downloading the file locally? In my case it's an SVS file of roughly 1 GB. Cheers

Dec 05 '19 17:12 misteliy

libvips 8.9 has a feature which might allow this:

https://libvips.github.io/libvips/2019/11/29/True-streaming-for-libvips.html

You'd need to write a small adapter class to do read and seek on large S3 objects.
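
A minimal sketch of what such an adapter might look like, assuming you already have some function that returns an inclusive byte range of the remote object. The names `RangeReader`, `fetch_range` and `open_with_pyvips` are made up for illustration. The pyvips hookup assumes pyvips 2.1.12 or later on top of libvips 8.9, and only helps for loaders that can read from a source.

```python
import io


class RangeReader:
    """Read-only, seekable file-like object backed by a byte-range fetcher.

    fetch_range(start, end) is a hypothetical callable that returns the bytes
    in the inclusive range [start, end] of the remote object.
    """

    def __init__(self, fetch_range, size):
        self._fetch_range = fetch_range
        self._size = size  # total object size in bytes
        self._pos = 0      # current read position

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError(f"unsupported whence: {whence}")
        if pos < 0:
            raise ValueError("cannot seek before the start of the object")
        self._pos = pos
        return pos

    def read(self, size=-1):
        remaining = self._size - self._pos
        if size is None or size < 0:
            size = remaining
        size = min(size, remaining)
        if size <= 0:
            return b""  # at or past end of object
        data = self._fetch_range(self._pos, self._pos + size - 1)
        self._pos += len(data)
        return data


def open_with_pyvips(reader):
    """Wrap a RangeReader in a pyvips custom source (untested sketch)."""
    import pyvips  # imported lazily; needs pyvips >= 2.1.12, libvips >= 8.9

    source = pyvips.SourceCustom()
    source.on_read(reader.read)  # handler(size) -> bytes
    source.on_seek(reader.seek)  # handler(offset, whence) -> new position
    return pyvips.Image.new_from_source(source, "")
```

As written, every `read` is one round trip to the remote store, so this is only the skeleton.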

Dec 05 '19 17:12 jcupitt

Actually, having said that, openslide will not work with that new stream API, unfortunately.

You'll need to download the whole SVS until the openslide library allows remote read.

Dec 05 '19 17:12 jcupitt

Sorry, I should reply once and think a little longer.

SVS is a TIFF file, so all you'd need to do is swap TIFFOpen for TIFFClientOpen and implement callbacks for read-and-seek-from-URI.

Dec 05 '19 17:12 jcupitt

Okay, great. To give a bit more context, the ultimate goal is to do some pre-processing in AWS Lambda (S3-triggered), so it's crucial to load only a certain level of the SVS file, similar to:

level = pyvips.Image.new_from_file(filename, level=0)

The problem is that we can't download the image to Lambda, since it only offers 512 MB of /tmp. Streaming only the relevant level would therefore be ideal. Hopefully I'll find some time this coming weekend to look into this. Thanks a lot for the hints.

Dec 05 '19 17:12 misteliy

How well does S3 handle random seek and read? Does it use HTTP range requests?

Dec 05 '19 17:12 jcupitt

S3 APIs support the HTTP Range header (see RFC 2616), which takes a byte-range argument.

Sample S3 call: aws s3api get-object --bucket my_bucket --key path/to/my/file/file1.gz file1.gz --range bytes=1000-2000
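
The same call from Python might look roughly like this with boto3, whose `get_object` accepts a `Range` argument. The helper names `byte_range_header` and `s3_fetch_range` are made up for this sketch, and the bucket and key are whatever you pass in. Note that both ends of an HTTP byte range are inclusive, so `bytes=1000-2000` is 1001 bytes.

```python
def byte_range_header(offset, length):
    """Build an HTTP Range header value for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # Both ends of an HTTP byte range are inclusive.
    return f"bytes={offset}-{offset + length - 1}"


def s3_fetch_range(bucket, key, offset, length):
    """Fetch `length` bytes of an S3 object starting at `offset` (sketch)."""
    import boto3  # imported lazily so byte_range_header has no dependencies

    s3 = boto3.client("s3")
    response = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=byte_range_header(offset, length),
    )
    return response["Body"].read()
```

For example, `byte_range_header(1000, 1001)` gives the same range as the CLI call above.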

Dec 05 '19 18:12 misteliy

That's good. You'll probably find you need a caching layer. TIFF makes a lot of random reads quite close to each other and it'll be horribly slow if you make a round trip for each one.
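
One way such a caching layer could be sketched: round every read out to fixed-size aligned blocks and keep the most recently used blocks in memory, so nearby reads hit the cache instead of the network. `BlockCache`, `fetch_range`, and the default block size and count are all assumptions for illustration, not tuned values.

```python
from collections import OrderedDict


class BlockCache:
    """LRU cache of fixed-size blocks in front of a byte-range fetcher.

    fetch_range(start, end) is a hypothetical callable that returns the bytes
    in the inclusive range [start, end] of the remote object.
    """

    def __init__(self, fetch_range, size, block_size=1 << 20, max_blocks=32):
        self._fetch_range = fetch_range
        self._size = size              # total object size in bytes
        self._block_size = block_size
        self._max_blocks = max_blocks
        self._blocks = OrderedDict()   # block index -> bytes, oldest first

    def _block(self, index):
        if index in self._blocks:
            self._blocks.move_to_end(index)  # mark as most recently used
            return self._blocks[index]
        start = index * self._block_size
        end = min(start + self._block_size, self._size) - 1
        data = self._fetch_range(start, end)  # one round trip per block
        self._blocks[index] = data
        if len(self._blocks) > self._max_blocks:
            self._blocks.popitem(last=False)  # evict least recently used
        return data

    def read(self, offset, length):
        """Return up to `length` bytes starting at `offset`."""
        end = min(offset + length, self._size)
        if offset < 0 or offset >= end:
            return b""
        first = offset // self._block_size
        last = (end - 1) // self._block_size
        buf = b"".join(self._block(i) for i in range(first, last + 1))
        skip = offset - first * self._block_size
        return buf[skip:skip + (end - offset)]
```

With a 1 MB block size, a burst of small reads inside the same region of the file costs a single request. The right block size depends on how the TIFF directory and tiles are laid out, so it is worth measuring.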

Dec 05 '19 21:12 jcupitt