Read from S3?
Can we read directly from S3 without downloading the file locally? In my case it's an SVS file of roughly 1 GB. Cheers
libvips 8.9 has a feature which might allow this:
https://libvips.github.io/libvips/2019/11/29/True-streaming-for-libvips.html
You'd need to write a small adapter class that does read and seek on large S3 objects.
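A rough sketch of what such an adapter could look like, assuming you already have some way to fetch a byte range. The class name `RangeReader` and the `fetch(start, end)` callable are made up for illustration; for S3, `fetch` would wrap a ranged GET.

```python
import io


class RangeReader:
    """File-like read/seek adapter over a byte-range fetch function.

    `fetch(start, end)` is a hypothetical callable that returns the bytes
    in the inclusive range [start, end] of the remote object.
    """

    def __init__(self, fetch, size):
        self.fetch = fetch
        self.size = size  # total object length in bytes
        self.pos = 0      # current read position

    def read(self, length):
        # Clamp the request to the end of the object; b"" signals EOF.
        end = min(self.pos + length, self.size)
        if end <= self.pos:
            return b""
        data = self.fetch(self.pos, end - 1)
        self.pos += len(data)
        return data

    def seek(self, offset, whence=io.SEEK_SET):
        # Same meaning as io seek(): returns the new absolute position.
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self.pos + offset
        elif whence == io.SEEK_END:
            new_pos = self.size + offset
        else:
            raise ValueError(f"bad whence: {whence}")
        if new_pos < 0:
            raise ValueError("negative seek position")
        self.pos = new_pos
        return self.pos
```

In pyvips the two methods would be attached to a `pyvips.SourceCustom` with `on_read` and `on_seek`, and the image opened with `pyvips.Image.new_from_source`. That only helps for loaders that accept a source.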
Actually, having said that, openslide will not work with that new stream API, unfortunately.
You'll need to download the whole SVS until the openslide library allows remote read.
Sorry, I should reply once and think a little longer.
SVS is a TIFF file, so all you'd need to do is swap TIFFOpen for TIFFClientOpen and implement callbacks for read-and-seek-from-URI.
Okay, great. To give a bit more context, the ultimate goal is to do some pre-processing in AWS Lambda (S3-triggered), so it's crucial to load only a certain level of the SVS file, similar to:
level = pyvips.Image.new_from_file(filename, level=0)
The problem is that we can't download the image to Lambda since it only offers 512 MB of /tmp. Hence streaming only the relevant level would be ideal. I'll hopefully find some time this coming weekend to look into this. Thanks a lot for the hints.
How well does S3 handle random seek and read? Does it use HTTP range requests?
S3 APIs support the HTTP Range: header (see RFC 2616), which takes a byte-range argument.
Sample S3 call: aws s3api get-object --bucket my_bucket --key path/to/my/file/file1.gz file1.gz --range bytes=1000-2000
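The same ranged read from Python might look roughly like this. The helper name `read_range` is made up; it assumes `client` behaves like `boto3.client("s3")`, whose `get_object` accepts a `Range` argument.

```python
def read_range(client, bucket, key, start, length):
    """Fetch `length` bytes starting at byte `start` from an S3 object.

    `client` is assumed to behave like boto3.client("s3").
    """
    # HTTP byte ranges are inclusive at both ends.
    byte_range = f"bytes={start}-{start + length - 1}"
    response = client.get_object(Bucket=bucket, Key=key, Range=byte_range)
    return response["Body"].read()


# Usage, equivalent to the CLI call above (needs boto3 and credentials):
#   import boto3
#   s3 = boto3.client("s3")
#   chunk = read_range(s3, "my_bucket", "path/to/my/file/file1.gz", 1000, 1001)
```

Note that bytes=1000-2000 is 1001 bytes, since both ends are included.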
That's good. You'll probably find you need a caching layer. TIFF makes a lot of random reads quite close to each other and it'll be horribly slow if you make a round trip for each one.
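One possible shape for that caching layer, as a sketch only: round every read out to fixed-size blocks and keep recently used blocks in memory, so neighbouring reads share one request. The names `make_cached_reader` and `fetch(start, end)` and the block size and cache size are all assumptions to tune.

```python
from functools import lru_cache


def make_cached_reader(fetch, size, block_size=1 << 20, max_blocks=64):
    """Wrap fetch(start, end) so nearby reads share block-sized requests.

    `fetch` is a hypothetical callable returning the inclusive byte range
    [start, end]; `size` is the total object length in bytes.
    """

    @lru_cache(maxsize=max_blocks)
    def get_block(index):
        # One round trip per block; repeat hits are served from memory.
        start = index * block_size
        end = min(start + block_size, size) - 1
        return fetch(start, end)

    def read_at(offset, length):
        end = min(offset + length, size)
        if end <= offset:
            return b""
        first = offset // block_size
        last = (end - 1) // block_size
        data = b"".join(get_block(i) for i in range(first, last + 1))
        skip = offset - first * block_size
        return data[skip:skip + (end - offset)]

    return read_at
```

With 1 MB blocks and 64 slots this holds at most about 64 MB, which should fit within a Lambda's memory, but the right numbers depend on how the TIFF directory and tiles are laid out.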