dask-image

image reading from s3 is broken with dask_image.imread in current pip / conda build

Open kpasko opened this issue 4 years ago • 8 comments

What happened: Directly reading images from URLs is deprecated since 3.4 and will no longer be supported two minor releases later. Please open the URL for reading and pass the result to Pillow, e.g. with PIL.Image.open(urllib.request.urlopen(url)).

What you expected to happen: Reads correctly from s3 url and maintains backward compatibility

Minimal Complete Verifiable Example:

import dask.dataframe as dd
df = dd.read_csv('s3://mybucket/mycsv.csv')   #no problem


import dask_image.imread
img = dask_image.imread.imread('s3://mybucket/myimg.png') # yes problem

Anything else we need to know?:

Environment:

dask         2021.5.0   pyhd8ed1ab_0     conda-forge
dask-core    2021.5.0   pyhd8ed1ab_0     conda-forge
dask-image   0.6.0      pyhd8ed1ab_0     conda-forge
pillow       8.2.0      py39h5fdd921_1   conda-forge
pims         0.5        pyh9f0ad1d_1     conda-forge
s3fs         2021.5.0   pyhd8ed1ab_0     conda-forge

  • Python version: 3.9.4
  • Operating System: macOS Big Sur 11.3

kpasko, May 25 '21 20:05

Think this is one of the cases that dask.array.image.imread handles well. Would try using that here

jakirkham, May 25 '21 20:05

Same environment, but with scikit-image installed: dask.array.image.imread('s3://mybucket/myimg.png')

fails with: No files found under name s3://mybucket/myimg.png

FWIW, this works fine

import skimage.io
import io
import boto3

boto_session = boto3.Session()
s3 = boto_session.client("s3")
stream = s3.get_object(Bucket="mybucket", Key="myimg.png")['Body']
data = io.BytesIO(stream.read())
img = skimage.io.imread(data)

kpasko, May 25 '21 21:05

Hmm...interesting. Seem to recall that working in the past. Maybe it doesn't any longer

In any event, read_csv is doing lots of clever things. Though maybe some of it could be repurposed to handle the image loading case better.

For now a reasonable thing to do would be just use dask.delayed to roll your own reader.
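A minimal sketch of such a reader, assuming fsspec (with s3fs installed) can open the URL and that scikit-image can decode the bytes. The names `read_image` and `decode` are hypothetical, not part of dask or dask-image:

```python
import io

import dask
import fsspec
import skimage.io


@dask.delayed
def read_image(url, decode=skimage.io.imread):
    """Fetch one object through fsspec and decode it into an array.

    `decode` is a hypothetical hook: any callable that accepts a
    file-like object. It defaults to scikit-image's reader.
    """
    # fsspec picks the backend (s3fs for "s3://") from the URL scheme.
    with fsspec.open(url, mode="rb") as f:
        data = f.read()
    return decode(io.BytesIO(data))


# Nothing is downloaded here; this only builds a lazy task.
img = read_image("s3://mybucket/myimg.png")
```

Calling `img.compute()` would then perform the download and decoding on whichever worker runs the task.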

jakirkham, May 25 '21 23:05

Looks like the dask.array.image code is using glob. I'd imagine a recent change (or version incompatibility) in s3fs/fsspec/boto3/etc. means s3:// is no longer being mounted locally and so can't be glob'd.
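To illustrate the suspected cause: assuming the reader relies on the standard-library glob, as it appears to, an s3:// string is treated as a local path, so nothing matches:

```python
import glob

# The standard-library glob only walks the local filesystem, so an
# "s3://..." pattern is interpreted as a (non-existent) local path.
matches = glob.glob("s3://mybucket/*.png")

# Empty unless a local directory literally named "s3:" happens to exist.
print(matches)
```

An empty match list is consistent with the "No files found under name" error reported above.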

My workaround is to use aws-data-wrangler to query the glob string from s3 (could of course use paginator or whatnot as well), and then

import io
import dask.bytes
import skimage.io

def ski_read(fn):
    output = dask.bytes.read_bytes(fn, include_path=False, sample=False)
    data = output[1][0][0]  # first delayed block of the first file
    return skimage.io.imread(io.BytesIO(data.compute()))

Have to imagine there's a cleaner way, but I wanted to avoid injecting boto3 clients or sessions into delayed/distributed calls
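One possibly cleaner variant, sketched under assumptions rather than tested against S3: let fsspec expand the glob on the remote filesystem and pass its serializable OpenFile objects into delayed tasks, so no boto3 client is involved. The names `_load` and `imread_remote` are hypothetical, and the sketch assumes every image has the same shape and dtype as the first one:

```python
import io

import dask
import dask.array as da
import fsspec
import skimage.io


def _load(open_file):
    """Read one fsspec OpenFile fully and decode it with scikit-image."""
    with open_file as f:
        return skimage.io.imread(io.BytesIO(f.read()))


def imread_remote(pattern):
    """Lazily stack every image matching an fsspec URL pattern.

    Assumption: all images share the shape and dtype of the first.
    """
    # The glob is expanded by the filesystem behind the URL scheme.
    files = fsspec.open_files(pattern, mode="rb")
    if not files:
        raise FileNotFoundError(f"No files found under name {pattern}")

    # Read one image eagerly to learn the shape and dtype.
    sample = _load(files[0])

    lazy = [
        da.from_delayed(
            dask.delayed(_load)(of), shape=sample.shape, dtype=sample.dtype
        )
        for of in files
    ]
    return da.stack(lazy, axis=0)
```

With s3fs installed, `imread_remote("s3://mybucket/*.png")` would return a dask array with one leading index per file. Only the first image is read up front.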

kpasko, May 27 '21 00:05

Glad you figured out something that works :)

Yeah that's the downside of the dask.array.image.imread. I don't think it gets the same amount of attention as imread here. So not too surprised that it has drifted out-of-sync

Agree there's probably room to improve. Handling cloud based storage seems desirable. Currently we hand things off to PIMS. There's an open issue about handling URLs ( https://github.com/soft-matter/pims/issues/310 ). Not seeing one related to S3. Maybe worth raising?

jakirkham, May 27 '21 04:05

To be fair, dask_image.imread.imread also doesn’t work, so it’s not a dask.array vs dask_image issue

kpasko, May 27 '21 04:05

Yep that's true. Though we already knew imread in dask-image didn't work. dask.array's did work previously (so the fact it doesn't is news)

jakirkham, May 27 '21 04:05

If anything, it highlights an infrastructural issue: dask core and the various “affiliates” aren’t really integrated when it comes to core functionality (or at least, in this instance, loading). My solution uses a pretty simple core dask method rather than a custom approach, for what is a very niche use case. To my mind, that reinforces the point that anything “load”-like, regardless of array, image, dataframe, etc., probably can be, and ought to be, based on a standardized dask method.

kpasko, May 27 '21 04:05