dask-image
image reading from s3 is broken with dask_image.imread in current pip / conda build
What happened:
Directly reading images from URLs is deprecated since 3.4 and will no longer be supported two minor releases later. Please open the URL for reading and pass the result to Pillow, e.g. with PIL.Image.open(urllib.request.urlopen(url)).
What you expected to happen: Reads correctly from s3 url and maintains backward compatibility
Minimal Complete Verifiable Example:
import dask.dataframe as dd
df = dd.read_csv('s3://mybucket/mycsv.csv') #no problem
import dask_image.imread
img = dask_image.imread.imread('s3://mybucket/myimg.png') # yes problem
Anything else we need to know?:
Environment:
dask        2021.5.0  pyhd8ed1ab_0    conda-forge
dask-core   2021.5.0  pyhd8ed1ab_0    conda-forge
dask-image  0.6.0     pyhd8ed1ab_0    conda-forge
pillow      8.2.0     py39h5fdd921_1  conda-forge
pims        0.5       pyh9f0ad1d_1    conda-forge
s3fs        2021.5.0  pyhd8ed1ab_0    conda-forge
- Python version: 3.9.4
- Operating System: macOS Big Sur 11.3
Think this is one of the cases that dask.array.image.imread handles well. Would try using that here
same env but scikit-image installed,
dask.array.image.imread('s3://mybucket/myimg.png')
returns No files found under name s3://mybucket/myimg.png
FWIW, this works fine
import skimage.io
import io
import boto3
boto_session = boto3.Session()
s3 = boto_session.client("s3")
stream = s3.get_object(Bucket="mybucket", Key="myimg.png")['Body']
data = io.BytesIO(stream.read())
img = skimage.io.imread(data)
Hmm...interesting. Seem to recall that working in the past. Maybe it doesn't any longer
In any event, read_csv is doing lots of clever things. Though maybe some of it could be repurposed to handle the image loading case better.
For now a reasonable thing to do would be just use dask.delayed to roll your own reader.
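A minimal sketch of what such a hand-rolled reader might look like, assuming fsspec can resolve the URL (for `s3://` that means s3fs is installed) and that Pillow decodes the format. The names `read_image` and `lazy_imread` are hypothetical, not part of dask or dask-image, and the first image is read eagerly once just to learn its shape and dtype.

```python
import io

import dask
import dask.array as da
import fsspec
import numpy as np
from PIL import Image


def read_image(urlpath, **storage_options):
    """Fetch one image through fsspec and decode it into a NumPy array."""
    # fsspec picks the backend from the URL scheme (s3:// requires s3fs).
    with fsspec.open(urlpath, mode="rb", **storage_options) as f:
        return np.asarray(Image.open(io.BytesIO(f.read())))


def lazy_imread(urlpath, **storage_options):
    """Wrap read_image in dask.delayed and expose the result as a dask array."""
    # Assumption: one eager read up front is acceptable to get shape and dtype.
    sample = read_image(urlpath, **storage_options)
    delayed_image = dask.delayed(read_image)(urlpath, **storage_options)
    return da.from_delayed(delayed_image, shape=sample.shape, dtype=sample.dtype)
```

Something like `lazy_imread('s3://mybucket/myimg.png')` would then return a lazy array, and `skimage.io.imread` could be swapped in for the Pillow call if preferred.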
Looks like the dask.array.image code is using glob. I'd imagine a recent change (or version incompatibility) in s3fs/fsspec/boto3/etc. means s3:// paths are not being mounted locally and so can't be globbed.
My workaround is to use aws-data-wrangler to query the glob string from s3 (could of course use paginator or whatnot as well), and then
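import io
import dask.bytes
import skimage.io
def ski_read(fn):
    # read_bytes returns (sample, blocks); blocks[0][0] is the first delayed block of the first file
    output = dask.bytes.read_bytes(fn, include_path=False, sample=False)
    data = output[1][0][0]
    return skimage.io.imread(io.BytesIO(data.compute()))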
Have to imagine there's a cleaner way, but I wanted to avoid injecting boto3 clients or sessions into delayed/distributed calls
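One possibly cleaner variant, sketched here under assumptions: `dask.bytes.read_bytes` expands glob patterns itself through fsspec, so the listing step and the read could be combined, and the decode can stay lazy instead of calling `compute()` inside the reader. `decode_image` and `imread_glob` are hypothetical names; the sketch uses Pillow for decoding and assumes every matched image has the same shape and dtype as the first.

```python
import io

import dask
import dask.array as da
import numpy as np
from dask.bytes import read_bytes
from PIL import Image


def decode_image(raw):
    """Decode encoded image bytes (PNG, JPEG, ...) into a NumPy array."""
    return np.asarray(Image.open(io.BytesIO(raw)))


def imread_glob(urlpath, **storage_options):
    """Lazily stack every image matching urlpath into one dask array."""
    # blocksize=None keeps each file as a single block, so no image is split.
    _, blocks, paths = read_bytes(
        urlpath, blocksize=None, sample=False, include_path=True, **storage_options
    )
    # Assumption: all images share the shape and dtype of the first one,
    # which is decoded eagerly here to learn them.
    first = decode_image(blocks[0][0].compute())
    frames = [
        da.from_delayed(
            dask.delayed(decode_image)(file_blocks[0]),
            shape=first.shape,
            dtype=first.dtype,
        )
        for file_blocks in blocks
    ]
    return da.stack(frames), paths
```

A call such as `imread_glob('s3://mybucket/*.png')` would return the stacked lazy array together with the matched paths, without passing boto3 clients or sessions into the delayed tasks.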
Glad you figured out something that works :)
Yeah that's the downside of the dask.array.image.imread. I don't think it gets the same amount of attention as imread here. So not too surprised that it has drifted out-of-sync
Agree there's probably room to improve. Handling cloud based storage seems desirable. Currently we hand things off to PIMS. There's an open issue about handling URLs ( https://github.com/soft-matter/pims/issues/310 ). Not seeing one related to S3. Maybe worth raising?
To be fair, dask_image.imread.imread also doesn’t work, so it’s not a dask.array vs dask_image issue
Yep that's true. Though we already knew imread in dask-image didn't work. dask.array's did work previously (so the fact it doesn't is news)
If anything, it highlights an infrastructure issue: dask core and the various "affiliates" aren't really integrated as far as core functionality goes (or at least not in this instance, when it comes to loading). My solution uses a pretty simple core dask method rather than a custom approach for a very niche use case, which in my mind reinforces that anything "load"-like, regardless of array, image, dataframe, etc., probably can be, and ought to be, based on a standardized dask method.