dask-image
                        Example data - EM dataset
It might be nice to have an EM dataset as a good, large dataset to work with and show examples on. Hopefully others can point us to something on cloud storage that is easy to access and has friendly licensing.
cc @stephenplaza @perlman
A good place to start would be EMPIAR: https://www.ebi.ac.uk/pdbe/emdb/empiar/
All of the FAFB is open (CC BY-NC 4.0) and hosted by Google, though it's in the Neuroglancer precomputed format and would need a utility like cloud-volume to read.
The synapse predictions are in a cloud-hosted n5 volume, but may not be all that interesting on their own? (Though you could have examples calculating the density of synapses, etc.)
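For reference, reading a region from a precomputed layer with cloud-volume and handing it to dask as a single chunk looks roughly like the sketch below. The bucket path is a placeholder, not the actual FAFB location, and the coordinates are arbitrary.
import cloudvolume
import dask.array as da

# Placeholder path -- substitute the real precomputed bucket/layer.
vol = cloudvolume.CloudVolume('precomputed://gs://<fafb-bucket>/<layer>', mip=0)

# Slicing a CloudVolume downloads that region and returns a NumPy-like cutout,
# which can then be wrapped as a (single-chunk) dask array.
cutout = vol[2048:2560, 2048:2560, 100:108]
arr = da.from_array(cutout, chunks=cutout.shape)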
I am interested in the cloud-volume use case, but the CloudVolume object is not array-like enough to work with dask, and the cloud-volume library uses some methods that dask arrays do not expose.
I have contemplated a wrapper to paper over the differences, but am not sure how much of the API would need to be implemented.
Example of upload. The cloud-volume upload steps have their own chunking, which could differ from that of the initial dask array. It may "just work" because it is just doing a slice under the hood; of course, it would probably be nicer to use map_blocks and maintain the same chunking. A basic upload fails when cloud-volume tries to call tostring to persist the data:
import cloudvolume
import dask.array as da
import numpy as np
chunks = (8,8,8)
data = da.zeros((32,32,8,2), chunks=chunks + (1,))
vol = cloudvolume.CloudVolume.from_numpy(
    data,
    chunk_size=chunks,
    layer_type='image',
)
~/.local/share/virtualenvs/notebook-MrZAEDwa/lib/python3.6/site-packages/cloudvolume/chunks.py in encode_raw(subvol)
    135 
    136 def encode_raw(subvol):
--> 137   return subvol.tostring('F')
    138 
    139 def encode_kempressed(subvol):
AttributeError: 'Array' object has no attribute 'tostring'
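One possible workaround is sketched below (untested; keyword names like vol_path may differ between cloud-volume versions, and the file:// path is a placeholder): bootstrap the layer from a plain NumPy array so cloud-volume's serialization succeeds, then push each dask block through map_blocks, converting it to a contiguous NumPy array before assignment so writes stay aligned with the same chunking.
import cloudvolume
import dask.array as da
import numpy as np

chunks = (8, 8, 8)
data = da.zeros((32, 32, 8, 1), dtype=np.uint8, chunks=chunks + (1,))

# Bootstrap the layer from a NumPy array of the right shape/dtype so that
# cloud-volume's .tostring() call succeeds; the local file:// path is a
# placeholder for whatever storage backend you actually use.
vol = cloudvolume.CloudVolume.from_numpy(
    np.zeros(data.shape, dtype=data.dtype),
    vol_path='file:///tmp/dask-image-example',
    chunk_size=chunks,
    layer_type='image',
)

def write_block(block, block_info=None, vol=None):
    # block_info[None]['array-location'] gives the (start, stop) pairs of this
    # block within the full array, so writes stay aligned with the dask chunks
    # (and, because chunk_size matches above, with the cloud-volume chunks).
    loc = block_info[None]['array-location']
    slices = tuple(slice(start, stop) for start, stop in loc)
    vol[slices] = np.ascontiguousarray(block)
    return block

data.map_blocks(write_block, vol=vol, dtype=data.dtype).compute()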
And during download we find that the shape of a CloudVolume object is an array (not a tuple), which does not play well with dask. In general, access to the underlying shape/chunks does not follow the dask API.
vol2 = cloudvolume.CloudVolume(vol.cloudpath)
da.from_array(vol2, chunks=tuple(vol2.chunk_size) + (1,))
~/.local/share/virtualenvs/notebook-MrZAEDwa/lib/python3.6/site-packages/dask/array/core.py in normalize_chunks(chunks, shape, limit, dtype, previous_chunks)
   2410 
   2411     if (
-> 2412         shape
   2413         and len(shape) == 1
   2414         and len(chunks) > 1
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Maybe you can roll your own custom loader to get things into a NumPy array and then feed those into Dask?
<shameless plug>This blog post may be useful for doing that.</shameless plug>
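For example, a lazy loader along those lines might wrap each cloud-volume download in dask.delayed and stitch the pieces back together. A sketch, with a placeholder path and an arbitrary slab depth:
import cloudvolume
import dask
import dask.array as da
import numpy as np

# Placeholder path; substitute a real precomputed layer.
vol = cloudvolume.CloudVolume('precomputed://gs://<bucket>/<layer>', mip=0)
shape = tuple(int(s) for s in vol.shape)   # vol.shape is an array, not a tuple
dtype = np.dtype(vol.dtype)

@dask.delayed
def load_slab(z0, z1):
    # Download one z-slab and hand Dask a plain NumPy array.
    return np.asarray(vol[:, :, z0:z1])

depth = 64
slabs = [
    da.from_delayed(
        load_slab(z, min(z + depth, shape[2])),
        shape=(shape[0], shape[1], min(z + depth, shape[2]) - z, shape[3]),
        dtype=dtype,
    )
    for z in range(0, shape[2], depth)
]
arr = da.concatenate(slabs, axis=2)
No data is fetched until arr (or a computation on it) is actually computed.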
Good thoughts. One of my aims is to avoid instantiating the data into NumPy arrays and only lazily read data when needed. I've created some small wrappers, shown in this gist, which seem to cover some of the necessary basics. I don't know the insides of dask, or how well this would work in a massive task graph. There might be additional methods needed for handling chunks, caching, distributed computing, etc.
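For reference, here is a minimal sketch of the kind of wrapper described (not the actual contents of the gist), assuming only that CloudVolume exposes shape, dtype, chunk_size, and slice-based indexing:
import cloudvolume
import dask.array as da
import numpy as np

class CloudVolumeWrapper:
    """Thin adapter exposing the tuple-based shape/dtype/__getitem__ that
    dask.array.from_array expects, while reads stay lazy inside cloud-volume."""

    def __init__(self, vol):
        self._vol = vol
        self.shape = tuple(int(s) for s in vol.shape)  # tuple, not an array
        self.dtype = np.dtype(vol.dtype)
        self.ndim = len(self.shape)

    def __getitem__(self, slices):
        # Each chunk access downloads just that region and returns NumPy data.
        return np.asarray(self._vol[slices])

# Placeholder path; substitute a real layer.
vol = cloudvolume.CloudVolume('precomputed://gs://<bucket>/<layer>', mip=0)
wrapped = CloudVolumeWrapper(vol)
arr = da.from_array(wrapped, chunks=tuple(int(c) for c in vol.chunk_size) + (1,))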
Dask is already lazy anyway, so instantiating NumPy arrays in each chunk should work well.