
Demo dataset infrastructure

GenevieveBuckley opened this issue 3 years ago • 6 comments

I think demo dataset infrastructure would be useful.

I made a PR proposal for napari here: https://github.com/napari/napari/pull/3580 (it's based on scikit-image: they use pooch and like it)

We could have a combination of:

  1. Experimental datasets, and
  2. Synthetic datasets (it might be quicker to generate very large images than to download them; they just need to have interesting structures - see the sketch below)
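For the synthetic option, here is a minimal sketch of what lazy generation could look like, using dask.array plus dask-image's own `gaussian_filter`; the shapes, sigma, and threshold are arbitrary choices for illustration, not an agreed design:

```python
import dask.array as da
from dask_image.ndfilters import gaussian_filter

# A large 3D noise volume, built lazily chunk by chunk
# (tens of GB if fully computed, but nothing runs yet).
noise = da.random.random((2048, 2048, 2048), chunks=(256, 256, 256))

# Smoothing then thresholding the noise yields blob-like structures,
# with no download step at all.
blobs = gaussian_filter(noise, sigma=10) > 0.5
```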

GenevieveBuckley · Nov 04 '21

There are a bunch of other issues discussing ideas for specific example data, I'm linking to them here:

  • https://github.com/dask/dask-image/issues/107
  • https://github.com/dask/dask-image/issues/135
  • https://github.com/dask/dask-image/issues/128
  • https://github.com/dask/dask-image/issues/125
  • https://github.com/napari/napari/issues/316

GenevieveBuckley · Nov 04 '21

I get the sense that pooch is mainly used to download data. Is that correct? Or can it also be made to query portions of data directly from the cloud?

jakirkham · Nov 11 '21

> I get the sense that pooch is mainly used to download data. Is that correct? Or can it also be made to query portions of data directly from the cloud?

Pooch is only for downloading & extracting data. You give it a filename/url, and pooch fetches it for you.
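To illustrate the fetch-by-URL pattern, a minimal sketch; the URL is a placeholder, and a real data registry would pin a sha256 rather than passing `known_hash=None`:

```python
import pooch

# Placeholder URL; pooch downloads the file once, caches it locally,
# and returns the local path on subsequent calls.
path = pooch.retrieve(
    url="https://example.com/demo-dataset.tif",
    known_hash=None,  # a real registry would pin a sha256 here
)
```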

If you want to query portions of a dataset, you'd need that dataset to be stored in some kind of chunked format to begin with, plus some idea of how you want to do the querying. So it could be possible with a remote HDF5 (or zarr?) array.
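As a sketch of what chunked querying could look like with zarr (placeholder URL, and assuming the store root is a single array):

```python
import zarr

# Placeholder URL; zarr uses fsspec under the hood to reach
# HTTP/S3/GCS stores from a plain URL string.
z = zarr.open("https://example.com/demo.zarr", mode="r")

# Only the chunks overlapping this slice are transferred.
subvolume = z[:256, :256, :256]
```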

One thing to consider is download speed. I haven't done much testing, but it seems like common sense that zipped/tarred datasets will transfer over the network more quickly. So even with the extra time it takes to extract the data once it arrives, it might be quicker overall. That doesn't mean we have to do it that way; it's just one more thing to consider.
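For that compressed-archive workflow, pooch can unpack right after download via a processor. A sketch, again with a placeholder URL:

```python
import pooch

# Placeholder URL; pooch downloads the archive, unzips it next to the
# cached copy, and returns the list of extracted file paths.
paths = pooch.retrieve(
    url="https://example.com/demo-dataset.zip",
    known_hash=None,  # again, pin a real sha256 in practice
    processor=pooch.Unzip(),
)
```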

GenevieveBuckley · Nov 12 '21

Yeah, Zarr supports Zstandard, which compresses quite efficiently. Some filesystems use Zstandard, and it's also being explored for Conda packages for the same reasons (faster downloads, smaller packages, etc.).
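For illustration, a minimal sketch of writing a Zarr array with a Zstandard compressor from numcodecs; the shape, chunking, and compression level here are arbitrary:

```python
import numpy as np
import zarr
from numcodecs import Zstd

# Create an on-disk Zarr array whose chunks are Zstandard-compressed.
z = zarr.open(
    "demo.zarr",
    mode="w",
    shape=(1024, 1024),
    chunks=(256, 256),
    dtype="f4",
    compressor=Zstd(level=3),  # level 3 is an arbitrary choice
)
z[:] = np.random.random((1024, 1024)).astype("f4")
```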

We can also query datasets directly from the cloud with Zarr. Here's an example dataset on S3 ( https://github.com/zarr-developers/zarr-python/issues/385#issuecomment-452447219 ).
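A sketch of that pattern with dask.array; the bucket path is a placeholder, and `anon=True` assumes a public bucket:

```python
import dask.array as da

# Placeholder bucket/path; storage_options is forwarded to s3fs.
arr = da.from_zarr(
    "s3://example-bucket/demo.zarr",
    storage_options={"anon": True},  # public, unauthenticated access
)

# Everything stays lazy; only the chunks this reduction touches
# are actually read from S3.
mean = arr.mean().compute()
```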

We can also cache downloaded chunks locally to ensure we only pull from a cloud store once.
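One way to get that caching is fsspec's `simplecache::` protocol chaining; a sketch with a placeholder path:

```python
import fsspec
import zarr

# "simplecache::" makes fsspec keep a local copy of each chunk the
# first time it is read; repeat reads hit disk, not the network.
mapper = fsspec.get_mapper(
    "simplecache::s3://example-bucket/demo.zarr",  # placeholder path
    s3={"anon": True},  # options for the underlying s3fs filesystem
)
z = zarr.open(mapper, mode="r")
```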

I think this really comes down to what size datasets would be used here. If they are small, maybe pooch is fine. If they are large, maybe Zarr would be better.

jakirkham · Nov 12 '21

+1 for zarr wherever applicable

GenevieveBuckley · Nov 12 '21

A discussion about synthetic data generation is here: https://github.com/napari/napari/issues/3608

GenevieveBuckley · Nov 12 '21