Example image data for dask-image

Open GenevieveBuckley opened this issue 6 years ago • 26 comments

dask-image example datasets

We need some good example data for tutorials with dask-image.

This issue is a place for discussion and suggestions. If you have links, add them here!

Ideally this data should:

  • Have a permissive license
  • Be easily downloaded on demand by the user
  • Be big, but not too big. We want something that will automatically be spread over a few dask chunks, but not too large to download. 1 - 2 GB? What makes sense here?
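
For a rough sense of scale, here is a minimal sketch that estimates how many chunks a candidate dataset would be split into. The example shape, the uint16 pixel size and the 128 MiB target chunk size are all assumptions for illustration, not properties of any dataset listed in this issue.

```python
import math


def estimate_chunks(shape, itemsize, chunk_bytes=128 * 2**20):
    """Return (total_bytes, n_chunks) for an array with the given shape.

    `itemsize` is the size of one pixel in bytes (2 for uint16).
    `chunk_bytes` is an assumed target chunk size, 128 MiB here.
    """
    total_bytes = math.prod(shape) * itemsize
    n_chunks = math.ceil(total_bytes / chunk_bytes)
    return total_bytes, n_chunks


# Hypothetical volume: 100 frames of 2048 x 2048 uint16 pixels
total, n = estimate_chunks((100, 2048, 2048), itemsize=2)
print(f"{total / 2**30:.2f} GiB in about {n} chunks")
```

With those assumed numbers the volume is a bit under 1 GB and lands in a handful of chunks, which is roughly the range asked about above.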

It would be nice to have

  • scientific images (microscopy, astronomy, satellite/geo-spatial images, maybe histology slides)
  • a filetype we don't need another third party library to open (we have pims already, something it could handle would be ideal)
  • the data hosted by someone else, but in a stable situation where we can reasonably count on that continuing

What we want to avoid:

  • Websites that make you register (even for free) before you can download data.

EDIT: https://github.com/napari/napari/issues/316

Just saw this tweet announcing a human brain MRI at 100 µm isotropic resolution. This could be a very cool dataset to use as a napari demo. I suggest we use this issue to keep track of datasets that we could put in napari once we have proper data downloading. Please just edit the checklist below to add your preferred demo data.

* [ ]  100 µm resolution human brain: https://twitter.com/ComaRecoveryLab/status/1134436231775961088

* [ ]  10m resolution vegetation cover in Victoria: http://francois-petitjean.com/Research/MonashVegMap/info.php and https://labo.obs-mip.fr/multitemp/mapping-a-part-of-australia-at-10-m-resolution/

* [ ]  correlative superres https://www.biorxiv.org/content/10.1101/773986v1.abstract

* [x]  SARS-CoV2 in gut epithelium https://twitter.com/notjustmoore/status/1256232842755014656

* [ ]  developing sea squirt https://www.nytimes.com/2020/07/09/science/sea-squirts-embryos.html

* [ ]  mechanobiology of intestinal organoids https://twitter.com/XavierTrepat/status/1308026944349450241

* [ ]  tracking of particles on astral microtubules ([paper](https://www.biorxiv.org/content/10.1101/2020.06.17.154260v1), [tweet](https://twitter.com/the_Node/status/1341050276011237379)), could make a really neat demo for the tracks layer.

* [ ]  [Sentinel-2 1y Cloud optimised geotiff dataset](https://medium.com/sentinel-hub/digital-twin-sandbox-sentinel-2-collection-available-to-everyone-20f3b5de846e)

* [ ]  Calcium imaging in the Drosophila ellipsoid body ([2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3830704/) and [2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4704792/))

* [ ]  Janelia FlyLight data ([AWS](https://registry.opendata.aws/janelia-flylight/))

* [ ]  Fruit Fly Brain Observatory (FFBO) ([tweet](https://twitter.com/FlyBrainObs/status/1369496338266750977))

* [ ]  Janelia [Open Organelle datasets](https://openorganelle.janelia.org/)

* [ ]  CZBiohub [Open Cell](https://opencell.czbiohub.org/about)

* [ ]  [CoCo](https://cocodataset.org/#download) + [Voxel51 datasets](https://voxel51.com/docs/fiftyone/user_guide/using_datasets.html)

* [ ]  @tlambert03's lattice light sheet dataset used in the dask application post https://www.ebi.ac.uk/biostudies/studies/S-BSST435?query=Talley%20lambert

two cryo-ET datasets to add to the pile

  • https://zenodo.org/record/6504891 - deconvolved tomogram of HIV virus-like particles + annotations
  • https://zenodo.org/record/6504949 - denoised tomogram of a M. pneumoniae cell

There are also some more developing Tribolium embryos, mouse brain slices and mouse colon volumes:

https://zenodo.org/record/4276076#.YYJKMWDMJaR

https://github.com/napari/napari/issues/316#issuecomment-952642188

... and this 3D cell tracking dataset is gorgeous! http://celltrackingchallenge.net/

C. elegans developing embryo
Waterston Lab, University of Washington, Seattle, WA, USA

  • Training dataset: http://data.celltrackingchallenge.net/training-datasets/Fluo-N3DH-CE.zip (3.1 GB)
  • Challenge dataset: http://data.celltrackingchallenge.net/challenge-datasets/Fluo-N3DH-CE.zip (1.7 GB)
  • Microscope: Zeiss LSM 510 Meta
  • Objective lens: Plan-Apochromat 63x/1.4 (oil)
  • Voxel size (microns): 0.09 x 0.09 x 1.0
  • Time step (min): 1 (1.5)
  • Additional information: Nature Methods, 2008

GenevieveBuckley avatar Mar 05 '19 08:03 GenevieveBuckley

The Cancer Genome Atlas Database might work: https://portal.gdc.cancer.gov/

There are some histology images there that could fit the requirements. They do have file formats that would probably need a third party library to read into python, but you can download individual images separately pretty easily.

GenevieveBuckley avatar Mar 05 '19 08:03 GenevieveBuckley

The xarray examples (like this one, or this one) sometimes use NetCDF climate data from the Climate Data Store. Website: https://cds.climate.copernicus.eu

  • A user login might be required in all cases
  • There is a python API for data downloads
  • xarray can open NetCDF files fairly easily if netCDF4 is also installed. This might be a lot easier than trying to fiddle with user java installations for python-bioformats.

https://towardsdatascience.com/handling-netcdf-files-using-xarray-for-absolute-beginners-111a8ab4463f

GenevieveBuckley avatar May 07 '19 01:05 GenevieveBuckley

Nick took a histology CC0 image and converted it to zarr - https://camelyon16.grand-challenge.org/Data/

Edit: updated link - https://camelyon17.grand-challenge.org/Data/

GenevieveBuckley avatar Jul 13 '19 22:07 GenevieveBuckley

The landsat data might also be good https://landsat.gsfc.nasa.gov/data/

Some people have said they think landsat is CC0 licensed, but I haven't found that page on the website yet, so we'd better double check.

Here's a wrapper around the API to make it easier: https://github.com/loicdtx/lsru

And from the napari discussions https://github.com/napari/napari/issues/408#issuecomment-511214119

More info on Landsat 8 can be found at https://landsat.gsfc.nasa.gov/landsat-8/mission-details/ and I used https://github.com/loicdtx/lsru to order the imagery. Folks can also download Landsat 8 with https://earthexplorer.usgs.gov/

GenevieveBuckley avatar Jul 14 '19 23:07 GenevieveBuckley

cc @scottyhq who knows a bunch about landsat

(although, Scott is also a big xarray user, and we might want to avoid Xarray for this example in order to keep things focused on Dask Image)

mrocklin avatar Jul 15 '19 02:07 mrocklin

Thanks @mrocklin. Yes, we've used landsat8 for some examples since it is a public dataset on AWS and Google Cloud. Here is a blog post with some background: https://medium.com/pangeo/cloud-native-geoprocessing-of-earth-observation-satellite-data-with-pangeo-997692d91ca2, or if you just want to take a look at a notebook: https://github.com/scottyhq/esip-tech-dive/blob/master/notebooks/0-demo-aws.ipynb. As mentioned, these examples demonstrate using xarray integrated with dask.

scottyhq avatar Jul 15 '19 16:07 scottyhq

Thank you @scottyhq I'll take a look at those links and see if we can't get something up and running

GenevieveBuckley avatar Jul 15 '19 19:07 GenevieveBuckley

@jakirkham someone just asked me if we could use some of the images you link to from your blog post on loading image data. They might or might not be good candidates, but I ALSO notice that there is no licence with that data. Is that an oversight?

GenevieveBuckley avatar Aug 05 '19 23:08 GenevieveBuckley

More code from @timothywallaby, looping through a large image and appending to zarr: https://github.com/timothywallaby/dask/blob/master/OpenSlidetoZarr.ipynb
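
The general pattern of walking a large image tile by tile might look something like the sketch below. This is not the code from the notebook linked above; `iter_tiles` and the tile size are hypothetical names and values chosen for illustration. In a real conversion each pair of slices would be used to read one region from the source image and write it to the matching region of the zarr array.

```python
def iter_tiles(shape, tile):
    """Yield (row_slice, col_slice) pairs that cover a 2D image of `shape`.

    Tiles on the bottom and right edges are clipped to the image bounds,
    so they may be smaller than `tile`.
    """
    rows, cols = shape
    tile_rows, tile_cols = tile
    for r in range(0, rows, tile_rows):
        for c in range(0, cols, tile_cols):
            yield (
                slice(r, min(r + tile_rows, rows)),
                slice(c, min(c + tile_cols, cols)),
            )


# Hypothetical 10000 x 15000 pixel slide, processed in 4096 x 4096 tiles
tiles = list(iter_tiles((10000, 15000), (4096, 4096)))
print(f"{len(tiles)} tiles to read and write")
```

Because only one tile is held at a time, memory use depends on the tile size rather than on the size of the whole image.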

GenevieveBuckley avatar Aug 06 '19 06:08 GenevieveBuckley

@sofroniewn do you have the code you used to convert the Camelyon data to zarr?

jni avatar Aug 07 '19 07:08 jni

@sofroniewn do you have the code you used to convert the Camelyon data to zarr?

The code from @sofroniewn is here: https://github.com/sofroniewn/image-demos/blob/master/helpers/make_2D_zarr_pathology.py

The instructions were not to use it as is until we can work out why the saved file is bigger than the original tiff. Personally I also feel that for this purpose we don't really need the multilevel hierarchy, so that might make things a bit simpler.

GenevieveBuckley avatar Aug 07 '19 07:08 GenevieveBuckley

@jakirkham someone just asked me if we could use some of the images you link to from your blog post on loading image data.

@rxist525, what do you think? Would it be ok to use that data for code examples here?

jakirkham avatar Aug 12 '19 22:08 jakirkham

@jakirkham someone just asked me if we could use some of the images you link to from your blog post on loading image data.

@rxist525, what do you think? Would it be ok to use that data for code examples here?

Absolutely!!

rxist525 avatar Aug 13 '19 00:08 rxist525

@jakirkham someone just asked me if we could use some of the images you link to from your blog post on loading image data. They might or might not be good candidates, but I ALSO notice that there is no licence with that data. Is that an oversight?

you are welcome to use the data, which is part of a recent publication.

rxist525 avatar Aug 13 '19 00:08 rxist525

Is there a license for that dataset @rxist525?

GenevieveBuckley avatar Aug 13 '19 00:08 GenevieveBuckley

Is there a license for that dataset @rxist525?

good question, let me check and get back.

rxist525 avatar Aug 13 '19 00:08 rxist525

Also a potentially useful discussion: https://github.com/thewtex/fiber-bed-zarr/issues/1#issuecomment-595984988

GenevieveBuckley avatar Mar 08 '20 23:03 GenevieveBuckley

Is there a license for that dataset @rxist525?

good question, let me check and get back.

Just got off a call with Gokul earlier, he mentioned they've now added the CC BY-SA 4.0 license with the data. Though they are potentially open to changing it if it causes issues. Feel free to correct me Gokul if needed.

jakirkham avatar Nov 16 '20 22:11 jakirkham

Apologies for dropping the ball on this - thanks for adding the note!

On Mon, Nov 16, 2020 at 2:44 PM jakirkham wrote:

Is there a license for that dataset @rxist525?

good question, let me check and get back.

Just got off a call with Gokul earlier, he mentioned they've now added the CC BY-SA 4.0 license with the data (https://drive.google.com/drive/folders/1z1nB_DRgXYWwuUBEHYvj5hVotnAlR3W4). Though they are potentially open to changing it if it causes issues. Feel free to correct me Gokul if needed.

rxist525 avatar Nov 16 '20 22:11 rxist525

Juan says on the napari zulip:

Talley's lattice dataset at bioimage archive has accession S-BSST435. See this page for how to access data; neither Talley nor I have actually tried to get it out yet :joy: https://www.ebi.ac.uk/biostudies/help

Link: https://www.ebi.ac.uk/biostudies/studies/S-BSST435

(Note: Volker tried to download the sample file, but couldn't unzip it properly. He thinks it was uploaded as a zip, which has also been zipped again by bioimage archive. He asks that anyone who manages to access it let him know. He has permission to use another lattice volume belonging to users he works with, but that's only a single volume.)
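
If the download really is a zip inside a zip, something like this sketch could unpack both layers using only the standard library. `extract_nested_zip` is a hypothetical helper, and the double-zipped layout is an assumption based on Volker's description; nobody here has confirmed it against the actual archive.

```python
import zipfile
from pathlib import Path


def extract_nested_zip(archive, dest):
    """Extract `archive` into `dest`, then unpack any zip files found inside.

    Returns the list of directories created for the inner archives.
    """
    dest = Path(dest)
    with zipfile.ZipFile(archive) as outer:
        outer.extractall(dest)

    unpacked = []
    # sorted() materialises the listing before any new folders are created
    for inner in sorted(dest.rglob("*.zip")):
        if not zipfile.is_zipfile(inner):
            continue  # skip anything that is a zip in name only
        target = inner.with_suffix("")  # e.g. out/data.zip -> out/data/
        with zipfile.ZipFile(inner) as zf:
            zf.extractall(target)
        unpacked.append(target)
    return unpacked
```

Calling `extract_nested_zip("download.zip", "unpacked")` (both paths are placeholders) would leave the inner contents in a folder named after each inner zip.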

GenevieveBuckley avatar Nov 16 '20 23:11 GenevieveBuckley

@rxist525, do you have a workflow that you typically use on your data? If so, would you be able to share that as well? A bird's-eye view would be fine. Though a notebook would also be good if it exists 🙂

jakirkham avatar Nov 24 '20 20:11 jakirkham

here is a pdf with a good overview of our workflow. Almost all of our data goes through pre-processing. Post-processing and analysis routines are biology/dataset dependent.

rxist525 avatar Nov 27 '20 23:11 rxist525

Here's the pdf link: https://drive.google.com/file/d/1q79pFcA_oSexcLPxUZMM2rpm_TeOFNKJ/view?usp=sharing

rxist525 avatar Nov 27 '20 23:11 rxist525

Thanks Gokul!

cc @grlee77 (in case this is of interest 😉)

jakirkham avatar Nov 30 '20 19:11 jakirkham

Thanks Gokul! cc @grlee77 (in case this is of interest 😉)

Yes, thank you @gokul. For context, I have been working on CUDA-based implementations of classical (i.e. not deep learning) image processing operations and algorithms as found in scipy.ndimage and scikit-image and it is helpful to have feedback on which things to prioritize. My background is in volumetric medical imaging (MRI) rather than microscopy, so it is useful to know what types of operations are being used in the microscopy field. I have a good idea of what deskew, rotation and deconvolution involve, but if you have specific references or methods regarding which kind of segmentation algorithms, etc. are typically used, that could also be of use.

Also, is image denoising often used during pre-processing steps or is the data you typically work with already of adequate SNR?

grlee77 avatar Nov 30 '20 20:11 grlee77

Great to connect with you Gregory. Typically, for quantitative work, we strive to generate data with sufficient SNR such that existing algorithms/workflows are compatible. While detection algorithms are more sensitive than our eye at the edge cases with low SNR, to convey our findings in movies, we typically denoise the data. We also use denoising as a pre-processing step to aid in segmentation. Hope this helps, if not, happy to discuss further.

rxist525 avatar Dec 07 '20 10:12 rxist525