dask-image
Example image data for dask-image
dask-image example datasets
We need some good example data for tutorials with dask-image.
This issue is a place for discussion and suggestions. If you have links, add them here!
Ideally this data should:
- Have a permissive license
- Be easily downloaded on demand by the user
- Be big, but not too big. We want something that will automatically be spread over a few dask chunks, but not too large to download. 1 - 2 GB? What makes sense here?
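To make the "big, but not too big" question concrete, here is a minimal sketch that estimates how a candidate dataset would split into chunks. The shape, dtype size, and chunk shape below are hypothetical values chosen for illustration, not a recommendation from this thread.

```python
import math


def chunk_summary(shape, chunks, itemsize):
    """Return (total_bytes, chunk_bytes, n_chunks) for a regularly chunked array."""
    total_bytes = math.prod(shape) * itemsize
    chunk_bytes = math.prod(chunks) * itemsize
    # Edge chunks may be smaller than `chunks`, but each still counts as one chunk.
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    return total_bytes, chunk_bytes, n_chunks


# Hypothetical example: 500 time points of 1024 x 1024 uint16 frames,
# chunked in blocks of 50 frames.
total, per_chunk, n = chunk_summary((500, 1024, 1024), (50, 1024, 1024), itemsize=2)
print(f"{total / 2**30:.2f} GiB in {n} chunks of {per_chunk / 2**20:.0f} MiB")
# prints "0.98 GiB in 10 chunks of 100 MiB"
```

A dataset of roughly this size would sit at the low end of the 1 - 2 GB range while still giving a handful of chunks to work with.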
It would be nice to have
- scientific images (microscopy, astronomy, satellite/geo-spatial images, maybe histology slides)
- a filetype we don't need another third party library to open (we have pims already, something it could handle would be ideal)
- the data hosted by someone else, but in a stable situation where we can reasonably count on that continuing
What we want to avoid:
- Websites that make you register (even for free) before you can download data.
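For the "easily downloaded on demand" requirement, one possible shape for a helper is sketched below using only the standard library. The `fetch` and `sha256_of` names are hypothetical (they are not part of dask-image), and the checksum step is an assumption about what a tutorial might want, not something decided in this issue.

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_of(path, block_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def fetch(url, dest, expected_sha256=None):
    """Download `url` to `dest` if it is not already there, then verify it."""
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
    # Verification is optional because many hosts do not publish checksums.
    if expected_sha256 is not None and sha256_of(dest) != expected_sha256:
        raise ValueError(f"checksum mismatch for {dest}")
    return dest
```

A tutorial could then call something like `fetch(url, "data/example.zip")` once at the top of a notebook; because the download is skipped when the file already exists, re-running the notebook does not fetch the data again.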
EDIT: https://github.com/napari/napari/issues/316
Just saw this tweet announcing a human brain MRI at 100 µm isotropic resolution. This could be a very cool dataset to use as a napari demo. I suggest we use this issue to keep track of datasets that we could put in napari once we have proper data downloading. Please just edit the checklist below to add your preferred demo data.
* [ ] 100 µm resolution human brain: https://twitter.com/ComaRecoveryLab/status/1134436231775961088
* [ ] 10m resolution vegetation cover in Victoria: http://francois-petitjean.com/Research/MonashVegMap/info.php and https://labo.obs-mip.fr/multitemp/mapping-a-part-of-australia-at-10-m-resolution/
* [ ] correlative superres https://www.biorxiv.org/content/10.1101/773986v1.abstract
* [x] SARS-CoV2 in gut epithelium https://twitter.com/notjustmoore/status/1256232842755014656
* [ ] developing sea squirt https://www.nytimes.com/2020/07/09/science/sea-squirts-embryos.html
* [ ] mechanobiology of intestinal organoids https://twitter.com/XavierTrepat/status/1308026944349450241
* [ ] tracking of particles on astral microtubules ([paper](https://www.biorxiv.org/content/10.1101/2020.06.17.154260v1), [tweet](https://twitter.com/the_Node/status/1341050276011237379)), could make a really neat demo for the tracks layer.
* [ ] [Sentinel-2 1y Cloud optimised geotiff dataset](https://medium.com/sentinel-hub/digital-twin-sandbox-sentinel-2-collection-available-to-everyone-20f3b5de846e)
* [ ] Calcium imaging in the Drosophila ellipsoid body ([2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3830704/) and [2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4704792/))
* [ ] Janelia FlyLight data ([AWS](https://registry.opendata.aws/janelia-flylight/))
* [ ] Fruit Fly Brain Observatory (FFBO) ([tweet](https://twitter.com/FlyBrainObs/status/1369496338266750977))
* [ ] Janelia [Open Organelle datasets](https://openorganelle.janelia.org/)
* [ ] CZBiohub [Open Cell](https://opencell.czbiohub.org/about)
* [ ] [CoCo](https://cocodataset.org/#download) + [Voxel51 datasets](https://voxel51.com/docs/fiftyone/user_guide/using_datasets.html)
* [ ] @tlambert03's lattice light sheet dataset used in the dask application post https://www.ebi.ac.uk/biostudies/studies/S-BSST435?query=Talley%20lambert
two cryo-ET datasets to add to the pile
- https://zenodo.org/record/6504891 - deconvolved tomogram of HIV virus-like particles + annotations
- https://zenodo.org/record/6504949 - denoised tomogram of a M. pneumoniae cell
There are also some more datasets here (developing Tribolium embryos, mouse brain slices and mouse colon volumes):
https://zenodo.org/record/4276076#.YYJKMWDMJaR
https://github.com/napari/napari/issues/316#issuecomment-952642188
... and this 3D cell tracking dataset is gorgeous! http://celltrackingchallenge.net/
C. elegans developing embryo
Waterston Lab, University of Washington, Seattle, WA, USA
Training dataset: http://data.celltrackingchallenge.net/training-datasets/Fluo-N3DH-CE.zip (3.1 GB)
Challenge dataset: http://data.celltrackingchallenge.net/challenge-datasets/Fluo-N3DH-CE.zip (1.7 GB)
Microscope: Zeiss LSM 510 Meta
Objective lens: Plan-Apochromat 63x/1.4 (oil)
Voxel size (microns): 0.09 x 0.09 x 1.0
Time step (min): 1 (1.5)
Additional information: Nature Methods, 2008
The Cancer Genome Atlas Database might work: https://portal.gdc.cancer.gov/
There are some histology images there that could fit the requirements. They do have file formats that would probably need a third party library to read into python, but you can download individual images separately pretty easily.
The xarray examples (like this one, or this one) sometimes use NetCDF climate data from the Climate Data Store. Website: https://cds.climate.copernicus.eu
- A user login might be required in all cases
- There is a python API for data downloads
- xarray can open NetCDF files fairly easily if netCDF4 is also installed. This might be a lot easier than trying to fiddle with user java installations for python-bioformats.
https://towardsdatascience.com/handling-netcdf-files-using-xarray-for-absolute-beginners-111a8ab4463f
Nick took a histology CC0 image and converted it to zarr - https://camelyon16.grand-challenge.org/Data/
Edit: updated link - https://camelyon17.grand-challenge.org/Data/
The landsat data might also be good https://landsat.gsfc.nasa.gov/data/
Some people have said they think Landsat is CC0 licensed, but I haven't found that page on the website yet, so we'd better double-check.
Here's a wrapper around the API to make it easier: https://github.com/loicdtx/lsru
And from the napari discussions https://github.com/napari/napari/issues/408#issuecomment-511214119
More info on Landsat 8 can be found at: https://landsat.gsfc.nasa.gov/landsat-8/mission-details/ I used the https://github.com/loicdtx/lsru to order the imagery, folks can also download Landsat 8 with https://earthexplorer.usgs.gov/
cc @scottyhq who knows a bunch about landsat
(although, Scott is also a big xarray user, and we might want to avoid Xarray for this example in order to keep things focused on Dask Image)
Thanks @mrocklin. Yes, we've used landsat8 for some examples since it is a public dataset on AWS and Google Cloud. Here is a blog post with some background: https://medium.com/pangeo/cloud-native-geoprocessing-of-earth-observation-satellite-data-with-pangeo-997692d91ca2, or if you just want to take a look at a notebook: https://github.com/scottyhq/esip-tech-dive/blob/master/notebooks/0-demo-aws.ipynb. As mentioned, these examples demonstrate using xarray integrated with dask.
Thank you @scottyhq I'll take a look at those links and see if we can't get something up and running
@jakirkham someone just asked me if we could use some of the images you link to from your blog post on loading image data. They might or might not be good candidates, but I ALSO notice that there is no licence with that data. Is that an oversight?
More code from @timothywallaby, looping through a large image and appending to zarr: https://github.com/timothywallaby/dask/blob/master/OpenSlidetoZarr.ipynb
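As a rough illustration of the "loop through a large image and append" idea, the sketch below generates tile slices that cover an image too large to read in one go. It is not taken from the linked notebook; `iter_tiles` is a hypothetical helper, and the reader and zarr store in the comment are placeholders rather than real APIs.

```python
def iter_tiles(height, width, tile_size):
    """Yield (row_slice, col_slice) pairs that together cover a height x width image."""
    for row in range(0, height, tile_size):
        for col in range(0, width, tile_size):
            # Clip the last tile in each direction to the image bounds.
            yield (
                slice(row, min(row + tile_size, height)),
                slice(col, min(col + tile_size, width)),
            )


# Hypothetical usage; `read_region` and `store` are placeholders, not real APIs:
# for rows, cols in iter_tiles(height, width, 4096):
#     store[rows, cols] = read_region(rows, cols)

tiles = list(iter_tiles(10_000, 6_000, 4096))
print(len(tiles))  # 3 rows x 2 columns of tiles -> 6
```

Matching the tile size to the chunk size of the output array means each write touches exactly one chunk, which is usually the simplest arrangement to reason about.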
@sofroniewn do you have the code you used to convert the Camelyon data to zarr?
The code from @sofroniewn is here: https://github.com/sofroniewn/image-demos/blob/master/helpers/make_2D_zarr_pathology.py
The instructions were not to use it as is until we can work out why the saved file is bigger than the original tiff. Personally I also feel that for this purpose we don't really need the multilevel hierarchy, so that might make things a bit simpler.
@jakirkham someone just asked me if we could use some of the images you link to from your blog post on loading image data.
@rxist525, what do you think? Would it be ok to use that data for code examples here?
Absolutely!!
you are welcome to use the data, which is part of a recent publication.
Is there a license for that dataset @rxist525?
good question, let me check and get back.
Also a potentially useful discussion: https://github.com/thewtex/fiber-bed-zarr/issues/1#issuecomment-595984988
Just got off a call with Gokul earlier, he mentioned they've now added the CC BY-SA 4.0 license with the data (https://drive.google.com/drive/folders/1z1nB_DRgXYWwuUBEHYvj5hVotnAlR3W4). They are potentially open to changing it, though, if it causes issues. Feel free to correct me Gokul if needed.
Apologies for dropping the ball on this - thanks for adding the note!
Juan says on the napari zulip:
Talley's lattice dataset at bioimage archive has accession S-BSST435 See this page for how to access data, neither Talley nor I have actually tried to get it out yet :joy: https://www.ebi.ac.uk/biostudies/help
Link: https://www.ebi.ac.uk/biostudies/studies/S-BSST435
(Note: Volker tried to download the sample file, but couldn't unzip it properly. He thinks it was uploaded as a zip, which has also been zipped again by bioimage archive. He says if other people can access it to let him know. He has permission to use another lattice volume belonging to users he works with, but that's only a single volume.)
@rxist525, do you have a workflow that you typically use on your data? If so, would you be able to share that as well? A bird's-eye view would be fine. Though a notebook would also be good if it exists.
here is a pdf with a good overview of our workflow. Almost all of our data goes through pre-processing. Post-processing and analysis routines are biology/dataset dependent.
Here's the pdf link: https://drive.google.com/file/d/1q79pFcA_oSexcLPxUZMM2rpm_TeOFNKJ/view?usp=sharing
Thanks Gokul!
cc @grlee77 (in case this is of interest 😉)
Yes, thank you @gokul. For context, I have been working on CUDA-based implementations of classical (i.e. not deep learning) image processing operations and algorithms as found in scipy.ndimage and scikit-image and it is helpful to have feedback on which things to prioritize. My background is in volumetric medical imaging (MRI) rather than microscopy, so it is useful to know what types of operations are being used in the microscopy field. I have a good idea of what deskew, rotation and deconvolution involve, but if you have specific references or methods regarding which kind of segmentation algorithms, etc. are typically used, that could also be of use.
Also, is image denoising often used during pre-processing steps or is the data you typically work with already of adequate SNR?
Great to connect with you Gregory. Typically, for quantitative work, we strive to generate data with sufficient SNR such that existing algorithms/workflows are compatible. While detection algorithms are more sensitive than our eye at the edge cases with low SNR, to convey our findings in movies, we typically denoise the data. We also use denoising as a pre-processing step to aid in segmentation. Hope this helps, if not, happy to discuss further.