Add MedImg for streaming

Open lhallee opened this issue 1 year ago • 8 comments

Feature request

Host the MedImg dataset (similar to Imagenet but for biomedical images).

Motivation

There is a clear need for biomedical image foundation models and large scale biomedical datasets that are easily streamable. This would be an excellent tool for the biomedical community.

Your contribution

MedImg can be found here.

lhallee avatar May 22 '24 00:05 lhallee

@mariosasko, @lhoestq, @albertvillanova Hello! Can anyone help, or can you suggest who could help with this?

lhallee avatar May 31 '24 03:05 lhallee

Hi ! Feel free to download the dataset and create a Dataset object with it.

Then you'll be able to use push_to_hub() to upload the dataset to HF in Parquet format and make it streamable :)

lhoestq avatar May 31 '24 10:05 lhoestq
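A minimal sketch of the suggestion above (build a Dataset from local files, then call push_to_hub()), assuming the images have already been downloaded into a folder with one sub-folder per class. That layout, the extension list, the helper names, and the repo ID in the usage note are illustrative assumptions, not part of MedImg or of the reply above. The `datasets` imports are kept inside the functions so the file-listing helper can be tried without the library or a network connection.

```python
import os

# Assumed set of image extensions; adjust to whatever the dataset actually ships.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def is_image(name):
    """Return True if the file name has one of the assumed image extensions."""
    return os.path.splitext(name)[1].lower() in IMAGE_EXTS


def iter_image_records(root):
    """Yield one record per image, using the parent folder name as the label.

    The one-folder-per-class layout is an assumption made for this sketch.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # walk sub-folders in a deterministic order
        for name in sorted(filenames):
            if is_image(name):
                yield {
                    "image": os.path.join(dirpath, name),
                    "label": os.path.basename(dirpath),
                }


def build_and_push(root, repo_id):
    """Create a Dataset from the records and upload it to the Hub as Parquet."""
    from datasets import Dataset, Features, Image, Value

    features = Features({"image": Image(), "label": Value("string")})
    ds = Dataset.from_generator(
        iter_image_records, gen_kwargs={"root": root}, features=features
    )
    ds.push_to_hub(repo_id)  # requires being logged in to the Hub


def stream(repo_id):
    """Open the uploaded dataset in streaming mode (no full download)."""
    from datasets import load_dataset

    return load_dataset(repo_id, split="train", streaming=True)
```

With a hypothetical repo ID, usage would look like `build_and_push("/data/medimg", "your-username/medimg")` followed by `next(iter(stream("your-username/medimg")))` to read the first example.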

> Hi ! Feel free to download the dataset and create a Dataset object with it.
>
> Then you'll be able to use push_to_hub() to upload the dataset to HF in Parquet format and make it streamable :)

The dataset is several TB in total, which I do not have the resources to handle.

lhallee avatar Jun 03 '24 14:06 lhallee

Hi @lhoestq and @albertvillanova, just following up about this.

lhallee avatar Sep 05 '24 13:09 lhallee

For big datasets, you can push_to_hub one part at a time (e.g. as different splits) and then merge the parts (just a simple modification in the YAML part of the README).

lhoestq avatar Sep 05 '24 15:09 lhoestq
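The part-by-part upload described above could look roughly like the sketch below, which is one possible reading rather than a confirmed recipe. It chunks a list of local image paths and pushes each chunk as its own split, so only one chunk has to be handled at a time. The helper names, the `part_000`-style split names, and the chunk size are assumptions for illustration. Merging the parts afterwards is the README YAML edit mentioned in the comment and is not shown here.

```python
def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


def push_in_parts(image_paths, repo_id, part_size):
    """Push each chunk of images to the same repo as a separate split."""
    from datasets import Dataset, Image

    for idx, part in enumerate(chunk(list(image_paths), part_size)):
        # Store paths first, then cast so the column is treated as images.
        ds = Dataset.from_dict({"image": [str(p) for p in part]})
        ds = ds.cast_column("image", Image())
        # Hypothetical split naming: part_000, part_001, ...
        ds.push_to_hub(repo_id, split=f"part_{idx:03d}")
```

Because each call only touches one split, an interrupted upload can be resumed from the last part that was pushed instead of starting over.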

Sure, that makes sense. However, isn't there a size limit to what typical users can push?

lhallee avatar Sep 05 '24 16:09 lhallee

Yes, there is a limit. Simply let us know by email at datasets [at] huggingface.co; this way we can give you a storage grant and also help make sure the dataset is easy for people to use.

lhoestq avatar Sep 05 '24 16:09 lhoestq

> Yes, there is a limit. Simply let us know by email at datasets [at] huggingface.co; this way we can give you a storage grant and also help make sure the dataset is easy for people to use.

Got it, that would be great.

lhallee avatar Sep 05 '24 16:09 lhallee