
Providing dataset size

Open sashavor opened this issue 3 years ago • 3 comments

Is your feature request related to a problem? Please describe.
Especially for big datasets like LAION, it's hard to know the exact downloaded size (because there are many files and you don't have their exact sizes when they are downloaded).

Describe the solution you'd like
Auto-populating the downloaded dataset size on the dataset page would be really useful, including the size of each split (when there are splits).

Describe alternatives you've considered
People should be adding this to dataset cards, but I don't think that is systematically the case :slightly_smiling_face:

Additional context
Mentioned to @lhoestq

sashavor avatar Sep 14 '22 13:09 sashavor

Hi @sashavor, thanks for your suggestion.

For now, we have the CLI command

datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs

that generates the dataset_infos.json with the size of the downloaded dataset, among other information.
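
As a rough illustration of how those sizes could be read back, here is a minimal sketch. It assumes the generated JSON maps each config name to an object with `download_size`, `dataset_size`, and a `splits` mapping whose entries carry `num_bytes`; the helper name `summarize_dataset_infos` and the sample values are hypothetical, not output from a real dataset.

```python
import json


def summarize_dataset_infos(infos: dict) -> dict:
    """Collect per-config and per-split sizes (in bytes) from a parsed infos dict.

    Assumes the layout {config: {"download_size", "dataset_size", "splits": {...}}}.
    Missing fields are reported as None rather than guessed.
    """
    summary = {}
    for config_name, info in infos.items():
        splits = info.get("splits") or {}
        summary[config_name] = {
            "download_size": info.get("download_size"),
            "dataset_size": info.get("dataset_size"),
            "splits": {name: split.get("num_bytes") for name, split in splits.items()},
        }
    return summary


# Illustrative content only; a real file would be loaded with json.load(open(path)).
example_infos = json.loads("""
{
  "default": {
    "download_size": 1000,
    "dataset_size": 2500,
    "splits": {
      "train": {"name": "train", "num_bytes": 2000, "num_examples": 20},
      "test": {"name": "test", "num_bytes": 500, "num_examples": 5}
    }
  }
}
""")

summary = summarize_dataset_infos(example_infos)
print(summary["default"]["splits"])
```

If the file layout differs from what is assumed here, only the key names inside `summarize_dataset_infos` would need to change.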

We are currently in the middle of removing those JSON files and putting their information directly in the header of the README.md (as YAML tags). Normally, the CLI command should continue to work, but save its output to the dataset card instead. See:

  • #4926

albertvillanova avatar Sep 15 '22 06:09 albertvillanova

Additionally, the download size can be inferred by sending HEAD requests to the files to be downloaded. And for files hosted on the Hub, you can even get the file sizes using the Hub API.
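
A minimal sketch of the HEAD-request idea, using only the Python standard library. The function names are hypothetical, and it assumes the servers answer HEAD requests with a `Content-Length` header; files for which the header is missing are reported separately instead of being counted as zero.

```python
import urllib.request
from typing import Callable, Iterable, List, Optional, Tuple


def head_content_length(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the Content-Length reported for a HEAD request, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None


def total_download_size(
    urls: Iterable[str],
    get_size: Callable[[str], Optional[int]] = head_content_length,
) -> Tuple[int, List[str]]:
    """Sum the known file sizes and list the URLs whose size could not be determined."""
    known_bytes = 0
    unknown = []
    for url in urls:
        size = get_size(url)
        if size is None:
            unknown.append(url)
        else:
            known_bytes += size
    return known_bytes, unknown


# Offline usage with a stand-in size lookup (no network access needed).
fake_sizes = {"a.parquet": 10, "b.parquet": None, "c.parquet": 5}
print(total_download_size(fake_sizes, fake_sizes.get))
```

Passing the size lookup in as `get_size` keeps the summation independent of the network. For Hub-hosted files, `huggingface_hub`'s `HfApi.dataset_info(..., files_metadata=True)` should expose per-file sizes without one request per file, though that call is not exercised in this sketch.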

lhoestq avatar Sep 15 '22 15:09 lhoestq

Amazing, @albertvillanova! I think just having that information visible in the dataset info (without having to do any requests or additional coding) would be really useful :hugs:

sashavor avatar Sep 15 '22 16:09 sashavor