Providing dataset size
Is your feature request related to a problem? Please describe. Especially for big datasets like LAION, it's hard to know the exact downloaded size, because there are many files and their individual sizes aren't known until they have been downloaded.
Describe the solution you'd like Auto-populating the downloaded dataset size on the dataset page would be really useful, including the size of each split (when there are any).
Describe alternatives you've considered People should be adding this to dataset cards, but I don't think that is systematically the case :slightly_smiling_face:
Additional context Mentioned to @lhoestq
Hi @sashavor, thanks for your suggestion.
So far, we have the CLI command
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
that generates the dataset_infos.json with the size of the downloaded dataset, among other information.
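For anyone who wants to read those sizes back out, here is a minimal sketch. It assumes the generated dataset_infos.json maps each config name to an object with `download_size`, `dataset_size`, and a `splits` mapping whose entries carry `num_bytes`; the helper names `summarize_infos` and `load_and_summarize` are made up for illustration and are not part of the library.

```python
import json
from pathlib import Path


def summarize_infos(infos):
    """Collect per-config and per-split sizes (in bytes) from a parsed dataset_infos.json.

    Assumes the layout {config: {"download_size": int, "dataset_size": int,
    "splits": {split: {"num_bytes": int, ...}}}}; missing keys become None.
    """
    summary = {}
    for config_name, info in infos.items():
        splits = info.get("splits") or {}
        summary[config_name] = {
            "download_size": info.get("download_size"),
            "dataset_size": info.get("dataset_size"),
            "splits": {name: split.get("num_bytes") for name, split in splits.items()},
        }
    return summary


def load_and_summarize(path):
    """Read a dataset_infos.json file from disk and summarize it."""
    return summarize_infos(json.loads(Path(path).read_text(encoding="utf-8")))
```

Calling `load_and_summarize("datasets/<your-dataset-folder>/dataset_infos.json")` would then return one entry per config with its download size, on-disk size, and per-split byte counts.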
We are currently in the middle of removing those JSON files and putting their information directly in the header of the README.md (as YAML tags). Normally, the CLI command should continue working but saving its output to the dataset card instead. See:
- #4926
Additionally, the download size can be inferred by sending HEAD requests to the files to be downloaded, and for files hosted on the Hub you can even get the file sizes using the Hub API.
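A rough sketch of both approaches, in case it helps. The HEAD-request part uses only the standard library and assumes the server reports a `Content-Length` header (not every server does, so files with unknown size are counted separately). The Hub part assumes the `huggingface_hub` package is installed and that `HfApi.dataset_info(..., files_metadata=True)` fills in `size` on each entry in `siblings`. The function names and the `opener` parameter are hypothetical helpers, not library API.

```python
from urllib.request import Request, urlopen


def content_length(url, opener=urlopen):
    """Return the Content-Length reported for `url` by a HEAD request, or None."""
    request = Request(url, method="HEAD")
    with opener(request) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None


def total_download_size(urls, opener=urlopen):
    """Sum the reported sizes of `urls`.

    Returns (total_bytes, number_of_urls_with_unknown_size).
    """
    sizes = [content_length(url, opener) for url in urls]
    known = [size for size in sizes if size is not None]
    return sum(known), len(sizes) - len(known)


def hub_file_sizes(repo_id):
    """Return {filename: size_in_bytes} for a dataset repo hosted on the Hub.

    Assumption: huggingface_hub is installed; it is imported lazily so the
    HEAD-request helpers above work without it.
    """
    from huggingface_hub import HfApi

    info = HfApi().dataset_info(repo_id, files_metadata=True)
    return {sibling.rfilename: sibling.size for sibling in (info.siblings or [])}
```

The `opener` argument is only there so the HEAD logic can be exercised without network access; in normal use you would call `total_download_size(list_of_urls)` and leave it at its default.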
Amazing @albertvillanova ! I think just having that information visible in the dataset info (without having to do any requests/additional coding) would be really useful :hugs: