
Mismatch of the Available Data Quantity on Huggingface

Open cll-mtk opened this issue 1 year ago • 0 comments

I recently tried to download the English portion of ROOTS. According to the paper, it contains 484,953,009,124 bytes of English data (roughly 485 GB). However, after downloading every ROOTS-related dataset I could find on Hugging Face by filtering, I ended up with only about 43.8 GB. How can this difference be explained? Are the Hugging Face datasets only a subset of ROOTS? Or have they been processed in a way that shrinks the English data from about 485 GB to 43.8 GB?
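A quick sketch of the arithmetic behind the gap, assuming the 43.8 GB measurement is in decimal gigabytes (10^9 bytes). The constant and helper names are illustrative only and do not come from the ROOTS tooling:

```python
# Size reported in the paper for the English portion of ROOTS.
PAPER_ENGLISH_BYTES = 484_953_009_124

# Size observed after downloading from Hugging Face.
# Assumption: this figure is in decimal GB, not GiB.
DOWNLOADED_GB = 43.8


def bytes_to_gb(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_bytes / 1e9


def bytes_to_gib(n_bytes: int) -> float:
    """Convert bytes to binary gibibytes (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


paper_gb = bytes_to_gb(PAPER_ENGLISH_BYTES)
paper_gib = bytes_to_gib(PAPER_ENGLISH_BYTES)
fraction = DOWNLOADED_GB / paper_gb

print(f"paper:      {paper_gb:.1f} GB ({paper_gib:.1f} GiB)")
print(f"downloaded: {DOWNLOADED_GB:.1f} GB")
print(f"downloaded / paper: {fraction:.1%}")
```

Under that assumption, the downloaded data is roughly 9% of the size reported in the paper, so the gap is far larger than a GB versus GiB unit mismatch could explain.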

cll-mtk • May 02 '23 05:05