Question about ROOTS corpus: availability & earlier web data

Open HarideP opened this issue 3 months ago • 0 comments

Hi ROOTS / BigScience,

First, many thanks for ROOTS — it's an awesome multilingual dataset that’s super helpful.

I have a few questions:

Is there a way to access the full ROOTS corpus (beyond the “large initial subset”)? Is the full version publicly downloadable?

Does anyone know whether ROOTS or related BigScience projects have plans or workflows for collecting web text from before 2008? Are there any archives, tools, or datasets that people have used for that time period?

If I wanted to combine ROOTS with other historical web datasets (or reconstruct earlier web snapshots), would the preprocessing / filtering tools from the data-preparation GitHub repo be helpful for that?

Thanks a lot for any pointers or suggestions!

Best, Patrick

HarideP, Sep 24 '25 14:09