
Duplicate ids in Dolma v1.7

Open Vedaad-Shakib opened this issue 2 months ago • 0 comments

Hi,

While downloading and processing Dolma v1.7, I noticed that many samples in the dataset share the same id field. For example, in the Project Gutenberg source there are 175 duplicates that can be found just by looking at the id column. An example of a duplicate id is 8fddd3535f86e159339e1ff9be64fdda in the RefinedWeb split. This was surprising given that you had done significant deduplication in Dolma v1.7. Is this a bug in the dataset?
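For anyone who wants to reproduce a check like this, here is a minimal sketch. It assumes the shards are gzipped JSON Lines files in which every record has a top-level `id` field; the function names `count_ids` and `duplicate_ids` are hypothetical helpers for illustration, not part of any Dolma tooling, and this is not necessarily the exact script behind the numbers above.

```python
import gzip
import json
from collections import Counter
from pathlib import Path
from typing import Iterable


def count_ids(paths: Iterable[Path]) -> Counter:
    """Count how often each `id` value occurs across gzipped JSON Lines files.

    Assumption: one JSON object per line, each with a top-level "id" key.
    """
    counts: Counter = Counter()
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    counts[json.loads(line)["id"]] += 1
    return counts


def duplicate_ids(counts: Counter) -> dict:
    """Keep only the ids that occur more than once."""
    return {doc_id: n for doc_id, n in counts.items() if n > 1}
```

Calling `duplicate_ids(count_ids(shard_paths))` on the shards of a single source returns each repeated id with its number of occurrences. All ids are held in memory, so for the larger sources it is probably safer to run this one source at a time.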

Vedaad-Shakib, May 03 '24 20:05