NeMo-Curator

[BUG] Jaccard Shuffle error if shuffled_docs.parquet data already exists and has been written.

Open ayushdg opened this issue 1 year ago • 0 comments

Describe the bug

Calling jaccard_shuffle on an output directory that already contains shuffled docs from a previous run fails with an assertion error:

 assert bucket_part_start_offset % parts_per_bucket_batch == 0
AssertionError
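
Until this is fixed, one possible workaround is to clear the stale output before re-running the shuffle. The sketch below is only an illustration: `clear_stale_shuffle_output` is a hypothetical helper, not part of NeMo-Curator. It assumes the previous run left its output under the name `shuffled_docs.parquet` (taken from the issue title) inside the output directory.

```python
import shutil
from pathlib import Path


def clear_stale_shuffle_output(output_dir, name="shuffled_docs.parquet"):
    """Hypothetical helper: delete leftover shuffle output from an earlier run.

    Returns True if something was removed, False if there was nothing to remove.
    """
    target = Path(output_dir) / name
    if target.is_dir():
        # Parquet datasets are usually written as a directory of part files.
        shutil.rmtree(target)
        return True
    if target.exists():
        # Handle the single-file case as well.
        target.unlink()
        return True
    return False
```

Calling this on the output directory right before the shuffle step should give the new run an empty target. It discards the earlier results, so it is only appropriate when those are not needed.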

ayushdg, May 03 '24 23:05