NeMo-Curator
[BUG] Jaccard Shuffle error if shuffled_docs.parquet data already exists and has been written.
Describe the bug
Calling jaccard_shuffle on an output directory that already contains shuffled docs (shuffled_docs.parquet) written by a previous run fails with an assertion error:
assert bucket_part_start_offset % parts_per_bucket_batch == 0
AssertionError
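
A possible workaround until this is fixed is to clear the leftover output before re-running the shuffle. The sketch below assumes the stale data lives at shuffled_docs.parquet inside the output directory (the name from the title) and that it is safe to discard. clear_stale_shuffle_output is a hypothetical helper written for illustration, not part of NeMo-Curator.

```python
import shutil
from pathlib import Path


def clear_stale_shuffle_output(output_dir: str, name: str = "shuffled_docs.parquet") -> bool:
    """Delete leftover shuffle output from a previous run.

    Hypothetical helper: the default name is taken from the issue title;
    adjust it if your output is written elsewhere.
    Returns True if something was deleted, False if nothing was there.
    """
    stale = Path(output_dir) / name
    if stale.is_dir():
        # Parquet datasets are usually written as a directory of part files.
        shutil.rmtree(stale)
        return True
    if stale.exists():
        # Fall back to a single-file output.
        stale.unlink()
        return True
    return False
```

Calling clear_stale_shuffle_output(output_dir) right before the shuffle step discards the previous results, so only use it when those results are not needed.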