[BUG] Jaccard Shuffle error if merge result is empty

Open ayushdg opened this issue 1 year ago • 3 comments

Describe the bug

If the merge result between the text DataFrame and the bucket mapping DataFrame is empty for any iteration, the logic fails.

The failure is observed here, but it originates from the merge result at https://github.com/NVIDIA/NeMo-Curator/blob/fe9fd6f46a932689ba036c623b2737298478c8ea/nemo_curator/utils/fuzzy_dedup_utils/shuffle_utils.py#L144 being empty. Still working on a minimal repro.

Additional context

The fix should be to skip the iteration and continue the loop when the merge result is a zero-length DataFrame.
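A minimal sketch of that guard, written against pandas rather than the actual NeMo-Curator code. The function name `merge_batches`, the `id` join key, and the sample data are all hypothetical and only illustrate the "skip empty merge results" pattern:

```python
import pandas as pd


def merge_batches(text_batches, bucket_mapping_df, key="id"):
    """Hypothetical loop: merge each text batch with the bucket mapping,
    skipping any iteration whose merge result is empty."""
    results = []
    skipped = 0
    for text_df in text_batches:
        merged = text_df.merge(bucket_mapping_df, on=key, how="inner")
        if len(merged) == 0:
            # Nothing to shuffle for this batch, so move on instead of
            # passing an empty frame to later reductions.
            skipped += 1
            continue
        results.append(merged)
    return results, skipped


# Illustrative data: the second batch shares no ids with the mapping.
text_batches = [
    pd.DataFrame({"id": [1, 2], "text": ["a", "b"]}),
    pd.DataFrame({"id": [98, 99], "text": ["c", "d"]}),
]
bucket_mapping_df = pd.DataFrame({"id": [1, 2, 3], "bucket": [10, 11, 12]})

results, skipped = merge_batches(text_batches, bucket_mapping_df)
```

Here the second batch produces an empty merge and is skipped, so only the first batch reaches the downstream steps.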

The error looks like: ValueError: zero-size array to reduction operation maximum which has no identity
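For reference, NumPy raises a `ValueError` with this wording when a maximum is taken over an empty array. The snippet below only reproduces the message in isolation and shows a size check that avoids it; which call raises it inside the shuffle code is not stated in this issue, so treat the connection as an assumption:

```python
import numpy as np

# An empty array, standing in for a column of an empty merge result.
empty = np.array([], dtype=np.int64)

try:
    np.max(empty)
    message = None
except ValueError as err:
    message = str(err)

# Guarded version: only reduce when there is at least one element.
safe_max = int(empty.max()) if empty.size > 0 else None
```

Checking the size before the reduction is the same idea as skipping the loop iteration when the merged DataFrame has no rows.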

ayushdg, May 02 '24 23:05