Update deduplication scripts

Open sarahyurick opened this issue 6 months ago • 0 comments

When running the extract_dedup_data.py script, the user may see the following warning, followed by an empty DataFrame:

UserWarning: Insufficient elements for `head`. 10 elements requested, only 0 elements available. Try passing larger `npartitions` to `head`.
  warnings.warn(
Empty DataFrame
Columns: [cluster, cosine_dist_to_cent, doc_id]
Index: []

This happens when print(dedup_id_dataset.df.head(10)) is called and no semantic duplicates were found. We should add an if/else block here so that the user gets a clear message in that case instead of a warning and an empty DataFrame.
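
A minimal sketch of what that if/else block could look like. The helper name preview_duplicates and the message text are hypothetical, not part of the current script; the only assumption about the input is that it exposes a head(n) method whose result supports len(), as Dask and pandas DataFrames do.

```python
import warnings


def preview_duplicates(df, n=10):
    """Print up to n rows of IDs to remove, or a clear message if there are none.

    Hypothetical helper: `df` is assumed to be any DataFrame-like object
    with a `head(n)` method (for example, `dedup_id_dataset.df`).
    Returns True if at least one row was found, False otherwise.
    """
    with warnings.catch_warnings():
        # Suppress UserWarnings such as "Insufficient elements for `head`"
        # while fetching the preview; the empty case is reported explicitly below.
        warnings.simplefilter("ignore", category=UserWarning)
        preview = df.head(n)

    if len(preview) == 0:
        print("No semantic duplicates found; there are no IDs to remove.")
        return False
    else:
        print(preview)
        return True
```

In the script, the existing print call would then become something like preview_duplicates(dedup_id_dataset.df). Note that with a multi-partition Dask DataFrame, head only inspects the first partition by default, so an empty preview does not strictly prove that the whole result is empty; a stricter check would need to count rows across all partitions.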

Additionally, the name semdedup_extract_unique_ids for this script is out of date: the script returns a list of IDs to be removed, not the unique IDs, and the name should be updated to reflect that. The README for this script should be updated as well.

While we are at it, we should double-check that the output of the fuzzy and exact deduplication scripts is equally clear to the user.

sarahyurick · May 19 '25 19:05