NeMo-Curator
Update deduplication scripts
When running the extract_dedup_data.py script, the user may encounter the following warning, followed by an empty DataFrame:
UserWarning: Insufficient elements for `head`. 10 elements requested, only 0 elements available. Try passing larger `npartitions` to `head`.
warnings.warn(
Empty DataFrame
Columns: [cluster, cosine_dist_to_cent, doc_id]
Index: []
This happens when print(dedup_id_dataset.df.head(10)) is called and no semantic duplicates were found. We should add an if/else block here so that the user gets a clear message instead of a warning and an empty DataFrame.
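A minimal sketch of what such an if/else block could look like is given here. The helper name summarize_duplicates and the message wording are hypothetical and do not come from the script. It assumes dedup_id_dataset.df is a Dask DataFrame whose head method accepts the npartitions argument mentioned in the warning.

```python
def summarize_duplicates(ddf, n=10):
    """Return a user-facing message about the IDs flagged for removal.

    Hypothetical helper: `ddf` is assumed to be a Dask DataFrame such as
    dedup_id_dataset.df.
    """
    # npartitions=-1 asks Dask to look at every partition, so `head` does
    # not warn just because the first partition has too few rows.
    preview = ddf.head(n, npartitions=-1)

    if len(preview) == 0:
        # Nothing was flagged, so say so instead of printing an empty frame.
        return "No semantic duplicates found; there are no IDs to remove."
    else:
        return f"Showing {len(preview)} IDs flagged for removal:\n{preview}"
```

In the script, print(summarize_duplicates(dedup_id_dataset.df)) would then replace the direct print of head(10). Scanning every partition costs more than reading only the first one, so this is a trade-off the maintainers may want to weigh.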
Additionally, the name semdedup_extract_unique_ids for this script is out of date. It should be renamed to reflect what the script actually does, which is return a list of IDs to be removed. The README for this script should be updated as well.
While we are at it, we should double-check that the fuzzy and exact deduplication scripts are clear to the user.