NeMo-Curator
Remove dependency on `convert_str_id_to_int` in FuzzyDedup Scripts
Is your feature request related to a problem? Please describe.
The minhash script implicitly converts each string ID into two integer IDs (`doc_id` and `dataset_id`). This differs from the `FuzzyDuplicate(...)` API, where no such conversion is performed. Ideally, the scripts should not rely on converting string IDs to integer IDs at all.
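To illustrate the kind of conversion described above, here is a minimal pure-Python sketch. It is not the NeMo-Curator implementation: the function name `split_str_id`, the assumed ID layout `<dataset_name>-<number>`, and the SHA-1 based hash are all assumptions made for illustration.

```python
import hashlib


def split_str_id(str_id: str) -> tuple[int, int]:
    """Hypothetical stand-in for a str-to-int ID conversion.

    Assumes IDs look like "<dataset_name>-<number>" (an assumption,
    not something the issue states) and returns (doc_id, dataset_id).
    """
    # Split on the last "-" so dataset names may themselves contain dashes.
    dataset_name, _, doc_number = str_id.rpartition("-")
    if not dataset_name or not (doc_number.isascii() and doc_number.isdigit()):
        raise ValueError(f"ID {str_id!r} does not match '<dataset_name>-<number>'")

    doc_id = int(doc_number)
    # Derive a stable integer from the dataset name (illustrative hash choice).
    digest = hashlib.sha1(dataset_name.encode("utf-8")).digest()
    dataset_id = int.from_bytes(digest[:4], "big")
    return doc_id, dataset_id


doc_id, dataset_id = split_str_id("wiki-0000034")
print(doc_id)  # 34
```

Under these assumptions, any ID that does not follow the expected layout (for example a UUID-style string with no numeric suffix) raises an error, which is one reason a script that depends on this kind of conversion is more restrictive than an API that uses the string ID as-is.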