NeMo-Curator

Remove dependency on `convert_str_id_to_int` in FuzzyDedup Scripts

Open praateekmahajan opened this issue 11 months ago • 0 comments

Is your feature request related to a problem? Please describe.
In the minhash script we implicitly convert each string ID into two integer IDs (doc_id + dataset_id). This differs from the FuzzyDuplicate(...) API, where no such conversion is performed. Ideally, we would not rely on converting string IDs to integer IDs at all.
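To make the concern concrete, here is a minimal sketch of what a string-to-two-integers conversion of this kind might look like. It is not the actual `convert_str_id_to_int` implementation: the function name `split_str_id`, the assumed `<dataset>-<number>` ID layout, and the use of `zlib.crc32` as the hash are all assumptions made for illustration.

```python
import zlib


def split_str_id(str_id: str) -> tuple[int, int]:
    """Hypothetical stand-in for a str -> (doc_id, dataset_id) conversion.

    Assumes IDs look like "<dataset>-<number>"; that layout is an
    assumption for this sketch, not something the issue specifies.
    """
    # Split on the last "-" so dataset names may themselves contain dashes.
    prefix, sep, suffix = str_id.rpartition("-")
    if not sep or not (suffix.isascii() and suffix.isdigit()):
        raise ValueError(f"cannot convert {str_id!r} to integer ids")
    doc_id = int(suffix)
    # Any stable hash would do here; crc32 is just an illustrative choice.
    dataset_id = zlib.crc32(prefix.encode("utf-8"))
    return doc_id, dataset_id


# IDs that follow the assumed layout convert; anything else fails.
for str_id in ["wiki-0000000042", "wiki-0000000043", "a1b2c3d4e5"]:
    try:
        print(str_id, "->", split_str_id(str_id))
    except ValueError as err:
        print("error:", err)
```

Under these assumptions, any dataset whose IDs do not match the expected layout (UUIDs, plain hashes, and so on) cannot go through the script path, which is the kind of coupling this issue asks to remove.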


praateekmahajan, Dec 20 '24 12:12