
Faster/More efficient duplicate removal for exact/fuzzy dedup.

Open ayushdg opened this issue 1 year ago • 1 comment

Is your feature request related to a problem? Please describe. The current deduplication examples suggest calling compute on the list of duplicate documents produced by exact/fuzzy deduplication and then using the computed list to filter the input documents. This doesn't work when the duplicate list is too large to fit on the client. Ideally, Curator could provide additional classes/methods to remove the documents in the duplicate list more efficiently.
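For reference, the pattern the current examples follow looks roughly like the sketch below (the paths, the `id` column name, and the file layout are hypothetical assumptions for illustration, not the actual example code): the full duplicate-ID list is materialized on the client with `.compute()` and then used to filter the corpus, which breaks once that list no longer fits in client memory.

```python
import dask.dataframe as dd

# Hypothetical paths and column names, for illustration only.
docs = dd.read_parquet("input_docs/")        # full corpus as a Dask DataFrame
dupes = dd.read_parquet("duplicate_ids/")    # IDs flagged by exact/fuzzy dedup

# Current pattern: pull every duplicate ID back to the client ...
dup_ids = dupes["id"].compute()              # fails if the list exceeds client memory

# ... then filter the corpus against that in-memory list.
deduped = docs[~docs["id"].isin(dup_ids.tolist())]
deduped.to_parquet("deduped_docs/")
```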

Describe the solution you'd like A broadcast merge approach like the one suggested by @VibhuJawa works well enough at the 4-8TB scale, where the duplicate list is small enough to be broadcast to each worker, and is worth implementing first. Longer term there may be a need for smarter partitioning of the duplicate list, so that different files/subsets of the corpus can each handle their own portion of the duplicates.
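A minimal sketch of the broadcast-merge idea, assuming the duplicate-ID frame fits in worker memory (plain Dask shown; the paths, column names, and reliance on the `broadcast=True` merge hint are assumptions for illustration, not an existing Curator API):

```python
import dask.dataframe as dd

docs = dd.read_parquet("input_docs/")        # full corpus, hypothetical path
dupes = dd.read_parquet("duplicate_ids/")    # small duplicate-ID list, hypothetical path

# Tag every duplicate ID, then left-join it onto the corpus. With a small
# right side, Dask can broadcast it to each worker instead of shuffling the
# multi-TB corpus.
dupes = dupes[["id"]].assign(_is_dup=1)
merged = docs.merge(dupes, on="id", how="left", broadcast=True)

# Keep only rows that found no match in the duplicate list.
deduped = merged[merged["_is_dup"].isna()].drop(columns="_is_dup")
deduped.to_parquet("deduped_docs/")
```

This avoids ever collecting the duplicate list on the client; the longer-term partitioned approach would instead align duplicate-list partitions with corpus files so no broadcast is needed at all.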

Describe alternatives you've considered N/A

Additional context The Zyda-2 tutorial and the pre-training data tutorial both contain alternative approaches to calling compute, since it is memory intensive.

ayushdg · Oct 29 '24