
Fuzzy dedup on CPU

Open simplew2011 opened this issue 1 year ago • 6 comments

Can you release a CPU version of fuzzy dedup?

simplew2011 avatar Jun 07 '24 03:06 simplew2011

Are there any updates on this @sarahyurick? I would want to work on it.

abdr17 avatar Jul 18 '25 13:07 abdr17

Hi! No, there has been no work on this.

For more general context, we are currently moving Curator away from Dask in favor of Ray (working branch here if you are interested). This means our current GPU-based fuzzy deduplication is going to look pretty different after the Ray refactor.

I am not aware of any CPU-based fuzzy deduplication discussions happening right now. My main understanding is that since fuzzy deduplication is among the most expensive modules we support (even with GPU acceleration), a CPU version may be quite slow. Maybe @ayushdg can comment further?

sarahyurick avatar Jul 18 '25 16:07 sarahyurick

Adding to what @sarahyurick said, we use special kernels for GPU dedup (GPU minhash, GPU shuffling/transfers, GPU connected components).

Given that the API for these components might differ from CPU variants that do something similar, it might be hard to include the code paths for both CPU and GPU fuzzy dedup within the same class/file.

If you do plan on working on a CPU-based fuzzy dedup, I would recommend exploring it with Ray using the branch @sarahyurick mentioned, and potentially creating new modules/files that handle CPU fuzzy dedup rather than adding that logic to the existing fuzzy dedup code.
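
To make the shape of such a module concrete, here is a minimal, standard-library-only sketch of CPU fuzzy dedup via MinHash signatures plus LSH banding. All names here (`minhash_signature`, `lsh_buckets`, `num_perm`, `bands`) are illustrative and not part of NeMo-Curator's API; a real module would also need the connected-components step to merge overlapping candidate groups.

```python
# Illustrative CPU MinHash + LSH sketch (not NeMo-Curator code).
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Character k-grams of a whitespace-normalized document."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(text, num_perm=64):
    """One min-hash per 'permutation', seeded by the permutation index."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingles(text)
        ))
    return sig


def lsh_buckets(docs, num_perm=64, bands=16):
    """Return groups of doc ids whose signatures collide in >= 1 band."""
    rows = num_perm // bands
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(text, num_perm)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]
```

Even this toy version shows why CPU fuzzy dedup is expensive: the signature step alone is `num_perm` hashes per shingle per document, which the GPU minhash kernels parallelize massively.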

ayushdg avatar Jul 18 '25 17:07 ayushdg

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Aug 18 '25 02:08 github-actions[bot]

By the way, why are we shifting to Ray from Dask? Was there any issue with Dask? @sarahyurick @ayushdg

abdr17 avatar Sep 04 '25 12:09 abdr17

Hi @abdr17, great question. I can provide a high-level explanation here.

Performance was our main motivation for shifting to Ray. With a Ray backend we have seen improved performance across the board, especially in our most compute-heavy modules (fuzzy deduplication, classifiers, etc.). We are planning to share these new benchmarks in upcoming releases.

The main reason we are seeing these improvements is because the Ray backend allows for more control around resource (hardware) usage and data flow.

  • We are able to explicitly assign how many CPUs and GPUs should be reserved for a function (called a ProcessingStage) and reserve them accordingly. This is especially useful for heterogeneous CPU and GPU pipelines, where it lets us reduce overall machine idleness.
  • The data is split up into "batches" small enough to fit into memory (this is similar to Dask partitions). Each batch is run through the curation pipeline independently from the other data batches, meaning that as soon as a batch of data finishes stage A, it can immediately be moved to the queue for stage B, which already has separate hardware reserved for it.
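
The batch-streaming idea above can be sketched with nothing but the standard library (this is not Curator or Ray code, just an illustration): each stage gets its own worker and queue, so a batch that finishes stage A moves straight into stage B's queue without waiting for the other batches.

```python
# Stdlib-only illustration of per-stage batch pipelining (not Ray code).
import queue
import threading


def make_stage(fn, in_q, out_q):
    """Run fn on each batch from in_q, forwarding results to out_q."""
    def worker():
        while True:
            batch = in_q.get()
            if batch is None:      # sentinel: shut down and pass it along
                out_q.put(None)
                break
            out_q.put(fn(batch))
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t


def run_pipeline(batches, stages):
    """Chain stages with queues; batches stream through independently."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [make_stage(fn, queues[i], queues[i + 1])
               for i, fn in enumerate(stages)]
    for b in batches:
        queues[0].put(b)
    queues[0].put(None)
    results = []
    while True:
        out = queues[-1].get()
        if out is None:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

Ray adds what this sketch cannot: each stage can declare its own CPU/GPU reservation, workers can scale out across machines, and batches are spilled or transferred via the object store rather than in-process queues.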

Happy to discuss further or provide more concrete examples here.

sarahyurick avatar Sep 04 '25 16:09 sarahyurick