NeMo-Curator

[FEA] Add Sampling-Based Clustering in SemDedup

Open VibhuJawa opened this issue 9 months ago • 0 comments

Description

We should add an option to SemDedup to fit the clustering model on a sample of the data, so that clustering stays within GPU memory constraints. Specifically, if sample_for_clustering=True, the system should:

  1. Perform sampling before clustering. The sampling ratio should be configurable, but by default, it should be dynamically inferred at runtime based on available GPU memory to optimize performance.
  2. Use the sampled data to fit a KMeans model.
  3. Apply the fitted KMeans model to cluster all of the data.

Because KMeans is fitted only on the sample, the memory needed for the fit no longer grows with the full dataset, which should let SemDedup scale to larger datasets.
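One way the default sampling ratio could be inferred is sketched below. This is only an illustration, not NeMo Curator code: the function name `infer_sample_ratio`, its parameters, and the `safety_factor` headroom are all assumptions, and the free GPU memory is passed in as a plain number rather than queried from a device.

```python
def infer_sample_ratio(n_rows, embedding_dim, free_bytes,
                       dtype_bytes=4, safety_factor=0.5):
    """Hypothetical helper: largest fraction of rows that fits a memory budget.

    free_bytes is assumed to be the free GPU memory reported at runtime.
    safety_factor reserves headroom for KMeans working memory (centroids,
    distance buffers); 0.5 is an arbitrary illustrative value.
    """
    if n_rows <= 0 or embedding_dim <= 0:
        raise ValueError("n_rows and embedding_dim must be positive")
    if not 0.0 < safety_factor <= 1.0:
        raise ValueError("safety_factor must be in (0, 1]")

    bytes_per_row = embedding_dim * dtype_bytes   # one embedding vector
    budget = free_bytes * safety_factor           # memory we allow the sample to use
    max_rows = int(budget // bytes_per_row)       # rows that fit in the budget

    # Never sample more than the whole dataset.
    return min(1.0, max_rows / n_rows)


# Example: 8 GiB free, 1024-dimensional float32 embeddings.
ratio = infer_sample_ratio(n_rows=4_194_304, embedding_dim=1024,
                           free_bytes=8 * 1024**3)
print(ratio)
```

A user-supplied ratio would simply bypass this calculation, which matches the requirement that the ratio be configurable with a runtime-inferred default.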

Proposed Changes

Introduce a sample_for_clustering parameter in ClusteringModel to enable sampling-based clustering.

  1. If sample_for_clustering=True, extract a representative sample from the embeddings dataset before fitting the KMeans model.
  2. Train KMeans on the sampled embeddings.
  3. Use the trained model to predict cluster assignments for the full dataset.
  4. Ensure this functionality is compatible with the current partitioning and memory management strategies.
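The fit-on-sample, predict-on-all flow in the steps above can be sketched as follows. This is a minimal CPU illustration using NumPy and a hand-written Lloyd's iteration, not the GPU KMeans that SemDedup would actually use; `cluster_with_sampling`, `fit_kmeans`, `batch_size`, and the uniform random sampling are all assumptions made for the example.

```python
import numpy as np


def _assign(points, centroids):
    # Squared Euclidean distance from every point to every centroid.
    d2 = ((points ** 2).sum(axis=1)[:, None]
          - 2.0 * points @ centroids.T
          + (centroids ** 2).sum(axis=1)[None, :])
    return d2.argmin(axis=1)


def fit_kmeans(sample, n_clusters, n_iter=20, seed=0):
    # Plain Lloyd's algorithm with random initial centroids drawn from the sample.
    rng = np.random.default_rng(seed)
    init = rng.choice(len(sample), size=n_clusters, replace=False)
    centroids = sample[init].copy()
    for _ in range(n_iter):
        labels = _assign(sample, centroids)
        for k in range(n_clusters):
            members = sample[labels == k]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids


def cluster_with_sampling(embeddings, n_clusters, sample_ratio,
                          batch_size=4096, seed=0):
    rng = np.random.default_rng(seed)
    n_rows = len(embeddings)

    # Step 1: draw a uniform random sample (at least n_clusters rows).
    n_sample = min(n_rows, max(n_clusters, int(n_rows * sample_ratio)))
    idx = rng.choice(n_rows, size=n_sample, replace=False)

    # Step 2: fit KMeans on the sample only.
    centroids = fit_kmeans(embeddings[idx], n_clusters, seed=seed)

    # Step 3: assign every row to its nearest centroid, one batch at a time,
    # so the full dataset never has to be processed in a single pass.
    labels = np.empty(n_rows, dtype=np.int64)
    for start in range(0, n_rows, batch_size):
        stop = start + batch_size
        labels[start:stop] = _assign(embeddings[start:stop], centroids)
    return centroids, labels
```

In a partitioned setting, the batch loop in step 3 would presumably map onto the existing per-partition processing, with only the fitted centroids shared across partitions.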

Future Direction

Explore integrating sampling-based clustering directly into KMeans, which would eliminate the need for a two-step process.

VibhuJawa avatar Feb 11 '25 19:02 VibhuJawa