NeMo-Curator
[FEA] Add Sampling-Based Clustering in SemDedup
Description
We should add an option to SemDedup for sampling-based clustering, so that fitting the clustering model stays within GPU memory limits. Specifically, if sample_for_clustering=True, the system should:
- Perform sampling before clustering. The sampling ratio should be configurable, but by default, it should be dynamically inferred at runtime based on available GPU memory to optimize performance.
- Use the sampled data to fit a KMeans model.
- Apply the fitted KMeans model to cluster all of the data.
Because only the sample has to be held in GPU memory while fitting, this approach should improve scalability and efficiency on large datasets.
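One way the default sampling ratio could be inferred is sketched below. This is only an illustration, not an existing NeMo-Curator API: the function name infer_sampling_ratio, the bytes_per_value default (float32 embeddings), and the memory_fraction headroom are all assumptions. The free-memory figure is passed in as a plain number; at runtime it might come from a call such as torch.cuda.mem_get_info().

```python
def infer_sampling_ratio(n_rows, embedding_dim, free_gpu_bytes,
                         bytes_per_value=4, memory_fraction=0.5):
    """Hypothetical helper: fraction of rows whose embeddings fit in the budget.

    bytes_per_value=4 assumes float32 embeddings; memory_fraction leaves
    headroom for the working memory of the clustering step. Both defaults
    are illustrative assumptions, not values taken from the issue.
    """
    if n_rows <= 0 or embedding_dim <= 0:
        raise ValueError("n_rows and embedding_dim must be positive")
    if not 0.0 < memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be in (0, 1]")

    bytes_per_row = embedding_dim * bytes_per_value
    budget_bytes = free_gpu_bytes * memory_fraction
    max_rows = int(budget_bytes // bytes_per_row)

    # Never sample more than the full dataset.
    return min(1.0, max_rows / n_rows)


# Example: 1M embeddings of dimension 1024 with 2 GiB of free GPU memory.
ratio = infer_sampling_ratio(1_000_000, 1024, 2 * 1024**3)
```

A user-supplied ratio would simply bypass this helper. The headroom fraction is the main tuning knob, since fitting needs memory beyond the raw sample.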
Proposed Changes
- Introduce a sample_for_clustering parameter in ClusteringModel to enable sampling-based clustering.
- If sample_for_clustering=True, extract a representative sample from the embeddings dataset before fitting the KMeans model.
- Train KMeans on the sampled embeddings.
- Use the trained model to predict cluster assignments for the full dataset.
- Ensure this functionality is compatible with the current partitioning and memory management strategies.
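The sample, fit, predict flow above can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions, not the ClusteringModel implementation: it uses a small NumPy version of Lloyd's algorithm as a stand-in for the GPU KMeans, represents partitions as a list of in-memory arrays, and the names fit_kmeans, predict, and cluster_with_sampling are hypothetical.

```python
import numpy as np


def predict(points, centroids):
    """Assign each point to its nearest centroid (squared Euclidean distance)."""
    # Builds an (n_points, n_clusters, dim) array, which is fine for a sketch.
    diffs = points[:, None, :] - centroids[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)


def fit_kmeans(sample, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's algorithm, standing in for the real GPU KMeans."""
    if len(sample) < n_clusters:
        raise ValueError("sample has fewer rows than n_clusters")
    rng = np.random.default_rng(seed)
    init = rng.choice(len(sample), size=n_clusters, replace=False)
    centroids = sample[init].astype(float)
    for _ in range(n_iter):
        labels = predict(sample, centroids)
        for k in range(n_clusters):
            members = sample[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return centroids


def cluster_with_sampling(partitions, n_clusters, sampling_ratio, seed=0):
    """Fit on a per-partition random sample, then label every partition."""
    if not 0.0 < sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be in (0, 1]")
    rng = np.random.default_rng(seed)
    sampled = []
    for part in partitions:
        n_take = max(1, int(round(len(part) * sampling_ratio)))
        idx = rng.choice(len(part), size=n_take, replace=False)
        sampled.append(part[idx])
    centroids = fit_kmeans(np.concatenate(sampled), n_clusters, seed=seed)
    # Prediction runs one partition at a time, so the full dataset never
    # has to be resident in memory together.
    return centroids, [predict(part, centroids) for part in partitions]


# Two well-separated blobs split across two partitions.
rng = np.random.default_rng(1)
blob_a = rng.normal(0.0, 1.0, size=(200, 2))
blob_b = rng.normal(100.0, 1.0, size=(200, 2))
partitions = [np.concatenate([blob_a[:100], blob_b[:100]]),
              np.concatenate([blob_a[100:], blob_b[100:]])]
centroids, labels = cluster_with_sampling(partitions, n_clusters=2,
                                          sampling_ratio=0.1)
```

Sampling per partition keeps the sample roughly proportional to each partition's size and avoids a global shuffle. Whether that is representative enough for SemDedup's embeddings is an open question this sketch does not answer.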
Future Direction
Explore the possibility of integrating sampling-based clustering directly within K-Means, eliminating the need for a two-step process.