UMAP on Billions of Data Points + Sample Weighting
Hi everyone,
I’m exploring the use of UMAP on a very large dataset (roughly 1–10 billion rows with 10–15 columns). I’m aware that fitting UMAP directly on data of that size isn’t feasible, so here is my current plan:
- Round or bin the data (e.g., rounding to 1 decimal place or integer bins) to reduce granularity.
- Deduplicate the resulting rows while counting occurrences of each unique row (so each row is associated with a frequency/count).
This brings the dataset down to roughly 10–50 million unique rows; a rough sketch of this step is below.
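For concreteness, the binning/dedup step would look roughly like this. It's only a sketch: the column names are placeholders, and for the full billions of rows this would have to run per chunk (or via something like Dask/Polars) with the partial counts summed afterwards.

```python
import numpy as np
import pandas as pd

def bin_and_count(df, decimals=1):
    """Round every column to `decimals` places, then collapse duplicate
    rows into a single row plus a `count` column."""
    binned = df.round(decimals)
    return (
        binned.groupby(list(binned.columns), sort=False)
              .size()
              .reset_index(name="count")
    )

# Toy example on random data; in practice this runs chunk by chunk and
# the partial counts are aggregated at the end.
df = pd.DataFrame(np.random.randn(100_000, 12),
                  columns=[f"f{i}" for i in range(12)])
deduped = bin_and_count(df, decimals=1)   # columns f0..f11 plus "count"
```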
Next, I want to incorporate the frequencies as sample weights (i.e., heavier weights for more frequent rows).
My question is: What is the best approach to incorporate sample weights into UMAP?
Some ideas I’ve considered include:
- A custom distance metric that factors in sample frequency (a rough sketch of what I mean is included after this list).
- A precomputed distance matrix, although a dense matrix at tens of millions of points would be far beyond any realistic memory budget.
- A custom sampling strategy prior to, or during, UMAP’s neighbor-finding step (also sketched after this list).
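To make the first idea concrete, here is the rough shape of what I had in mind. This is purely hypothetical: the data layout (log of the duplicate count appended as the last column) and the shrink factor are made up for illustration, and it relies on umap-learn accepting a numba-jitted callable as the `metric` argument.

```python
import numba
import numpy as np

@numba.njit()
def count_weighted_euclidean(x, y):
    # Hypothetical layout: x = [f0, ..., f11, log_count]
    d = 0.0
    for i in range(x.shape[0] - 1):   # skip the appended count column
        diff = x[i] - y[i]
        d += diff * diff
    # shrink distances between frequent rows so they attract more neighbors;
    # the 1 / (1 + ...) form is an arbitrary choice for illustration
    shrink = 1.0 / (1.0 + 0.5 * (x[-1] + y[-1]))
    return np.sqrt(d) * shrink
```

This would then be passed as `umap.UMAP(metric=count_weighted_euclidean)`, though I'm unsure whether distorting the geometry this way is sensible.

For the third idea, the simplest pre-UMAP variant I can think of is a frequency-weighted subsample followed by a plain UMAP fit. Again a sketch under my own assumptions: the sample size, sampling without replacement, and the UMAP parameters are arbitrary, and nothing here is built into umap-learn.

```python
import numpy as np
import umap

def weighted_subsample_umap(X, counts, sample_size=2_000_000, seed=42):
    """Draw rows with probability proportional to their duplicate count,
    then fit standard UMAP on the subsample."""
    rng = np.random.default_rng(seed)
    probs = counts / counts.sum()
    idx = rng.choice(len(X), size=min(sample_size, len(X)),
                     replace=False, p=probs)
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=seed)
    return reducer.fit_transform(X[idx])
```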
I’d love to hear any suggestions, best practices, or experiences you’ve had with:
- Scaling UMAP to very large datasets (beyond straightforward sampling).
- Incorporating sample weights effectively in manifold learning.
- Approaches or code snippets that demonstrate custom distance metrics or neighbor selection based on weights.
Thanks in advance for any insights you can share.
I’m hoping this discussion will help me (and others) handle extremely large datasets more effectively with UMAP!
cc @lmcinnes