Fixes for 1.0

Open jlmelville opened this issue 5 years ago • 1 comments

Things I should fix, but which may need a major version change. To be edited and updated as I discover more hidden horrors.

  • min_dist default is 0.01, but should be 0.1 for consistency with Python UMAP. Fortunately, this has no discernible effect on the output.
  • should pca be set by default? Users who throw very high-dimensional data at uwot at the moment are in for a miserable time: at best, Annoy will take hours to complete. At worst, if they are using multi-threading (also a default), Annoy will fail on large datasets because it cannot read back in an index larger than 2GB. I must get back to rnndescent and add rp tree support to provide a replacement/alternative.
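Until the defaults change, both settings can be passed explicitly. The following is a minimal sketch, not part of uwot: the helper name `python_like_args` and the choice of 50 PCA dimensions are assumptions made for illustration, and the `uwot::umap` call only runs if the package is installed.

```r
# Hypothetical helper (not part of uwot): arguments that opt in to the
# settings discussed above instead of relying on the current defaults.
python_like_args <- function(n_col, pca_dims = 50) {
  args <- list(min_dist = 0.1)  # matches the Python UMAP default
  # Only ask for PCA when the input has more columns than pca_dims.
  if (n_col > pca_dims) {
    args$pca <- pca_dims  # 50 is an illustrative choice, not a uwot default
  }
  args
}

set.seed(42)
X <- matrix(rnorm(200 * 100), nrow = 200)  # toy data: 200 rows, 100 columns
args <- python_like_args(ncol(X))

if (requireNamespace("uwot", quietly = TRUE)) {
  # Equivalent to uwot::umap(X, min_dist = 0.1, pca = 50) for this input.
  embedding <- do.call(uwot::umap, c(list(X = X), args))
  print(dim(embedding))
}
```

Reducing the column count before the nearest neighbor search is what shortens the Annoy index build and search; the embedding step itself is unaffected by the input dimensionality.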

jlmelville Apr 09 '19 04:04

As of RcppAnnoy 0.0.15, large indices can now be read back in. Still takes ages to search them, but that's one problem solved.

jlmelville Mar 09 '20 03:03

Other things that could change:

  • init_sdev = "range" by default.
  • if batch = TRUE, we can set n_sgd_threads to the same value as n_threads
  • I almost always ignore the default n_epochs = 500 for small and n_epochs = 200 for large datasets and just use n_epochs = 500 to be consistent. So although that deviates from the Python UMAP default, I'm inclined to change that.
  • If installed, HNSW should be the default nearest neighbor method for everything (if using a suitable metric). Again, it doesn't seem worth the inconsistency of doing exact nearest neighbors for small datasets (especially as it can be slow for high-dimensional datasets like COIL-20, which are nonetheless considered "small").
  • May also want to use rnndescent in preference to RcppAnnoy if it's available but RcppHNSW isn't.

This will need a separate umap2 function to avoid breaking other people's code.
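For reference, the proposed defaults can be approximated today by passing them explicitly to `umap`. This is a rough sketch under stated assumptions: `pick_nn_method` and `proposed_args` are hypothetical helpers (not uwot functions), the metric is assumed to be the default Euclidean so the "suitable metric" check is skipped, and the `nn_method` values `"hnsw"` and `"nndescent"` are assumed to require uwot 0.2.0 or later.

```r
# Hypothetical helper: prefer HNSW, then rnndescent, then Annoy,
# depending on which optional packages are available.
pick_nn_method <- function(
    is_installed = function(pkg) requireNamespace(pkg, quietly = TRUE)) {
  if (is_installed("RcppHNSW")) {
    "hnsw"
  } else if (is_installed("rnndescent")) {
    "nndescent"
  } else {
    "annoy"
  }
}

# Hypothetical helper: the settings listed above as one argument list.
proposed_args <- function(n_threads = 2) {
  list(
    init_sdev = "range",
    batch = TRUE,
    n_threads = n_threads,
    n_sgd_threads = n_threads,  # safe to match n_threads when batch = TRUE
    n_epochs = 500,             # same value regardless of dataset size
    nn_method = pick_nn_method()
  )
}

set.seed(42)
X <- matrix(rnorm(300 * 20), nrow = 300)  # toy data: 300 rows, 20 columns

# Assumption: uwot >= 0.2.0 accepts every nn_method value returned above.
if (requireNamespace("uwot", quietly = TRUE) &&
    utils::packageVersion("uwot") >= "0.2.0") {
  embedding <- do.call(uwot::umap, c(list(X = X), proposed_args()))
  print(dim(embedding))
}
```

Passing `is_installed` as an argument keeps the selection logic easy to check without installing or removing packages.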

jlmelville Mar 18 '24 02:03

#123 adds these changes, including umap2. Also, batch = TRUE is a default. I am likely to set this as version 0.2 when it is submitted to CRAN.

jlmelville Apr 14 '24 03:04