uwot
Fixes for 1.0
Things I should fix, but which may need a major version change. To be edited and updated as I discover more hidden horrors.
- `min_dist` default is `0.01`, but should be `0.1` for consistency with Python UMAP. Fortunately, this has no discernible effect on the output.
- Should `pca` be set by default? If users attempt to throw very high dimensional data at `uwot` at the moment, they are in for a miserable time, because at best Annoy will take hours to complete. At worst, if they are using multi-threading (also a default), Annoy will fail on large datasets due to not being able to read back in an index larger in size than 2GB. I must get back to rnndescent and add rp tree support to provide a replacement/alternative.
As of RcppAnnoy 0.0.15, large indices can now be read back in. Still takes ages to search them, but that's one problem solved.
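
Until the defaults change, both settings can be passed explicitly. The following is a minimal sketch, assuming `uwot` is installed; the random matrix `X` and the value `pca = 100` are made up for illustration and are not recommendations from these notes.

```r
# Sketch only: pass the proposed defaults explicitly.
# `X` is made-up random data standing in for a high-dimensional dataset.
set.seed(42)
X <- matrix(rnorm(200 * 500), nrow = 200) # 200 observations, 500 columns

if (requireNamespace("uwot", quietly = TRUE)) {
  embedding <- uwot::umap(
    X,
    min_dist = 0.1, # Python UMAP's default, rather than uwot's 0.01
    pca = 100       # illustrative: reduce to 100 dimensions before the neighbor search
  )
  print(dim(embedding))
}
```

Reducing dimensionality with `pca` before the nearest neighbor search is what keeps the Annoy index small for very wide data.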
Other things that could change:
- `init_sdev = "range"` by default.
- If `batch = TRUE`, we can set `n_sgd_threads` to the same value as `n_threads`.
- I almost always ignore the default `n_epochs = 500` for small and `n_epochs = 200` for large datasets and just use `n_epochs = 500` to be consistent. So although that deviates from the Python UMAP default, I'm inclined to change that.
- If installed, HNSW should be the default nearest neighbor method for everything (if using a suitable metric). Again, it doesn't seem worth the inconsistency of doing exact nearest neighbors for small datasets (especially as it can be slow for high-dimensional datasets like COIL-20, which are nonetheless considered "small").
- May also want to use `rnndescent` in preference to `RcppAnnoy` if it's available but `RcppHNSW` isn't.
This will need a separate `umap2` function to avoid breaking other people's code.
#123 adds these changes, including `umap2`. Also `batch = TRUE` is a default. I am likely to set this in version 0.2 when submitted to CRAN.
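
The settings listed under "Other things that could change" can also be passed to the existing `umap` function by hand. The sketch below is a rough illustration, not how `umap2` is implemented: the helper names `choose_nn_method` and `proposed_args` are hypothetical, and the "suitable metric" check is left out for brevity. It assumes a `uwot` version that accepts `"hnsw"` and `"nndescent"` as `nn_method` values.

```r
# Sketch only: the helper names below are hypothetical, not part of uwot.

# Pick a nearest neighbor method in the order of preference described above:
# HNSW if RcppHNSW is installed, else rnndescent, else fall back to Annoy.
# `installed` is injectable so the logic can be checked without the packages.
choose_nn_method <- function(
    installed = function(pkg) requireNamespace(pkg, quietly = TRUE)) {
  if (installed("RcppHNSW")) {
    "hnsw"
  } else if (installed("rnndescent")) {
    "nndescent"
  } else {
    "annoy"
  }
}

# Collect the proposed settings as a named list of umap() arguments.
proposed_args <- function(
    n_threads = 1,
    installed = function(pkg) requireNamespace(pkg, quietly = TRUE)) {
  list(
    init_sdev = "range",
    batch = TRUE,
    n_threads = n_threads,
    n_sgd_threads = n_threads, # match n_threads, since batch = TRUE
    n_epochs = 500,            # one value regardless of dataset size
    nn_method = choose_nn_method(installed)
  )
}

# Usage, assuming uwot is installed and X is a numeric matrix:
# embedding <- do.call(uwot::umap, c(list(X), proposed_args(n_threads = 4)))
```

Keeping the settings in a plain list makes it easy to override a single value before the call, for example `modifyList(proposed_args(), list(n_epochs = 200))`.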