uwot
Rolling out uwot's C++ code as a header-only library
I wonder whether this would be of interest: squeezing out the C++ code in here into a separate header-only library, in much the same way that https://github.com/LTLA/CppIrlba contains the relevant contents of irlba. Mostly so that I can use it for other applications without the challenge of dragging in R (or Python) runtimes. You could then chuck the library into `inst/include` and we would be able to share a single implementation with relative ease.
I was planning to give it a go on the weekend. Will need to strip out all the Rcpp stuff; I don't know how pervasive that is. Will also need to add a "no-parallel" option that avoids any calls to `<thread>`, as my target system's support for that is kinda wonky.
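A minimal sketch of what such a switch could look like - the macro name `UWOT_NO_PARALLEL` and the `parallel_for` helper here are assumptions for illustration, not uwot's actual API:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Only pull in <thread> when parallelism is wanted; a system with wonky
// std::thread support can compile with -DUWOT_NO_PARALLEL instead.
#ifndef UWOT_NO_PARALLEL
#include <thread>
#endif

// Apply fn(begin, end) over [0, n), either serially or split across threads.
inline void parallel_for(std::size_t n, std::size_t n_threads,
                         const std::function<void(std::size_t, std::size_t)>& fn) {
#ifdef UWOT_NO_PARALLEL
  fn(0, n); // serial fallback: no <thread> symbols referenced at all
#else
  std::vector<std::thread> workers;
  const std::size_t chunk = (n + n_threads - 1) / n_threads;
  for (std::size_t t = 0; t < n_threads; ++t) {
    const std::size_t begin = t * chunk;
    const std::size_t end = std::min(n, begin + chunk);
    if (begin < end) workers.emplace_back(fn, begin, end);
  }
  for (auto& w : workers) w.join();
#endif
}
```

The point of guarding the `#include` itself is that a compiler with a broken thread runtime never even sees the header.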
To the extent it's possible, that's a good idea (I subsequently did a better job of keeping the R-specific parts separate in rnndescent). This is something I had thought about doing at some point in the future, but the main reason I never took it more seriously is that the pure C++ parts aren't very useful on their own (also, I don't know any CMake). The nearest neighbor calculations and initialization all have to be provided separately, so you're really just getting the optimization bit. If that's useful to you, I'm happy to provide what assistance I can.
Yep, I was going to supply the NNs myself (https://github.com/LTLA/knncolle). On a tangentially related note, it would be nice to make a pure C++ port of nndescent available from that interface. I'd be happy to help out there if you're interested.
The initialization... is within the realm of feasibility. I could link to Spectra, or modify CppIrlba to handle `smallest = TRUE`. Not quite sure which one is less work - I'll have to try it out. Was there a reason for the use of Spectra as the default?
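For what it's worth, one route to a `smallest = TRUE` mode that reuses a largest-eigenvalue solver: for a symmetric matrix A whose eigenvalues are bounded above by a shift σ, the largest eigenvalue of σI − A corresponds to the smallest eigenvalue of A. A toy fixed-size power-iteration sketch of that identity (purely an illustration, not CppIrlba's implementation):

```cpp
#include <array>
#include <cmath>
#include <cstddef>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>;

// Power iteration for the dominant eigenvalue of a symmetric 3x3 matrix.
inline double largest_eigenvalue(const Mat3& m, int iters = 500) {
  Vec3 v{1.0, 1.0, 1.0};
  for (int it = 0; it < iters; ++it) {
    Vec3 w{0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < 3; ++i)
      for (std::size_t j = 0; j < 3; ++j)
        w[i] += m[i][j] * v[j];
    const double norm = std::sqrt(w[0] * w[0] + w[1] * w[1] + w[2] * w[2]);
    for (std::size_t i = 0; i < 3; ++i) v[i] = w[i] / norm;
  }
  double lambda = 0.0; // Rayleigh quotient v' M v of the converged vector
  for (std::size_t i = 0; i < 3; ++i)
    for (std::size_t j = 0; j < 3; ++j)
      lambda += v[i] * m[i][j] * v[j];
  return lambda;
}

// Smallest eigenvalue of A via the shifted matrix sigma*I - A: its
// eigenvalues are sigma - lambda_i(A), so its largest maps to A's smallest.
inline double smallest_eigenvalue(const Mat3& a, double sigma) {
  Mat3 b{};
  for (std::size_t i = 0; i < 3; ++i)
    for (std::size_t j = 0; j < 3; ++j)
      b[i][j] = (i == j ? sigma : 0.0) - a[i][j];
  return sigma - largest_eigenvalue(b);
}
```

The same shift trick is how "smallest eigenvectors of the graph Laplacian" problems are often fed to solvers that only chase the largest end of the spectrum.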
Anyway, testing out the initialization is probably a solid weekend project on my side. If you have the bandwidth, maybe you could reorganize the stuff across `src/` and `inst/include` to create a pure C++ interface to your optimization code. Then we might eventually be able to plug and play with all three components (NN, init, optim).
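The plug-and-play split might look something like this - every type and function name below is a hypothetical placeholder for the eventual interface, with stubbed bodies:

```cpp
#include <cstddef>
#include <vector>

// Output of the NN component (e.g. from knncolle): per-point neighbor
// indices and distances.
struct Neighbors {
  std::vector<std::vector<int>> index;
  std::vector<std::vector<double>> distance;
};

// Row-major n x ndim coordinate matrix.
using Embedding = std::vector<double>;

// init component: produce starting coordinates (spectral, random, ...).
inline Embedding initialize(std::size_t n, std::size_t ndim) {
  return Embedding(n * ndim, 0.0); // stub: zeros stand in for a real init
}

// optim component: refine coordinates in place given the neighbor graph.
inline void optimize(const Neighbors& nn, Embedding& embedding, int n_epochs) {
  (void)nn;
  (void)embedding;
  (void)n_epochs; // stub: the SGD loop over graph edges would live here
}
```

The appeal of this shape is that each stage only communicates through plain containers, so any of the three pieces can be swapped out independently.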
> Yep, I was going to supply the NNs myself (https://github.com/LTLA/knncolle). On a tangentially related note, it would be nice to make a pure C++ port of nndescent available from that interface. Would be happy to help out there if you're interested.
That can happen too... eventually.
> The initialization... is within the realm of feasibility. Could link to Spectra, or could modify CppIrlba to handle `smallest = TRUE`. Not quite sure which one is less work - will have to try it out. Was there a reason for the use of Spectra as the default?
At the time, the irlba `partial_eigen` function was described as "somewhat experimental" and in practice was a lot slower than using RSpectra. Maybe that's changed now.
> Anyway, testing out the initialization is probably a solid weekend project on my side. If you have the bandwidth, maybe you could reorganize the stuff across `src/` and `inst/include` to create a pure C++ interface to your optimization code. Then we might eventually be able to plug and play with all three components (NN, init, optim).
Not sure about timelines, but I will start taking a look to see whether this seems achievable in a reasonable amount of time, or whether it's going to reveal that larger structural changes will be required.
I failed to make any progress this weekend, but it is closer to the top of my to-do pile.
No worries. I failed to make any progress as well - I got distracted by https://github.com/LTLA/qdtsne.
Made a start on the initialization: https://github.com/LTLA/umappp.
The most that can be said right now is that it compiles and runs.
Sorry I have made zero contributions to this so far. I was traveling for the last two weeks and had little to no internet access.
No problems whatsoever - it is, in fact, already done! The code in uwot's `inst/include` was easier to read than I thought, so it was fairly straightforward to get what I needed. Check it out:
Close enough, I'd say. I know we're identical up to the optimization, so I'm guessing that the differences are due to our different PRNGs - I'm using `std::mt19937_64` to avoid the need to manage another dependency.
I didn't add any of the other bells and whistles, e.g., no support for supervised training, no support for `tumap` or `largevis`. I don't need them personally, but I could work on that if we wanted to turn uwot into an R wrapper around a fully-featured C++ library. Interested to hear your thoughts here - I don't mind either way.
In the meantime, I'll post a few more issues on things I discovered along the way.
> Close enough, I'd say. I know we're identical up to the optimization, so I'm guessing that the differences are due to our different PRNGs - I'm using `std::mt19937_64` to avoid the need to manage another dependency.
Does changing the PRNG away from the Tausworthe88 have an effect on the speed?
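For context, here are the two generators being compared - a minimal taus88 (L'Ecuyer's three-component Tausworthe) next to `std::mt19937_64`. This is a sketch of the core update only; how uwot actually seeds and consumes its generator may differ:

```cpp
#include <cstddef>
#include <cstdint>
#include <random>

// Three-component Tausworthe generator (taus88, L'Ecuyer 1996).
// Seeds must satisfy s1 > 1, s2 > 7, s3 > 15.
struct Taus88 {
  std::uint32_t s1, s2, s3;
  std::uint32_t operator()() {
    std::uint32_t b;
    b = ((s1 << 13) ^ s1) >> 19; s1 = ((s1 & 0xFFFFFFFEu) << 12) ^ b;
    b = ((s2 << 2)  ^ s2) >> 25; s2 = ((s2 & 0xFFFFFFF8u) << 4)  ^ b;
    b = ((s3 << 3)  ^ s3) >> 11; s3 = ((s3 & 0xFFFFFFF0u) << 17) ^ b;
    return s1 ^ s2 ^ s3;
  }
};

// In the optimizer the PRNG's job is just picking negative-sample vertex
// indices; modulo is slightly biased but fine for a sketch.
inline std::size_t sample_index_taus(Taus88& rng, std::size_t n) {
  return rng() % n;
}

inline std::size_t sample_index_mt(std::mt19937_64& rng, std::size_t n) {
  return rng() % n;
}
```

The taus88 step is a handful of shifts and XORs per draw, while `std::mt19937_64` has a much larger state (312 64-bit words) and a periodic twist, so a speed difference in a tight sampling loop is at least plausible.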
> I didn't add any of the other bells and whistles, e.g., no support for supervised training, no support for `tumap` or `largevis`. I don't need them personally, but I could work on that if we wanted to turn uwot into an R wrapper around a fully-featured C++ library. Interested to hear your thoughts here - I don't mind either way.
It would be a shame not to use umappp if possible. I wouldn't want to weigh it down with features that not many people care about (I have zero idea whether anyone makes use of `tumap` or `largevis`), but I also don't want to keep track of two separate but similar C++ code bases (although they haven't actually changed very much).
`tumap` is great 😅