pynndescent
Using Custom Distance
Hello,
I want to perform a nearest-neighbor search on a dataset of genomes. For this task, the distance between two data points is calculated by a custom script. I looked at the documentation and more or less understood how to incorporate my own distance, but I have one doubt:
- The data is a set of filenames that I pass to a script that generates the distance. How should I use this data with pynndescent.NNDescent()? Entering a list of strings obviously won't work, and if I am not wrong a dictionary wouldn't be supported either.
I'm afraid that at the moment you really need a 2D vector(ish) representation of data for this library to work. I do have some plans to hopefully one day extend it to allow more custom things, but even then distance functions need to be numba JIT compilable, and loading files is likely not going to meet that requirement. Sorry :-(
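For the case this reply does cover, a 2D array plus a numba-compilable distance, a minimal sketch might look like the following. It assumes each item has already been reduced to a fixed-length numeric vector, which the file-based setup in the question does not provide. `manhattan` and `build_index` are illustrative names, not part of pynndescent, and the numba and pynndescent imports sit inside `build_index` so the distance can be checked as plain Python first.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences.

    Written with a plain loop and no Python objects so numba can compile it.
    """
    total = 0.0
    for i in range(len(a)):
        total += abs(a[i] - b[i])
    return total


def build_index(data, n_neighbors=10):
    """Build an NNDescent index over a 2D float array using the distance above.

    Assumes numba and pynndescent are installed and that `data` has shape
    (n_samples, n_features).
    """
    import numba
    from pynndescent import NNDescent

    # A callable metric has to be numba JIT-compiled before it is passed in.
    jitted = numba.njit(fastmath=True)(manhattan)
    return NNDescent(data, metric=jitted, n_neighbors=n_neighbors)
```

Calling `build_index(X).neighbor_graph` on a NumPy array `X` should then give the neighbor indices and distances.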
Are the vectors used for anything apart from the distance calculation? If not, I could use dictionary keys as the vectors. I would not be loading files in Python but calling a C++ program that returns the distance. Could this work? Thanks anyway!
There is a slim chance you could manage this; the catch is that the distance computation must be numba compilable. In theory you could write a distance function that uses FFI to call C++ functions directly and have that work. I have little experience with integrating FFI and numba, so I don't know how easy that would be. Regardless, I worry that the distance computation is going to read from disk on every single call, which will be terrifyingly slow and make this approach unlikely to be worthwhile in the long run. To be honest, you are likely to be better off writing C++ to do the brute-force computation (if you write it intelligently, blockwise, you can save a vast number of disk reads).
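To make the blockwise idea concrete, here is a rough sketch of the loop structure. It is in Python only for illustration, since the suggestion above is to write it in C++. `load` and `dist` are hypothetical stand-ins for reading one file and for the custom distance; the sketch assumes the distance is symmetric and the names are unique. Each block is loaded once per block pair instead of once per item pair.

```python
import heapq


def blockwise_knn(names, load, dist, k, block_size):
    """Exact k nearest neighbors by brute force, loading items block by block."""
    blocks = [names[i:i + block_size] for i in range(0, len(names), block_size)]
    # One heap per item holding (-distance, neighbor); the root is the worst kept.
    best = {name: [] for name in names}

    def offer(src, dst, d):
        heap = best[src]
        item = (-d, dst)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)

    for bi, block_a in enumerate(blocks):
        loaded_a = {n: load(n) for n in block_a}
        for block_b in blocks[bi:]:
            same = block_b is block_a
            loaded_b = loaded_a if same else {n: load(n) for n in block_b}
            for ia, na in enumerate(block_a):
                # Within a single block, visit each unordered pair only once.
                others = block_a[ia + 1:] if same else block_b
                for nb in others:
                    d = dist(loaded_a[na], loaded_b[nb])
                    offer(na, nb, d)
                    offer(nb, na, d)

    # Return neighbors sorted from nearest to farthest as (distance, name).
    return {n: sorted((-neg, m) for neg, m in heap) for n, heap in best.items()}


# Toy usage: an in-memory dict plays the role of the files on disk.
toy = {"a": 0, "b": 1, "c": 3, "d": 7, "e": 8}
load_calls = []


def toy_load(name):
    load_calls.append(name)
    return toy[name]


result = blockwise_knn(list(toy), toy_load, lambda x, y: abs(x - y),
                       k=2, block_size=2)
```

With larger blocks, fewer loads are needed overall, at the cost of holding more items in memory at once.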