Using Custom Distance

Open bhavaygg opened this issue 3 years ago • 3 comments

Hello,

I want to perform a nearest-neighbour search on a dataset of genomes. For this task, the distance between two data points is calculated by a custom script. I looked at the documentation and more or less understood how to plug in my own distance, but I have a question:

  • The data is a set of filenames that I pass to a script that generates the distance. How should I use this data with pynndescent.NNDescent()? Passing a list of strings obviously won't work, and as far as I can tell a dictionary isn't supported either.

bhavaygg avatar May 21 '21 23:05 bhavaygg

I'm afraid that at the moment you really need a 2D vector(ish) representation of data for this library to work. I do have some plans to hopefully one day extend it to allow more custom things, but even then distance functions need to be numba JIT compilable, and loading files is likely not going to meet that requirement. Sorry :-(

lmcinnes avatar May 27 '21 01:05 lmcinnes

Are the vectors used for anything apart from the distance calculation? If not, I could use dictionary keys as the vectors. I would not be loading files in Python but calling a C++ program that returns the distance. Could this work? Thanks anyway!

bhavaygg avatar May 27 '21 01:05 bhavaygg

There is a slim chance you could manage this; the catch is that the distance computation must be numba compilable. In theory you could write a distance function that uses ffi to call C++ functions directly and have that work. I have little experience with integrating ffi and numba, so I don't know how easy that would be. Regardless, I worry that the distance computation is going to be reading from disk on every single call, which is going to be terrifyingly slow and make this approach unlikely to be worthwhile in the long run. To be honest, you are likely to be better off writing C++ to do the brute-force computation (if you write it intelligently, blockwise, you can save a vast number of disk reads).
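
To make the blockwise idea concrete, here is a rough sketch. It is in Python rather than C++ purely for brevity, and everything in it is an assumption for illustration: `blockwise_knn`, `load`, `distance`, `FAKE_FILES` and the block size are hypothetical stand-ins, not part of pynndescent. The point is only that each item is read once per block pair instead of once per distance call.

```python
import heapq


def blockwise_knn(names, load, distance, k, block_size):
    """Exact k-nearest neighbours by brute force, loading items block by block.

    `load(name)` and `distance(x, y)` are caller-supplied (hypothetical here).
    Returns, for each item, a list of (distance, neighbour_index) sorted by distance.
    """
    n = len(names)
    # One bounded heap per item holding (-distance, index): the root is the worst kept neighbour.
    heaps = [[] for _ in range(n)]

    def offer(i, j, d):
        item = (-d, j)
        if len(heaps[i]) < k:
            heapq.heappush(heaps[i], item)
        elif item > heaps[i][0]:
            heapq.heapreplace(heaps[i], item)

    starts = range(0, n, block_size)
    for a in starts:
        # Load the first block once and reuse it against every later block.
        block_a = [load(names[i]) for i in range(a, min(a + block_size, n))]
        for b in starts:
            if b < a:
                continue  # this block pair was already handled with the roles swapped
            if b == a:
                block_b = block_a
            else:
                block_b = [load(names[j]) for j in range(b, min(b + block_size, n))]
            for i, x in enumerate(block_a, start=a):
                for j, y in enumerate(block_b, start=b):
                    if j <= i:
                        continue  # compute each unordered pair only once
                    d = distance(x, y)
                    offer(i, j, d)
                    offer(j, i, d)

    return [sorted((-neg_d, j) for neg_d, j in heap) for heap in heaps]


# Toy stand-ins for "files on disk" and the external distance program (assumptions).
FAKE_FILES = {"g0": 0.0, "g1": 1.0, "g2": 3.0, "g3": 7.0, "g4": 8.0, "g5": 20.0}
load_count = 0


def load(name):
    global load_count
    load_count += 1  # count simulated disk reads
    return FAKE_FILES[name]


def distance(x, y):
    return abs(x - y)


names = sorted(FAKE_FILES)
neighbours = blockwise_knn(names, load, distance, k=2, block_size=3)
print(neighbours[0], "loads:", load_count)
```

With 6 items and blocks of 3, this performs 9 loads, whereas reading both files for each of the 15 pairs would take 30. The larger the block that fits in memory, the fewer reads are needed.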

lmcinnes avatar May 27 '21 16:05 lmcinnes