
nested loop sequential or parallelization runtime

Open lyelibi opened this issue 3 years ago • 1 comment

I have recently started using pynndescent to build nearest neighbor graphs. It is much faster and more stable than the scikit-learn nearest neighbor graph alternative. The problem I am trying to solve is about scale: I generate a data-set and build a nearest neighbor graph from each pair of its columns, which means that for a 500 x 50 matrix (the example below) I have to compute 50*(50-1)/2 = 1225 nearest neighbor graphs, i.e. 1225 calls to pynndescent. This takes 9 seconds on my machine, which is very limiting because I would like to do this for significantly larger data-sets (e.g. 1000 columns instead of 50).

Are there ways to optimize this, or to do something other than a for loop? I flatten the column pairs and loop over them sequentially instead of using a naive nested loop, but Python threading and multiprocessing haven't produced better results than the solution below (a joblib-based sketch of such a parallel attempt is included after the loop for reference).

    import itertools
    import numpy as np
    from pynndescent import NNDescent

    data = np.random.normal(0, 1, (500, 50))

    # all 50*(50-1)/2 = 1225 pairs of columns
    idx = list(itertools.combinations(range(data.shape[1]), 2))
    count = len(idx)

    s = np.empty(count, dtype=np.float16)

    for i in range(count):
        # sum of the neighbor distances of the 5-NN graph on the 2-column projection
        s[i] = NNDescent(data[:, idx[i]], metric='euclidean', n_neighbors=5).neighbor_graph[1].sum()
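
For reference, here is a minimal sketch of the kind of process-based parallel attempt mentioned above, using joblib (this is not from the original post; the pair_score helper and the n_jobs setting are illustrative, and process start-up plus per-worker numba compilation may be part of why such attempts do not beat the sequential loop):

    import itertools
    import numpy as np
    from joblib import Parallel, delayed
    from pynndescent import NNDescent

    def pair_score(data, cols):
        # Hypothetical helper (not part of the original post): sum of neighbor
        # distances of the 5-NN graph on the 2-column projection `cols`.
        return NNDescent(data[:, list(cols)], metric='euclidean',
                         n_neighbors=5).neighbor_graph[1].sum()

    data = np.random.normal(0, 1, (500, 50))
    idx = list(itertools.combinations(range(data.shape[1]), 2))

    # one task per column pair; each task is small, so scheduling overhead can dominate
    s = np.array(Parallel(n_jobs=-1)(delayed(pair_score)(data, cols) for cols in idx))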

lyelibi avatar Jun 03 '22 05:06 lyelibi

Off the top of my head I don't see any obvious ways to improve this significantly. There is just a lot of computational work to be done, and I'm not sure there are ways around that. You would have to look into deeper algorithmic approaches to cut down on the work, I think.

lmcinnes avatar Jun 04 '22 18:06 lmcinnes