pykdtree icon indicating copy to clipboard operation
pykdtree copied to clipboard

Compare performance to faiss

Open redhog opened this issue 10 months ago • 6 comments

Add https://github.com/facebookresearch/faiss to the performance comparison matrix

redhog avatar Oct 05 '23 11:10 redhog

It wasn't clear to me what an equivalent workflow would be for using faiss. I don't think the originally benchmarking code in the README was put into git so we might have to come up with an equivalent piece of code.

djhoese avatar Oct 05 '23 13:10 djhoese

This is how I have compared scipy:s cKDTree to Faiss previously:

import faiss
import scipy.spatial
import numpy as np
import pandas as pd
import datetime


d = 64                           # dimension
nb = 100000                      # database size
nq = 10000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

k = 4

fstart = datetime.datetime.now()
for a in range(10):
    index = faiss.IndexFlatL2(d)   # build the index
    index.add(xb)                  # add vectors to the index
    fD, fI = index.search(xq, k)     # actual search
    fend = datetime.datetime.now()
flen = fend - fstart

tstart = datetime.datetime.now()
for a in range(10):
    tree = scipy.spatial.cKDTree(xb)
    tD, tI = tree.query(xq, k)
tend = datetime.datetime.now()
tlen = tend - tstart

print("scipy.spatial.cKDTree", tlen, "faiss.IndexFlatL2", flen)

This should be directly portable to pykdtree:s query() function.

redhog avatar Oct 05 '23 14:10 redhog

To be fair faiss has many other indices than L2, and faiss (and scipy?) are aimed at high dimensional data.

It would be nice to have the comparison code checked into git, say in a docs or contrib folder.

redhog avatar Oct 05 '23 14:10 redhog

Yes, I'm all for putting the code in the repository. Can L2 handle any dimensionality? Pykdtree's README used 3D data.

djhoese avatar Oct 05 '23 15:10 djhoese

Yeah, faiss.IndexFlatL2 can handle any dimension. I think it's uncommon to use faiss for 2d and 3d data, much more common for > 100 dimensions (word embeddings etc). scipy.spatial.cKDTree can also handle multidimensional data, but I think it's more common to use it for spatial (2d, 3d) data, that's at least what I've been using it for.

redhog avatar Oct 06 '23 08:10 redhog

Yeah pykdtree is used heavily by the pyresample and satpy projects I maintain and they are entirely used for Earth geolocation (X, Y, Z). I don't think I'll have time to dedicate to running these comparisons and collecting the statistics for each variation so if someone else has time for this that'd be great. I think the timeit module should be used to run each case multiple time and each should probably be run in a separate script to avoid the memory usage by one persist and affecting the next.

That is of course only testing execution time. It'd be nice to have an estimate on memory usage too.

djhoese avatar Oct 06 '23 15:10 djhoese