
add_item on tensorflow tensor is extremely slow


Consider the following code:

from annoy import AnnoyIndex
import tensorflow as tf
from time import perf_counter

tf.compat.v1.enable_eager_execution()

dims = 1792
trees = 10000
features = []

for key in range(0, 100):
    features.append(tf.random.uniform([dims]))

t1 = perf_counter()

t = AnnoyIndex(dims, metric='angular')

for key, feature in enumerate(features):
    t.add_item(key, feature)

t2 = perf_counter()

t.build(trees)

t3 = perf_counter()

print(f"Vector add: {t2 - t1:.2f}")
print(f"Index build: {t3 - t2:.2f}")

It creates a list of 100 tensors, loads them into an Annoy index, and builds the index. This takes about a minute on an Intel Core i5-3570K (3.40 GHz).

However, if the tensors are converted to NumPy arrays first, the same operation takes 0.02 seconds.

The current workaround is to call numpy() on each tensor before passing it to add_item:

    t.add_item(key, feature.numpy())
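A small helper (hypothetical, not part of Annoy) can make the workaround automatic: anything exposing a `.numpy()` method (both TF and PyTorch tensors do) gets converted once, up front, instead of being read element by element inside the binding:

```python
import numpy as np

def as_ndarray(vec):
    # TF and PyTorch tensors both expose .numpy(); plain sequences
    # and existing ndarrays pass through np.asarray unchanged.
    return vec.numpy() if hasattr(vec, "numpy") else np.asarray(vec)
```

Then the loop becomes `t.add_item(key, as_ndarray(feature))` and works for tensors, lists, and arrays alike.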

Tensorflow:

  • Vector add: 57.97
  • Index build: 0.21

Numpy:

  • Vector add: 0.02
  • Index build: 0.20

Any idea as to why this happens?

Versions:

  • Tensorflow: 2.3.0
  • Python: 3.6.7 x64
  • Annoy: 1.16.3
  • Windows 10: Build 19041
  • Numpy 1.19.1

eduard93 avatar Aug 06 '20 16:08 eduard93

convert_list_to_vector maybe?

  for (int z = 0; z < f; z++) {
    PyObject *key = PyInt_FromLong(z);
    PyObject *pf = PyObject_GetItem(v, key);
    (*w)[z] = PyFloat_AsDouble(pf);
    Py_DECREF(key);
    Py_DECREF(pf);
  }

https://github.com/spotify/annoy/blob/master/src/annoymodule.cc#L310
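In Python terms, that C loop is roughly equivalent to the sketch below (an illustration, not Annoy's actual binding): one `__getitem__` call plus one float coercion per dimension. On a framework tensor, every `v[z]` additionally constructs a fresh scalar tensor object, which would explain the per-element overhead:

```python
def convert_list_to_vector(v, f):
    # Rough Python analogue of the C loop above: one __getitem__
    # call and one float conversion per element of the vector.
    return [float(v[z]) for z in range(f)]
```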

eduard93 avatar Aug 06 '20 17:08 eduard93

Maybe relevant?

Upgrading to Numpy 1.19.1 did not help.

eduard93 avatar Aug 06 '20 18:08 eduard93

odd. my guess is that this is something on the tensorflow side. maybe getting it item by item causes some sort of CPU<->GPU transfer that requires a context switch?

erikbern avatar Aug 06 '20 20:08 erikbern

Might be a tensorflow issue. It is definitely not a CPU<->GPU issue as my test rig does not have a GPU.

eduard93 avatar Aug 07 '20 05:08 eduard93

The issue is present in both Tensorflow versions. Tested in Docker.

Tensorflow: 1.15.2 (tensorflow/tensorflow:1.15.2-py3-jupyter):

  • Vector add: 29.82
  • Index build: 0.08

Tensorflow: 2.3.0 (tensorflow/tensorflow:latest-jupyter):

  • Vector add: 28.09
  • Index build: 0.08

Amended the script in the OP by adding:

tf.compat.v1.enable_eager_execution()

eduard93 avatar Aug 07 '20 07:08 eduard93

I experience the same issue using torch.Tensor, even though the tensors are on my CPU.

Here are some benchmarks:


import torch
from annoy import AnnoyIndex

embedding_dim = 4000

embeddings = torch.rand(4000, embedding_dim, dtype=torch.float32)
embeddings.shape

# Timeit with raw tensors on cpu
%%timeit -n 2 -r 5
nn = AnnoyIndex(embedding_dim, metric="angular")

for idx, vector in enumerate(embeddings):
    nn.add_item(idx, vector)
    
nn.build(10)

>>> 16.9 s ± 111 ms per loop (mean ± std. dev. of 5 runs, 2 loops each)


# Timeit with raw tensors converted to numpy
%%timeit -n 2 -r 5
nn = AnnoyIndex(embedding_dim, metric="angular")

for idx, vector in enumerate(embeddings.numpy()):
    nn.add_item(idx, vector)
    
nn.build(10)

>>> 968 ms ± 4.47 ms per loop (mean ± std. dev. of 5 runs, 2 loops each)

It looks like iterating through a NumPy array is in general faster, but I am not sure if this explains the difference.

# Iterate through tensor
%%timeit -n 10 -r 50
for idx, vector in enumerate(embeddings):
    pass

>>> 2.49 ms ± 206 µs per loop (mean ± std. dev. of 50 runs, 10 loops each)

# Iterate through np.array
%%timeit -n 10 -r 50
for idx, vector in enumerate(embeddings.numpy()):
    pass

>>> 216 µs ± 14.6 µs per loop (mean ± std. dev. of 50 runs, 10 loops each)
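The per-row iteration gap above is likely only part of it: Annoy's binding then reads every element of each row individually, so any per-element access cost is paid embedding_dim times per add_item call. A rough micro-benchmark of that access pattern (a sketch using a NumPy array and a plain list as stand-ins; a framework tensor would additionally wrap every `v[z]` in a new tensor object):

```python
import timeit
import numpy as np

dim = 4000
vec_np = np.random.rand(dim)
vec_list = vec_np.tolist()

def extract(v, f=dim):
    # The same element-by-element read pattern as Annoy's binding:
    # one __getitem__ plus one float conversion per dimension.
    return [float(v[z]) for z in range(f)]

t_np = timeit.timeit(lambda: extract(vec_np), number=100)
t_list = timeit.timeit(lambda: extract(vec_list), number=100)
print(f"ndarray: {t_np:.3f}s  list: {t_list:.3f}s")
```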


Maxl94 avatar Feb 24 '23 09:02 Maxl94