annoy
Crashes when used within AWS Lambda Python 3.6
Runtime
Annoy 1.16.0 on the AWS Lambda Python 3.6 runtime
Problem
Annoy causes an interpreter crash. AWS continues to use the same Lambda container for subsequent calls, but the container keeps failing each time.
More details
- Index has 1300 items
- The vectors are 150 in length
- Configured with 10 trees
- Using the angular metric
- The index is packaged with the Lambda code (a minimal reconstruction is sketched below)
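For reference, roughly how an index with these parameters would be built and saved (the data here is a random stand-in):

```python
import random
from annoy import AnnoyIndex

DIM, N_ITEMS, N_TREES = 150, 1300, 10
index = AnnoyIndex(DIM, 'angular')
for i in range(N_ITEMS):
    index.add_item(i, [random.gauss(0, 1) for _ in range(DIM)])
index.build(N_TREES)
index.save('index.ann')  # this file is what gets packaged with the Lambda code
```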
The problem seems to occur when calling any of:
- get_nns_by_item
- get_distance
- get_nns_by_vector (did not test, but I assume it follows the same code path that causes the issue)

Nothing is printed to the console. The Lambda just terminates with a "Process exited before completing request" message in the logs.
I tried to see if I could identify an issue with loading the Annoy index by calling:
- get_n_items
- get_n_trees
- get_item_vector

All of these worked fine.
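Roughly, the checks that did succeed (the path here is illustrative):

```python
from annoy import AnnoyIndex

idx = AnnoyIndex(150, 'angular')
idx.load('index.ann')
print(idx.get_n_items())       # returns 1300 as expected
print(idx.get_n_trees())       # returns 10 as expected
print(idx.get_item_vector(0))  # returns a 150-dim vector without crashing
```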
The same packaged code also yields Lambda instances that work fine. It seems that once an instance is able to successfully call get_nns_by_item, it just keeps working. Instances that fail on a call to get_nns_by_item never recover. Note that since the interpreter crashes, the next time Lambda invokes the same instance/container, it goes through the entire cold-start process and loads the index again.
The most unfortunate part is that since there is no way of trapping the problem in the Python code (or at least I have not found one), the Lambda exits but AWS continues to reuse the instance. If we could catch these errors at the Python level, I could throw a RuntimeError to force the destruction of the instance.
Any help is appreciated.
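One idea I have not tried yet (a hypothetical sketch; names and the timeout are made up): isolate the query in a child process so that a native crash surfaces as an exit code the parent can turn into a RuntimeError:

```python
import multiprocessing as mp

def _query(index_path, item, n, conn):
    from annoy import AnnoyIndex
    idx = AnnoyIndex(150, 'angular')  # dimension/metric from the setup above
    idx.load(index_path)
    conn.send(idx.get_nns_by_item(item, n))
    conn.close()

def safe_get_nns_by_item(index_path, item, n, timeout=10):
    parent, child = mp.Pipe(duplex=False)
    p = mp.Process(target=_query, args=(index_path, item, n, child))
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.terminate()
        p.join()
    if p.exitcode != 0:  # a segfault shows up as a negative exit code
        raise RuntimeError('annoy query crashed (exit code %s)' % p.exitcode)
    return parent.recv()
```

Reloading the index in the child on every call is wasteful, but it may be the simplest form of isolation available.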
Just a random thought, but if you truly only have 1300 items, why do you even need Annoy? Just stick it in a numpy array and do an exhaustive search.
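For illustration (the names here are made up), an exhaustive search matching Annoy's angular metric is a few lines of numpy, and at 1300 × 150 it is a tiny matrix-vector product:

```python
import numpy as np

def exhaustive_nns(vectors, query, n):
    # vectors: (n_items, 150) array; normalizing rows means cosine similarity
    # (the basis of Annoy's angular distance) reduces to a dot product.
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return np.argsort(-(v @ q))[:n]  # indices of the n most similar items
```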
@erikbern Today 1300, tomorrow 10x more.
Anyway, I downgraded to 1.15.2 and so far no crashes. Fingers crossed.
Ok, that’s odd that a previous version resolved the issue.
You haven’t been able to reproduce it locally?
I have not been able to trigger it locally. I package and test within a Docker container running images from lambci.
ok, can you reproduce it inside the docker container locally?
I am having this exact same issue with a different NNS library: https://github.com/nmslib/hnswlib. Believing the problem to be in hnswlib, I was looking for alternatives when I came across this issue.
I now believe the problem to be with Lambda and using C++ dependencies in Python.
We are observing similar symptoms on Google App Engine. It works fine for a while, and then crashes on get_nns_by_vector. Nothing in the logs, just nginx reporting that the upstream crashed.
Just had the exact same case here, though not using AWS or App Engine. I have a Python webapp running in a Docker container (image built on macOS, where it works fine), but when I switched to GitHub Actions to perform the Docker build, a call to get_nns_by_vector makes the process crash (core dumped).
If anyone has any easy reproduction steps, would appreciate it!
@erikbern I have this problem with Lambda too, with version 1.17+ but not with 1.15.2. I've seen that this flag was removed from the setup in 1.17+:
https://github.com/spotify/annoy/blob/v1.15.2/setup.py#L42
```python
if os.environ.get('TRAVIS') == 'true':
    # Resolving some annoying issue
    extra_compile_args += ['-mno-avx']
```
Before seeing this, we recompiled 1.17, changing https://github.com/spotify/annoy/blob/master/setup.py#L43 to:

```python
cputune = ['-march=haswell',]
```
It looks like it is a problem with AVX on Lambda; the fix is the compile-time flag recommended here: https://docs.aws.amazon.com/lambda/latest/dg/runtimes-avx2.html
Got it. It would be nicer if Annoy could detect the presence of AVX at runtime, but that's a bit complex unfortunately.
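For reference, a runtime check at the Python level is easy enough on Linux (an illustrative sketch only; the actual dispatch would have to happen in the compiled C++ code):

```python
def cpu_supports_avx():
    # Linux-only sketch: look for the 'avx' flag in /proc/cpuinfo.
    try:
        with open('/proc/cpuinfo') as f:
            for line in f:
                if line.startswith('flags'):
                    return 'avx' in line.split()
    except OSError:
        pass
    return False
```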
Yes. Anyway, I think it would be good to add back the removed check, or add a disclaimer in the README telling people who want to use it on Lambda or similar environments to recompile with the flag.