
Crashes when used within AWS Lambda Python 3.6

Open lbustelo opened this issue 4 years ago • 12 comments

Runtime

  • Annoy 1.16.0
  • AWS Lambda Python 3.6 Runtime

Problem

Annoy causes an interpreter crash. AWS just continues to use the same Lambda container for subsequent calls, but the container just continues to fail each time.

More details

  • Index has 1300 items
  • The vectors are 150 in length
  • Configured at 10 trees
  • Using angular metric
  • The index is packaged with the Lambda code

Problem seems to take place when calling either

  • get_nns_by_item
  • get_distance
  • get_nns_by_vector (did not test, but I'm assuming same code path that causes issue)

Nothing is printed to the console. The Lambda just terminates with a "Process exited before completing request" message in the logs.

I tried to see if I can identify an issue when loading the Annoy index by calling

  • get_n_items
  • get_n_trees
  • get_item_vector

but all worked fine.

The same packaged code also yields Lambda instances that work fine. It seems that once a Lambda instance is able to successfully call get_nns_by_item, it just works. Instances that fail on a call to get_nns_by_item never recover. Note that since the interpreter crashes, the next time Lambda invokes the same instance/container, it goes through the entire Cold Start process and loads the index again.

The most unfortunate part of this is that, since there is no way of trapping the problem in the Python code (or at least I have not found one), the Lambda exits but AWS continues to reuse the instance. If we could catch these errors at the Python level, I could throw a RuntimeError to force the destruction of the instance.
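One way to make such a crash visible to Python is to run the native call in a child process, so that a fatal signal kills only the child and the parent can raise. The sketch below is an illustration and has not been tested on Lambda: `run_isolated` is a hypothetical helper, and it assumes a POSIX system, where a negative return code means the child was killed by a signal. The `-X faulthandler` option asks the child to dump a Python traceback to stderr before it dies.

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=30):
    """Run `code` in a fresh Python child process and return its stdout.

    If native code in the child crashes (SIGSEGV, SIGILL, ...), only the
    child dies; the parent sees a negative return code and raises.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    stderr = proc.stderr.decode(errors="replace")
    if proc.returncode < 0:
        # On POSIX, -N means "terminated by signal N".
        sig = signal.Signals(-proc.returncode).name
        raise RuntimeError("child killed by {}:\n{}".format(sig, stderr))
    if proc.returncode != 0:
        raise RuntimeError("child failed:\n{}".format(stderr))
    return proc.stdout.decode()
```

In the handler, `code` would be a small script that loads the index and runs the query (for example with `AnnoyIndex.load` and `get_nns_by_item`). Starting a child per request reloads the index each time, so this trades latency for the ability to raise a RuntimeError.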

Any help is appreciated.

lbustelo avatar Sep 16 '19 19:09 lbustelo

Just a random thought but if you truly only have 1300 items then why do you even need Annoy? Just stick it in a numpy array and do an exhaustive search
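A minimal sketch of what such an exhaustive search could look like, assuming the vectors fit in one 2-D numpy array and contain no all-zero rows. `nearest_by_cosine` is a hypothetical name. Ranking by cosine similarity is used here as a stand-in for the angular metric, since both order neighbours by the angle between vectors.

```python
import numpy as np


def nearest_by_cosine(vectors, query, n=10):
    """Return indices of the n rows of `vectors` closest to `query` by angle."""
    vectors = np.asarray(vectors, dtype=np.float64)
    query = np.asarray(query, dtype=np.float64)
    # Normalise rows so that a dot product equals the cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = unit @ q
    # Highest cosine similarity first.
    return np.argsort(-sims)[:n]
```

For 1300 vectors of length 150 this is a single small matrix-vector product per query, and it involves no compiled extension beyond numpy itself.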

erikbern avatar Sep 16 '19 20:09 erikbern

@erikbern Today 1300, tomorrow 10x more.

Anyway, I downgraded to 1.15.2 and so far no crashes. Fingers crossed.

lbustelo avatar Sep 16 '19 21:09 lbustelo

Ok, that’s odd that a previous version resolved the issue.

You haven’t been able to reproduce it locally?

erikbern avatar Sep 16 '19 23:09 erikbern

I have not been able to trigger it locally. I package and test within a docker container running images from lambci.

lbustelo avatar Sep 17 '19 14:09 lbustelo

ok, can you reproduce it inside the docker container locally?

erikbern avatar Sep 17 '19 22:09 erikbern

I am having this exact same issue with a different NNS library: https://github.com/nmslib/hnswlib

Believing the problem to be in hnswlib, I was looking for alternatives when I came across this issue.

I now believe the problem to be with Lambda and using C++ dependencies in python.

blefevre avatar Nov 06 '19 16:11 blefevre

We are observing similar symptoms on Google App Engine. It works fine for a while, and then crashes on get_nns_by_vector. Nothing in the logs, just nginx reporting that upstream crashed.

ianterrell avatar Mar 09 '20 22:03 ianterrell

Just had the exact same case here, not using AWS or App Engine however. I have a Python webapp run with a Docker container (image built on MacOS and it works fine), but when I switched to Github actions to perform the docker build, a call to get_nns_by_vector makes the process crash (core dumped).

raphael0202 avatar Jun 12 '20 12:06 raphael0202

If anyone has any easy reproduction steps, would appreciate it!

erikbern avatar Jun 12 '20 18:06 erikbern

@erikbern I have this problem on Lambda too with version 1.17+ but not with 1.15.2. I noticed that this flag was removed from the setup in 1.17+:

https://github.com/spotify/annoy/blob/v1.15.2/setup.py#L42

if os.environ.get('TRAVIS') == 'true':
    # Resolving some annoying issue
    extra_compile_args += ['-mno-avx']

Before seeing this, we recompiled 1.17, changing https://github.com/spotify/annoy/blob/master/setup.py#L43 to

cputune = ['-march=haswell',]

It looks like a problem with AVX on Lambda. The fix recommended at compile time is described here: https://docs.aws.amazon.com/lambda/latest/dg/runtimes-avx2.html

nubol23 avatar Jan 19 '23 15:01 nubol23

Got it. It would be nicer if Annoy could detect the presence of AVX in runtime but that's a bit complex unfortunately.
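Short of detection inside Annoy itself, a caller could check the CPU flags from Python before touching the index. This is only a sketch under stated assumptions: it assumes Linux on x86, where `/proc/cpuinfo` has a `flags` line, and the `required` tuple is a guess, since the instructions a given build actually needs depend on the `-march` value it was compiled with. All function names are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text, required=("avx", "avx2")):
    """List the required features that the CPU does not advertise."""
    flags = cpu_flags(cpuinfo_text)
    return [name for name in required if name not in flags]


def assert_cpu_supported(path="/proc/cpuinfo"):
    """Raise RuntimeError if the host CPU lacks an assumed-required feature."""
    with open(path) as fh:
        missing = missing_features(fh.read())
    if missing:
        raise RuntimeError("CPU lacks " + ", ".join(missing))
```

Calling `assert_cpu_supported()` at the start of the handler would turn a silent crash into an ordinary Python exception on hosts that lack the listed features.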

erikbern avatar Jan 25 '23 17:01 erikbern

Yes. Either way, I think it would be good to restore the removed check, or add a disclaimer to the README telling people who want to use Annoy on Lambda or similar platforms to recompile it with the flag.
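A rough sketch of what such a check might look like in a setup script. The `TRAVIS` branch mirrors the 1.15.2 snippet quoted above. Everything else is an assumption for illustration: `select_cpu_flags` and the `ANNOY_PORTABLE_MARCH` variable are hypothetical names, and the `-march=native` default is assumed rather than taken from the current setup.py.

```python
import os


def select_cpu_flags(environ=None, default=("-march=native",)):
    """Pick CPU tuning flags for the build.

    ANNOY_PORTABLE_MARCH is a hypothetical variable used only in this sketch.
    """
    environ = os.environ if environ is None else environ
    if environ.get("TRAVIS") == "true":
        # Same behaviour as the check that existed in 1.15.2.
        return ["-mno-avx"]
    target = environ.get("ANNOY_PORTABLE_MARCH")
    if target:
        # Build for a fixed, known architecture instead of the build host.
        return ["-march=" + target]
    return list(default)
```

With `ANNOY_PORTABLE_MARCH=haswell` this would produce the same `-march=haswell` flag as the manual edit described earlier.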

nubol23 avatar Jan 25 '23 19:01 nubol23