annoy
Crashes when used within AWS Lambda Python 3.6
Runtime
Annoy 1.16.0 on the AWS Lambda Python 3.6 runtime
Problem
Annoy causes an interpreter crash. AWS continues to use the same Lambda container for subsequent calls, but the container keeps failing each time.
More details
- Index has 1300 items
- The vectors are 150 in length
- Configured with 10 trees
- Using the angular metric
- The index is packaged with the Lambda code (a minimal reconstruction is sketched below)
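For reference, roughly how an index with these parameters would be built and saved (the data here is a random stand-in):

```python
import random
from annoy import AnnoyIndex

DIM, N_ITEMS, N_TREES = 150, 1300, 10
index = AnnoyIndex(DIM, 'angular')
for i in range(N_ITEMS):
    index.add_item(i, [random.gauss(0, 1) for _ in range(DIM)])
index.build(N_TREES)
index.save('index.ann')  # this file is what gets packaged with the Lambda code
```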
The problem seems to occur when calling any of:
- get_nns_by_item
- get_distance
- get_nns_by_vector (did not test, but I assume it follows the same code path that causes the issue)

Nothing is printed to the console. The Lambda just terminates with a "Process exited before completing request" message in the logs.
I tried to see if I could identify an issue with loading the Annoy index by calling:
- get_n_items
- get_n_trees
- get_item_vector

All of these worked fine.
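Roughly, the checks that did succeed (the path here is illustrative):

```python
from annoy import AnnoyIndex

idx = AnnoyIndex(150, 'angular')
idx.load('index.ann')
print(idx.get_n_items())       # returns 1300 as expected
print(idx.get_n_trees())       # returns 10 as expected
print(idx.get_item_vector(0))  # returns a 150-dim vector without crashing
```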
The same packaged code also yields Lambda instances that work fine. It seems that once an instance is able to successfully call get_nns_by_item, it just keeps working. Instances that fail on a call to get_nns_by_item never recover. Note that since the interpreter crashes, the next time Lambda invokes the same instance/container, it goes through the entire cold-start process and loads the index again.
The most unfortunate part is that since there is no way of trapping the problem in the Python code (or at least I have not found one), the Lambda exits but AWS continues to reuse the instance. If we could catch these errors at the Python level, I could throw a RuntimeError to force the destruction of the instance.
Any help is appreciated.
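One idea I have not tried yet (a hypothetical sketch; names and the timeout are made up): isolate the query in a child process so that a native crash surfaces as an exit code the parent can turn into a RuntimeError:

```python
import multiprocessing as mp

def _query(index_path, item, n, conn):
    from annoy import AnnoyIndex
    idx = AnnoyIndex(150, 'angular')  # dimension/metric from the setup above
    idx.load(index_path)
    conn.send(idx.get_nns_by_item(item, n))
    conn.close()

def safe_get_nns_by_item(index_path, item, n, timeout=10):
    parent, child = mp.Pipe(duplex=False)
    p = mp.Process(target=_query, args=(index_path, item, n, child))
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.terminate()
        p.join()
    if p.exitcode != 0:  # a segfault shows up as a negative exit code
        raise RuntimeError('annoy query crashed (exit code %s)' % p.exitcode)
    return parent.recv()
```

Reloading the index in the child on every call is wasteful, but it may be the simplest form of isolation available.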
Just a random thought, but if you truly only have 1300 items, why do you even need Annoy? Just stick it in a numpy array and do an exhaustive search.
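For illustration (the names here are made up), an exhaustive search matching Annoy's angular metric is a few lines of numpy, and at 1300 × 150 it is a tiny matrix-vector product:

```python
import numpy as np

def exhaustive_nns(vectors, query, n):
    # vectors: (n_items, 150) array; normalizing rows means cosine similarity
    # (the basis of Annoy's angular distance) reduces to a dot product.
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return np.argsort(-(v @ q))[:n]  # indices of the n most similar items
```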
@erikbern Today 1300, tomorrow 10x more.
Anyway, I downgraded to 1.15.2 and so far no crashes. Fingers crossed.
Ok, that’s odd that a previous version resolved the issue.
You haven’t been able to reproduce it locally?
I have not been able to trigger it locally. I package and test within a Docker container running images from lambci.
ok, can you reproduce it inside the docker container locally?
I am having this exact same issue with a different NNS library: https://github.com/nmslib/hnswlib. Believing the problem to be in hnswlib, I was looking for alternatives when I came across this issue.
I now believe the problem to be with Lambda and using C++ dependencies in Python.
We are observing similar symptoms on Google App Engine. It works fine for a while, and then crashes on get_nns_by_vector. Nothing in the logs, just nginx reporting that the upstream crashed.
Just had the exact same case here, though not using AWS or App Engine. I have a Python webapp running in a Docker container (image built on macOS, where it works fine), but when I switched to GitHub Actions to perform the Docker build, a call to get_nns_by_vector makes the process crash (core dumped).
If anyone has any easy reproduction steps, would appreciate it!
@erikbern I have this problem with Lambda too, with version 1.17+ but not with 1.15.2. I've seen that this flag was removed from the setup in 1.17+:
https://github.com/spotify/annoy/blob/v1.15.2/setup.py#L42
```python
if os.environ.get('TRAVIS') == 'true':
    # Resolving some annoying issue
    extra_compile_args += ['-mno-avx']
```
Before seeing this, we recompiled 1.17, changing https://github.com/spotify/annoy/blob/master/setup.py#L43 to:

```python
cputune = ['-march=haswell',]
```
It looks like it is a problem with AVX on Lambda; the fix is the compile-time flag recommended here: https://docs.aws.amazon.com/lambda/latest/dg/runtimes-avx2.html
Got it. It would be nicer if Annoy could detect the presence of AVX at runtime, but that's a bit complex unfortunately.
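For reference, a runtime check at the Python level is easy enough on Linux (an illustrative sketch only; the actual dispatch would have to happen in the compiled C++ code):

```python
def cpu_supports_avx():
    # Linux-only sketch: look for the 'avx' flag in /proc/cpuinfo.
    try:
        with open('/proc/cpuinfo') as f:
            for line in f:
                if line.startswith('flags'):
                    return 'avx' in line.split()
    except OSError:
        pass
    return False
```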
Yes. Anyway, I think it would be good to add back the removed check, or add a disclaimer in the README telling people who want to use it on Lambda or similar environments to recompile with the flag.