Python 3.8 illegal instruction

Open rickycorte opened this issue 3 years ago • 21 comments

This is mostly meant to help others who hit the same error. Running tensorflow_recommenders with tensorflow 2.5 on python 3.8 kills the interpreter with an Illegal instruction (core dumped) error when running import tensorflow_recommenders.

I wasted a bit of time and came up with a solution: match your environment exactly to the one running on Colab. In this particular case using python 3.7 fixes the issue.

I'd suggest clearly stating the supported versions in the guides and the readme of this repository (and maybe also in setup.py, which stops at 3.6).

rickycorte avatar Jul 05 '21 18:07 rickycorte

Could you verify this is not a TensorFlow issue? It's extremely unlikely for TFRS to cause this as it has no compiled code.

maciejkula avatar Jul 07 '21 16:07 maciejkula

I've tested my tensorflow installation by running other models with no issues. Investigating further, I've discovered that the crash probably originates from the scann library, which is loaded on import. I installed scann because I was following this guide to play around and test things: https://www.tensorflow.org/recommenders/examples/basic_retrieval

By setting PYTHONFAULTHANDLER="1" I got this stack trace: crash_log. I believe that at some point while loading scann there is a load that fails due to a missing low-level library or maybe a missing symbol. I've also tried to compile scann manually, without success. Uninstalling scann solves the crashes. Maybe in the next few days I'll try to compile tensorflow and then scann and see if it still crashes.
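For anyone who prefers to do this from inside a script, the standard-library faulthandler module does the same job as the PYTHONFAULTHANDLER environment variable. The following is only a minimal sketch: the helper name import_with_traceback is made up for illustration, and "json" is a stand-in for scann or tensorflow_recommenders.

```python
import faulthandler
import importlib
import sys


def import_with_traceback(module_name):
    """Enable faulthandler (if it is not already on), then import the module.

    If the import dies with a fatal signal such as SIGILL, the interpreter
    prints the Python-level stack to stderr before exiting.
    """
    if not faulthandler.is_enabled():
        # sys.__stderr__ is the original stderr, which has a real file descriptor.
        faulthandler.enable(file=sys.__stderr__, all_threads=True)
    return importlib.import_module(module_name)


# "json" is a harmless placeholder; swap in "scann" to reproduce the crash trace.
module = import_with_traceback("json")
print("imported", module.__name__)
```

The traceback only shows Python frames, so it points at the import that triggered the crash rather than at the offending machine instruction.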

rickycorte avatar Jul 07 '21 18:07 rickycorte

Thanks!

I'm guessing this is because the ScaNN wheels aren't built for Python 3.8. @sammymax does that ring a bell?

maciejkula avatar Jul 07 '21 19:07 maciejkula

I've also tried with anaconda and python 3.7 but I still get the same result: scann_crash.txt

rickycorte avatar Jul 07 '21 19:07 rickycorte

Hey thanks for reporting this bug! Can you provide the CPU model and operating system you're using? This kind of crash generally comes when the CPU tries to execute some vectorized instruction (AVX2, AVX, etc.) that the CPU in fact doesn't support. I'll be able to reproduce the issue a lot more easily once I know your OS and CPU details.
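A quick way to see which vector extensions the CPU advertises on Linux is to read the flags line of /proc/cpuinfo. This is a rough sketch assuming an x86 Linux machine; the function names are illustrative, and an empty result just means the file or the flags line was not found.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read the flags on Linux; return an empty set if the file is unavailable."""
    try:
        return parse_cpu_flags(Path(path).read_text())
    except OSError:
        return set()


if __name__ == "__main__":
    flags = read_cpu_flags()
    for feature in ("avx", "avx2", "fma", "avx512f"):
        print(f"{feature}: {'yes' if feature in flags else 'no'}")
```

If avx shows up but avx2 does not, a binary compiled with AVX2 enabled can hit an illegal instruction on that machine.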

sammymax avatar Jul 09 '21 16:07 sammymax

I'm using an i7-3770K, which should support AVX according to the Intel page. I'm running Ubuntu 20.04.2 LTS.

rickycorte avatar Jul 09 '21 17:07 rickycorte

I've been trying to reproduce this issue but I haven't been able to. I'm also using Ubuntu 20.04.2 LTS with an Ivy Bridge-era CPU (AVX but not AVX2 support). Are you using the system Python 3.8 or one from somewhere else (like pyenv)?

sammymax avatar Jul 14 '21 04:07 sammymax

I'm using the system python for version 3.8. I've also tried python 3.7 with conda and still have the same kind of issue.

rickycorte avatar Jul 14 '21 15:07 rickycorte

I installed Conda with Python 3.7 and I also couldn't reproduce. I installed Anaconda 2021.05 from here and then did

conda create -n py37 python=3.7 anaconda
conda activate py37
pip install scann

and the import worked fine, and ScaNN managed to train and search ok too. This was all done on an Ivy Bridge CPU that should be very similar to your i7-3770K.

sammymax avatar Jul 15 '21 21:07 sammymax

I got the issue by running this to create the environment:

conda create -n py37 python=3.7 anaconda
conda activate py37
pip install tensorflow
pip install tensorflow-recommenders
pip install scann

I run export PYTHONFAULTHANDLER="1" to see the crash stack trace. Then I run python and type import tensorflow_recommenders. This way I get a crash with a stack trace similar to the ones I've posted before.

If I try to import scann directly:

[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import scann
2021-07-15 23:42:32.395084: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Fatal Python error: Illegal instruction

Current thread 0x00007fa464554740 (most recent call first):
  File "/home/rickycorte/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/framework/load_library.py", line 58 in load_op_library
  File "/home/rickycorte/anaconda3/envs/py37/lib/python3.7/site-packages/scann/scann_ops/py/scann_ops.py", line 26 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 728 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1035 in _handle_fromlist
  File "/home/rickycorte/anaconda3/envs/py37/lib/python3.7/site-packages/scann/__init__.py", line 2 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 728 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "<stdin>", line 1 in <module>
Illegal instruction (core dumped)

If I run pip uninstall scann and retry importing tensorflow_recommenders, everything works fine.

Found existing installation: scann 1.2.2
Uninstalling scann-1.2.2:
  Would remove:
    /home/rickycorte/anaconda3/envs/py37/lib/python3.7/site-packages/scann-1.2.2.dist-info/*
    /home/rickycorte/anaconda3/envs/py37/lib/python3.7/site-packages/scann/*
Proceed (y/n)? y
  Successfully uninstalled scann-1.2.2
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow_recommenders
2021-07-15 23:44:38.629808: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
>>> 

Edit: I tried to install only scann as you did and I got the same error. I'm starting to think that maybe there is something wrong with my CUDA installation outside of conda. I'll try it on a virtual machine with a clean Ubuntu installation without any NVIDIA libraries.

Edit 2: Tried on a freshly created VM using native python 3.8 and no NVIDIA CUDA, but the issue persists.

Edit 3: Tried on a new VM running on top of Windows 10 and it's still the same (no CUDA on host or guest). At this point I guess it's some kind of issue with my machine that is not easily reproducible.
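When repeating this kind of test across several environments, it can help to run the import in a child interpreter so the crash does not take down the shell session or a larger script. Below is a minimal sketch using only the standard library; probe_import is a hypothetical helper name, and the negative return code convention applies to POSIX systems.

```python
import signal
import subprocess
import sys


def probe_import(module_name, timeout=120):
    """Import a module in a child interpreter and describe what happened."""
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    if result.returncode == 0:
        return "ok"
    if result.returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        try:
            return "killed by " + signal.Signals(-result.returncode).name
        except ValueError:
            return "killed by signal %d" % -result.returncode
    return "import failed (exit code %d)" % result.returncode


if __name__ == "__main__":
    # "json" always imports; replace the list with ["scann"] on an affected machine.
    for name in ["json"]:
        print(name, "->", probe_import(name))
```

On a machine affected by this issue, the expectation would be a "killed by SIGILL" result for scann, while an ordinary missing package shows up as a plain import failure.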

rickycorte avatar Jul 15 '21 21:07 rickycorte

I think I'm seeing the crash in scann:

Thread 1 "python" received signal SIGILL, Illegal instruction.                    
0x00007fff183f910a in google::protobuf::FieldDescriptorProto::FieldDescriptorProto() () from .../python3.10/site-packages/scann/scann_ops/cc/_scann_ops.so

   0x00007fff183f9105 <+53>:  vmovq  %rax,%xmm0
=> 0x00007fff183f910a <+58>:  vpbroadcastq %xmm0,%ymm0
   0x00007fff183f910f <+63>:  vmovdqu %ymm0,0x18(%rbx)

I'm pretty sure VPBROADCAST from xmm to ymm is an AVX512 instruction, which my CPU (Sandybridge) doesn't have.

emikulic avatar Aug 14 '22 08:08 emikulic

Thanks for debugging--I think the issue is that vpbroadcastq is an AVX2 instruction, which Sandy Bridge doesn't support. We will look into compiling the ScaNN wheels for the next release without the -mavx2 flag so that this issue is resolved. You can try compiling ScaNN yourself without that flag in the meantime to see if that fixes the problem.

sammymax avatar Aug 19 '22 19:08 sammymax

Nice, thanks! I was able to build scann without AVX2 using an older version of bazel :)

emikulic avatar Aug 20 '22 02:08 emikulic

@emikulic could you post the steps to build scann without AVX2 using an older version of bazel? We are having trouble installing old versions of bazel.

Thanks

avber avatar Aug 26 '22 12:08 avber

@sammymax a docker file for the build environment would be nice to have

avber avatar Aug 26 '22 12:08 avber

Here's a related Dockerfile that might help; it compiles a version of TensorFlow Serving linked against ScaNN: https://github.com/google-research/google-research/blob/master/scann/tf_serving/Dockerfile.devel

What problems have you encountered with old versions of Bazel?

sammymax avatar Aug 26 '22 17:08 sammymax

This worked for me:

git clone git@github.com:google-research/google-research.git --depth=1
cd google-research/scann/
python configure.py
# get https://github.com/bazelbuild/bazelisk/releases/download/v1.12.0/bazelisk-linux-amd64
# install as "bazel"
echo 3.7.2 > .bazelversion
# note -march=native instead of -mavx2:
CC=clang bazel build -c opt --features=thin_lto --copt=-march=native --cxxopt="-std=c++17" --copt=-fsized-deallocation --copt=-w :build_pip_pkg
./bazel-bin/build_pip_pkg
# produces scann-1.2.7-cp310-cp310-linux_x86_64.whl which you can "pip install"

emikulic avatar Aug 26 '22 22:08 emikulic

@emikulic @sammymax Thank you, we have compiled scann-1.2.7 successfully.

However, export of a trained TF Lite model failed (TF 2.9.1 and scann-1.2.7). Export worked on Colab (TF 2.8.2 and scann-1.2.6)

@sammymax Is it possible to check out scann-1.2.6 branch from the repo?

As for Bazel, the sysadmin installed a 5.x version but couldn't downgrade it for some reason.

avber avatar Sep 06 '22 12:09 avber

Were you able to use bazelisk to get an older version of bazel?

emikulic avatar Sep 07 '22 00:09 emikulic

Yes, the sysadmin managed to install an older version of bazel.

avber avatar Sep 07 '22 01:09 avber

ScaNN 1.2.8 was recently released and doesn't assume AVX2 support; we now compile with -mavx rather than -mavx2, and do runtime dispatch to AVX2, when supported, for the important routines. Hopefully this helps.
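To check whether an environment already has a release with this change, the installed version can be read from package metadata without importing scann at all, which avoids triggering the crash. This is a small sketch assuming Python 3.8+ (for importlib.metadata); the helper names are made up for illustration and the version parsing is deliberately simplistic.

```python
from importlib import metadata  # available in the standard library from Python 3.8


def parse_version(text):
    """Turn '1.2.8' into (1, 2, 8), stopping at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def predates_avx2_fix(installed_version, fixed_in="1.2.8"):
    """True if the installed version is older than the release mentioned above."""
    return parse_version(installed_version) < parse_version(fixed_in)


def installed_scann_version():
    """Return the installed scann version string, or None if it is not installed."""
    try:
        return metadata.version("scann")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_scann_version()
    if version is None:
        print("scann is not installed")
    elif predates_avx2_fix(version):
        print(f"scann {version} predates 1.2.8; consider upgrading")
    else:
        print(f"scann {version} is 1.2.8 or newer")
```

Pre-release suffixes such as "1.2.8rc1" are treated as older than 1.2.8 here, which errs on the side of suggesting an upgrade.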

sammymax avatar Sep 15 '22 17:09 sammymax