
Segmentation fault when running onnxruntime inside docker with cpuset restrictions

Open yindavidyang opened this issue 3 years ago • 13 comments

Describe the bug

ONNX Runtime crashes when I run it inside Docker with a CPU restriction specified via "--cpuset-cpus". The crash doesn't happen when running Docker without the "--cpuset-cpus" argument, or when "--cpuset-cpus" grants a large number of CPU cores.


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • ONNX Runtime installed from (source or binary): pip
  • ONNX Runtime version: 1.7.0
  • Python version: 3.8
  • Visual Studio version (if applicable):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:

To Reproduce


Hardware: 32-core AMD CPU (64 threads), 4x 2080 Ti GPUs.

The crash doesn't happen when I provision many cores, such as "--cpuset-cpus 0-31".

docker run --rm -it --gpus all --cpuset-cpus 0-15 nvidia/cuda:11.0.3-cudnn8-devel-ubuntu20.04

Then, inside the Docker container:

apt update
apt install python3-pip wget
pip3 install onnxruntime
wget https://github.com/onnx/models/blob/master/vision/classification/mnist/model/mnist-7.onnx?raw=true -O mnist.onnx
python3

Then, inside python3:

import onnxruntime as ort
ort.InferenceSession('mnist.onnx') # crash!
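As a diagnostic, the mismatch that appears to trigger this can be observed from inside the container: `os.cpu_count()` reports all logical CPUs on the machine, while the scheduler affinity mask reflects "--cpuset-cpus". A Linux-only sketch (not part of the original report):

```python
import os

# Total logical CPUs the kernel reports; ignores cpuset/cgroup restrictions
total = os.cpu_count()

# CPUs this process is actually allowed to run on; respects --cpuset-cpus
allowed = len(os.sched_getaffinity(0))

print(f"cpu_count={total}, affinity={allowed}")
```

With "--cpuset-cpus 0-15" on the 64-thread machine above, the two numbers would differ (64 vs. 16).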




yindavidyang avatar Apr 01 '21 17:04 yindavidyang

Can you paste the stack trace here?

pranavsharma avatar Apr 01 '21 21:04 pranavsharma

There's no stack trace -- just a one-line message saying core dumped.

yindavidyang avatar Apr 02 '21 20:04 yindavidyang

Run gdb <executable name> <core file name>. See this to get the location of the core file: https://askubuntu.com/a/1109747
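For the Docker case specifically, core collection can be enabled manually instead of via apport. A sketch, with example paths; note that /proc/sys/kernel/core_pattern is shared with the host kernel, so writing it from inside the container may require --privileged:

```shell
# Lift the core-size limit for this shell session
ulimit -c unlimited

# Write core files to /tmp with a predictable name
# (%e = executable name, %p = pid; this setting is global to the host kernel)
echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

# Reproduce the crash, then open the resulting core file and run `bt`
python3 -c "import onnxruntime as ort; ort.InferenceSession('mnist.onnx')"
gdb python3 /tmp/core.python3.*
```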

pranavsharma avatar Apr 02 '21 21:04 pranavsharma

Could you tell me how to collect a core dump when running inside a Docker image (note that this crash only happens within a Docker image with a cpuset-cpus setting)? I followed the askubuntu instructions but kept getting "read-only file system" errors when trying to configure apport, and the "/var/crash" folder was empty after the crash.

Another question: I guess the executable name should be python3, correct? The command that led to the core dump is python3 -c "import onnxruntime as ort; ort.InferenceSession('mnist.onnx')".

BTW sometimes I got a "bus error (core dumped)" message instead of segmentation fault.

yindavidyang avatar Apr 03 '21 11:04 yindavidyang

I found a "core" file in the current folder where I ran python3. Here's what GDB says:

Core was generated by `python3'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f2862dded52 in std::_Function_handler<bool (), onnxruntime::concurrency::ThreadPoolTempl<onnxruntime::Env>::WorkerLoop(int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /usr/local/lib/python3.8/dist-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-38-x86_64-linux-gnu.so
[Current thread is 1 (Thread 0x7f284d2ba700 (LWP 825))]

yindavidyang avatar Apr 03 '21 11:04 yindavidyang

I cannot repro the crash.

(pranav-py37) pranav@FooMachine:~$ docker pull nvidia/cuda:11.0.3-cudnn8-devel-ubuntu20.04
(pranav-py37) pranav@FooMachine:~$ docker run --rm -it --gpus all --cpuset-cpus 0-15 nvidia/cuda:11.0.3-cudnn8-devel-ubuntu20.04
root@113a6dc63d69:/# python3
Python 3.8.5 (default, Jan 27 2021, 15:41:15)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import onnxruntime as ort
>>> ort.InferenceSession('mnist.onnx')
<onnxruntime.capi.onnxruntime_inference_collection.InferenceSession object at 0x7ffb9a09e280>
>>>

pranavsharma avatar Apr 08 '21 08:04 pranavsharma

I also hit this problem. I use mcr.microsoft.com/azureml/onnxruntime:v1.6.0-cuda10.2-cudnn8 with docker-compose cpuset: 0-7, and it crashes with a core dump; the host OS is CentOS Linux release 7.4.1708.

austingg avatar Sep 19 '21 02:09 austingg

Seen this problem as well. A solution that worked for me was to set the number of intra_op_num_threads to something corresponding to the number of available cores:

import onnxruntime as ort
sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 8
sess = ort.InferenceSession('some_model.onnx', sess_options=sess_options)

srolsorama avatar Nov 10 '21 11:11 srolsorama

Seen this problem as well. A solution that worked for me was to set the number of intra_op_num_threads to something corresponding to the number of available cores.

Thank you so much! That works for me!!!

ppalantir avatar Dec 06 '21 04:12 ppalantir

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

stale[bot] avatar Apr 17 '22 10:04 stale[bot]

Seen this problem as well. A solution that worked for me was to set the number of intra_op_num_threads to something corresponding to the number of available cores.

Thank you so much! It solves my problems as well.

yuleichin avatar Jul 20 '22 11:07 yuleichin

Seen this problem as well. A solution that worked for me was to set the number of intra_op_num_threads to something corresponding to the number of available cores.

Thank you so much!!!

l3yx avatar Sep 16 '22 02:09 l3yx

I can't reproduce the segfault (onnxruntime 1.12); however, when loading the model in a container with the cpuset argument, ort.InferenceSession simply never returns. Setting the number of allowed threads fixes my problem, though.

Buillaume avatar Sep 19 '22 16:09 Buillaume

Seen this problem as well. A solution that worked for me was to set the number of intra_op_num_threads to something corresponding to the number of available cores.

Is intra_op_num_threads supposed to be the number of cores on the machine or the number of cores I want the ONNX model to be restricted to?

timbmg avatar Feb 23 '23 10:02 timbmg

Hello, I wanted to mention that we observe very similar behavior.

We use InferenceSession for CPU-based inference in our production service. We deploy on EKS and specify a CPU request of 3000m in the Kubernetes deployment. If we deploy to a t3.xlarge instance, which hosts one pod, all works well, but when we deploy to a t3.2xlarge, which hosts 2 pods, we start seeing segmentation faults on shutdown. The error happens during exit of the Python process. If we hardcode intra_op_num_threads=1 (or 2), it seems to work.

fsrajer avatar Oct 26 '23 10:10 fsrajer