
Model loading failure: densenet_onnx fails to load due to "pthread_setaffinity_np" failure

Status: Open · shrek opened this issue 3 years ago · 4 comments

Description

I am testing tritonserver on the example models fetched using this script: https://github.com/triton-inference-server/server/blob/main/docs/examples/fetch_models.sh

Triton Server is run as follows:

export MODEL_PATH=/tmp/tensorrt-inference-server
/opt/tritonserver/bin/tritonserver  --strict-model-config=false --model-store=$MODEL_PATH/docs/examples/model_repository 2>&1 | tee $MODEL_PATH/svrStatus.txt

the server fails with:

I1130 21:40:16.147155 3120 server.cc:267] Timeout 29: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

The densenet_onnx model fails to load with:

| densenet_onnx        | 1       | UNAVAILABLE: Internal: onnx runtime error 1: /workspace/onnxruntime/onnxruntime/core/platform/posix/env.cc:173 onnxruntime::{anonymous}::PosixThread::PosixThread(const char*, int, unsigned int (*)(int, Eigen::ThreadPoolInterface*), Eigen::ThreadPoolInterface*, const onnxruntime::ThreadOptions&) pthread_setaffinity_np failed, error code: 2 error msg: No such file or directory |
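For reference, the numeric error code in that message can be decoded with Python's standard `errno` module. This is a quick sanity check added here for illustration, not part of the original report; it confirms that error code 2 is `ENOENT`, i.e. the "No such file or directory" string in the log:

```python
import errno
import os

# Decode the numeric error code reported by ONNX Runtime in the log line above.
code = 2
print(errno.errorcode[code])   # symbolic name of errno 2
print(os.strerror(code))       # human-readable message for errno 2
```

`ENOENT` is an unusual return from `pthread_setaffinity_np` (the manual documents `EINVAL`/`ESRCH`), which suggests the failure may actually originate from the kernel rejecting a CPU mask that falls outside the cgroup cpuset.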

The container has a restricted cpuset, which likely contributes to the failure above:

cat /sys/fs/cgroup/cpuset/cpuset.cpus
9-12,49-52

Triton Server works fine in another container whose cpuset looks like this:

cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-255
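The mismatch can be inspected from inside the container with a short, Linux-only Python snippet (added here as a diagnostic sketch, not part of the original report). `os.sched_getaffinity(0)` respects the cgroup cpuset, while `os.cpu_count()` reports every CPU on the machine, and it is exactly this kind of gap that thread-affinity code can trip over:

```python
import os

# CPUs this process is actually allowed to run on; on Linux this
# reflects the cgroup cpuset and any sched_setaffinity restrictions.
allowed = os.sched_getaffinity(0)
print(f"{len(allowed)} CPUs allowed: {sorted(allowed)}")

# os.cpu_count() reports all CPUs in the machine, ignoring the cpuset.
print(f"os.cpu_count() reports {os.cpu_count()}")
```

In the failing container above, `allowed` would be `{9, 10, 11, 12, 49, 50, 51, 52}` while `os.cpu_count()` would report the full machine.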

Likely the ONNX Runtime thread-affinity settings need to be constrained to the CPUs in the container's cpuset; as it stands, the runtime appears to try to pin worker threads to CPUs outside the allowed set.
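One possible workaround (untested here; the parameter names below come from the onnxruntime_backend documentation at the time of writing) is to cap the ONNX Runtime intra- and inter-op thread pools in the model's `config.pbtxt` so the runtime creates no more threads than the cpuset allows:

```
# model_repository/densenet_onnx/config.pbtxt (sketch, appended to the
# existing config; thread counts chosen to fit the 8-CPU cpuset above)
parameters { key: "intra_op_thread_count" value: { string_value: "8" } }
parameters { key: "inter_op_thread_count" value: { string_value: "1" } }
```

Whether this avoids the `pthread_setaffinity_np` call, or merely shrinks the pool, depends on how the backend builds its `ThreadOptions`, so treat it as an experiment rather than a confirmed fix.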

Triton Information

What version of Triton are you using? 2.15.0

Are you using the Triton container or did you build it yourself? Using the NVIDIA NGC container tritonserver:21.10-py3.

To Reproduce

Run the tritonserver container with a restricted cpuset (for example, by passing `--cpuset-cpus` to `docker run`). Inside the container:

export MODEL_PATH=/tmp/tensorrt-inference-server

git clone https://github.com/NVIDIA/tensorrt-inference-server.git ${MODEL_PATH}
cd ${MODEL_PATH}/docs/examples/
bash fetch_models.sh

/opt/tritonserver/bin/tritonserver --strict-model-config=false --model-store=${MODEL_PATH}/docs/examples/model_repository 2>&1 | tee ${MODEL_PATH}/svrStatus.txt

Expected behavior

There should be no failure to load the densenet_onnx model.

shrek · Nov 30 '21 21:11

same problem.

inkinworld · Dec 15 '21 06:12

+1, also subscribing to this issue

Rikanishu · Jan 21 '22 16:01

same problem

scse-l · Oct 25 '22 02:10

same problem

ruanmk · Apr 15 '23 02:04