
Server Hangs when loading python backend w/ pytorch (including the example)

Open pionex opened this issue 2 years ago • 1 comment

Description
I am trying to load the Python backend PyTorch example using the Docker-hosted Triton server (22.04-py3). If the model tries to import torch, the server appears to hang forever during loading. If I comment out the import line (and the code that uses it), the model loads.

Triton Information
NVIDIA Release 22.04 (build 36821869)
Triton Server Version 2.21.0

Are you using the Triton container or did you build it yourself? Triton Container

To Reproduce

  1. Create a model using the files from the repo: https://github.com/triton-inference-server/python_backend/tree/main/examples/pytorch

  2. Create a conda environment with PyTorch (a packing sketch follows after these steps):

     ```
     export PYTHONNOUSERSITE=True
     conda create -c pytorch -n torch python=3.8 pytorch
     ```

     Versions:
     python   3.8.13   h12debd9_0
     pytorch  1.12.1   py3.8_cuda11.3_cudnn8.3.2_0   pytorch

  3. Modify the config file to point the EXECUTION_ENV_PATH parameter at the conda environment:

     ```
     { key: "EXECUTION_ENV_PATH", value: {string_value: "$$TRITON_MODEL_DIRECTORY/torch.tar.gz"} }
     ```

  4. Launch the server:

     ```
     docker run --gpus=1 --rm --net=host -v /home/perception/models:/models nvcr.io/nvidia/tritonserver:22.04-py3 tritonserver --model-repository=/models --log-verbose=1
     ```

The server hangs after this output:

```
I0922 01:31:42.068212 1 python.cc:1769] Using Python execution env /models/pytorch/yolactpy.tar.gz
I0922 01:31:42.068527 1 python.cc:2054] TRITONBACKEND_ModelInstanceInitialize: pytorch_0 (CPU device 0)
I0922 01:31:42.068560 1 backend_model_instance.cc:68] Creating instance pytorch_0 on CPU using artifact ''
I0922 01:32:21.307565 84 python.cc:630] Starting Python backend stub: source /tmp/python_env_VLyc7f/0/bin/activate && exec env LD_LIBRARY_PATH=/tmp/python_env_VLyc7f/0/lib:$LD_LIBRARY_PATH /opt/tritonserver/backends/python/triton_python_backend_stub /models/pytorch/1/model.py triton_python_backend_shm_region_1 67108864 67108864 1 /opt/tritonserver/backends/python 336 pytorch_0
```
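For reference, a minimal sketch of how the tarball in step 2 can be produced, assuming conda-pack (the tool the python_backend README uses for packaging execution environments) is installed; the environment name and paths match the steps above:

```bash
# Build the environment with user site-packages disabled, as in step 2.
export PYTHONNOUSERSITE=True
conda create -c pytorch -n torch python=3.8 pytorch

# conda-pack bundles the environment into a relocatable tarball.
conda install -c conda-forge conda-pack
conda-pack -n torch -o torch.tar.gz

# Place the tarball in the model directory so $$TRITON_MODEL_DIRECTORY resolves to it.
mv torch.tar.gz /home/perception/models/pytorch/
```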


Expected behavior
The server should load the model and proceed as usual. In my case, the server hangs at the last log line shown above and never binds to the socket or otherwise continues. nvidia-smi shows tritonserver using about 700 MB of GPU memory.

pionex avatar Sep 22 '22 01:09 pionex

I believe there is an issue with the conda environment. I used a shell to install torch via pip inside the Docker instance, and the model loaded instantly with no problems. When I ran strace while using the conda environment, I noticed some files could not be opened. For example:

```
[pid 4153576] openat(AT_FDCWD, "/tmp/python_env_3uVpgg/0/lib/python3.8/site-packages/mkl/../../.././libmemkind.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 4153576] openat(AT_FDCWD, "/tmp/python_env_3uVpgg/0/lib/libmemkind.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 4153576] openat(AT_FDCWD, "/usr/local/cuda/compat/lib.real/libmemkind.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 4153576] openat(AT_FDCWD, "/opt/tritonserver/backends/onnxruntime/libmemkind.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 4153576] openat(AT_FDCWD, "/usr/local/cuda/compat/lib/libmemkind.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 4153576] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 5
```
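Note that ENOENT results like these can be benign on their own: the dynamic loader probes several search paths before falling back to /etc/ld.so.cache. One way to check whether the packed environment is genuinely missing shared-library dependencies is to unpack it and run ldd over its native libraries; the paths below are illustrative, not taken from the original report:

```bash
# Unpack the environment roughly the way the backend stub does.
mkdir -p /tmp/env_check && tar -xzf torch.tar.gz -C /tmp/env_check

# Report unresolved shared-library dependencies of the packed PyTorch libs,
# resolving against the env's own lib directory as the stub's launch line does.
for lib in /tmp/env_check/lib/python3.8/site-packages/torch/lib/*.so; do
  LD_LIBRARY_PATH=/tmp/env_check/lib ldd "$lib" | grep "not found" \
    && echo "missing deps in: $lib"
done
```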

Does anyone have a simple example of how to build a conda runtime that will work?
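One sanity check that may help here (assuming the tarball was unpacked as in the sketch above) is to activate the unpacked tree directly and attempt the same import, mirroring the stub launch line from the log, so a hang or error reproduces outside Triton:

```bash
# Mirror how Triton launches the stub: activate the env, extend the
# library path, then run the import that hangs inside the server.
source /tmp/env_check/bin/activate
export LD_LIBRARY_PATH=/tmp/env_check/lib:$LD_LIBRARY_PATH
python -c "import torch; print(torch.__version__)"
```

The python_backend README also requires the environment's Python version to match the version used by the backend stub in the container, which for 22.04 is Python 3.8, matching the environment created above.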

pionex avatar Sep 22 '22 02:09 pionex

Closing this issue due to inactivity. Please re-open it if you would like to follow up.

jbkyang-nvi avatar Nov 22 '22 03:11 jbkyang-nvi