
Python backend stuck at TRITONBACKEND_ModelInstanceInitialize

Open huangyz0918 opened this issue 3 years ago • 11 comments

Description: I want to start the Python backend following the example, but the container gets stuck at

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 22.04 (build 36821869)
Triton Server Version 2.21.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

WARNING: [Torch-TensorRT] - Unable to read CUDA capable devices. Return status: 35
I0504 12:25:52.440894 1 libtorch.cc:1381] TRITONBACKEND_Initialize: pytorch
I0504 12:25:52.441090 1 libtorch.cc:1391] Triton TRITONBACKEND API version: 1.9
I0504 12:25:52.441100 1 libtorch.cc:1397] 'pytorch' TRITONBACKEND API version: 1.9
W0504 12:25:52.441171 1 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
I0504 12:25:52.441209 1 cuda_memory_manager.cc:115] CUDA memory pool disabled
I0504 12:25:52.442350 1 model_repository_manager.cc:1077] loading: resnet:1
I0504 12:25:52.547089 1 python.cc:1769] Using Python execution env /models/resnet/../my-pytorch.tar.gz
I0504 12:25:52.547228 1 python.cc:2054] TRITONBACKEND_ModelInstanceInitialize: resnet_0 (CPU device 0)

My machine does not have a GPU. The config:

name: "resnet"
backend: "python"

input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ -1, 3, 224, 224 ]
  }
]

output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ -1, 1000 ]
  }
]

instance_group [{
  count: 1
  kind: KIND_CPU
}]

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/../my-pytorch.tar.gz"}
}
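
For context, the model repository layout this config assumes (inferred from the volume mounts used later in this thread; the 1/model.py entry is the standard Python backend layout) is roughly:

/models
├── my-pytorch.tar.gz
└── resnet
    ├── config.pbtxt
    └── 1
        └── model.py

$$TRITON_MODEL_DIRECTORY resolves to /models/resnet, so ../my-pytorch.tar.gz points at the packed execution environment mounted next to it, matching the "Using Python execution env /models/resnet/../my-pytorch.tar.gz" log line above.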

Triton Information: nvcr.io/nvidia/tritonserver:22.04-pyt-python-py3

Are you using the Triton container or did you build it yourself? Docker container.

huangyz0918 avatar May 04 '22 12:05 huangyz0918

I found the reason: when you start the Docker container, you need to pass --shm-size=1g --ulimit memlock=-1 to increase the shared memory available to the container. The container is supposed to throw an error rather than get stuck after these log lines.
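
For reference, one quick way to confirm how much shared memory a running container actually has is to check /dev/shm from inside it, e.g.:

docker exec <container-id> df -h /dev/shm

Docker's default /dev/shm size is only 64 MB, which is generally too small for the Python backend's shared-memory region, hence the --shm-size=1g recommendation above.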

huangyz0918 avatar May 04 '22 12:05 huangyz0918

@huangyz0918 is this the full log/output? Can you share the full exact docker run ... command and the full tritonserver --model-repository ... command you ran to reproduce the issue?

CC @Tabrizian

rmccorm4 avatar May 04 '22 18:05 rmccorm4

The original Docker script; I ran this on an AWS EC2 instance that has only CPUs:

#!/usr/bin/env bash
docker run --rm -p8900:8000 -p8901:8001 -p8902:8002 \
  -v${PWD}/../example/pytorch/resnet:/models/resnet/ \
  -v${PWD}/../example/pytorch/my-pytorch.tar.gz:/models/my-pytorch.tar.gz \
  nvcr.io/nvidia/tritonserver:22.04-pyt-python-py3 tritonserver --model-repository=/models

After changing it to the following, the server works:

#!/usr/bin/env bash
docker run --rm -p8900:8000 -p8901:8001 -p8902:8002 \
  --shm-size=1g --ulimit memlock=-1 \
  -v${PWD}/../example/pytorch/resnet:/models/resnet/ \
  -v${PWD}/../example/pytorch/my-pytorch.tar.gz:/models/my-pytorch.tar.gz \
  nvcr.io/nvidia/tritonserver:22.04-pyt-python-py3 tritonserver --model-repository=/models
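
As a quick sanity check (assuming the port mapping above, where host port 8900 maps to Triton's HTTP port 8000), the readiness endpoint should return 200 once the model has loaded:

curl -v localhost:8900/v2/health/ready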

huangyz0918 avatar May 06 '22 03:05 huangyz0918

@Tabrizian I thought the Python backend would return an error if there is not enough shared memory available for initialization?

GuanLuo avatar May 06 '22 21:05 GuanLuo

I tried a small example locally and it did return an error if there wasn't enough shared memory. @rmccorm4 Could you please file a ticket for this issue so that I can take a closer look? Thanks.
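
For anyone who wants to reproduce the shared-memory failure, a minimal sketch is to reuse the docker run command from earlier in this thread but deliberately shrink the shared memory well below the Python backend's default region size:

#!/usr/bin/env bash
docker run --rm -p8900:8000 -p8901:8001 -p8902:8002 \
  --shm-size=16m \
  -v${PWD}/../example/pytorch/resnet:/models/resnet/ \
  -v${PWD}/../example/pytorch/my-pytorch.tar.gz:/models/my-pytorch.tar.gz \
  nvcr.io/nvidia/tritonserver:22.04-pyt-python-py3 tritonserver --model-repository=/models

With the reported bug this hangs at TRITONBACKEND_ModelInstanceInitialize; the expected behavior is a fast failure with a shared-memory allocation error.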

Tabrizian avatar May 06 '22 21:05 Tabrizian

@Tabrizian filed DLIS-3765

rmccorm4 avatar May 06 '22 22:05 rmccorm4

We encountered the same issue: it gets stuck loading the model without returning an error. How can we properly diagnose this? Thanks.

mikelam92 avatar May 28 '22 03:05 mikelam92

@Tabrizian I'm seeing the same problem and am unable to find anything specific in the verbose logs. I even tried @huangyz0918's workaround, but that didn't help either. Can you suggest a solution or possible workaround? This is blocking us from making any progress.



=============================
== Triton Inference Server ==
=============================

NVIDIA Release 22.05 (build 38317651)
Triton Server Version 2.22.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

W0531 02:56:26.616090 1 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
I0531 02:56:26.616205 1 cuda_memory_manager.cc:115] CUDA memory pool disabled
I0531 02:56:26.616952 1 model_config_utils.cc:645] Server side auto-completed config: name: "onnx-model"
platform: "onnxruntime_onnx"
input {
  name: "images"
  data_type: TYPE_FP32
  dims: -1
  dims: -1
  dims: -1
}
output {
  name: "boxes"
  data_type: TYPE_FP32
  dims: -1
  dims: 4
}
output {
  name: "labels"
  data_type: TYPE_INT64
  dims: -1
}
output {
  name: "scores"
  data_type: TYPE_FP32
  dims: -1
}
output {
  name: "caption_locations"
  data_type: TYPE_INT64
  dims: -1
}
default_model_filename: "model.onnx"
backend: "onnxruntime"

I0531 02:56:26.617010 1 model_repository_manager.cc:898] AsyncLoad() 'onnx-model'
W0531 02:56:26.617092 1 model_repository_manager.cc:315] ignore version directory 'docs' which fails to convert to integral number
W0531 02:56:26.617111 1 model_repository_manager.cc:315] ignore version directory 'test_dataset' which fails to convert to integral number
I0531 02:56:26.617123 1 model_repository_manager.cc:1136] TriggerNextAction() 'onnx-model' version 1: 1
I0531 02:56:26.617130 1 model_repository_manager.cc:1172] Load() 'onnx-model' version 1
I0531 02:56:26.617133 1 model_repository_manager.cc:1191] loading: onnx-model:1
I0531 02:56:26.717313 1 model_repository_manager.cc:1249] CreateModel() 'onnx-model' version 1
I0531 02:56:26.717399 1 backend_model.cc:292] Adding default backend config setting: default-max-batch-size,4
I0531 02:56:26.717419 1 shared_library.cc:108] OpenLibraryHandle: /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
I0531 02:56:26.718380 1 onnxruntime.cc:2426] TRITONBACKEND_Initialize: onnxruntime
I0531 02:56:26.718400 1 onnxruntime.cc:2436] Triton TRITONBACKEND API version: 1.9
I0531 02:56:26.718406 1 onnxruntime.cc:2442] 'onnxruntime' TRITONBACKEND API version: 1.9
I0531 02:56:26.718410 1 onnxruntime.cc:2472] backend configuration:
{"cmdline":{"auto-complete-config":"false","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I0531 02:56:26.728513 1 onnxruntime.cc:2507] TRITONBACKEND_ModelInitialize: onnx-model (version 1)
I0531 02:56:26.729581 1 model_config_utils.cc:1597] ModelConfig 64-bit fields:
I0531 02:56:26.729597 1 model_config_utils.cc:1599] ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I0531 02:56:26.729600 1 model_config_utils.cc:1599] ModelConfig::dynamic_batching::max_queue_delay_microseconds
I0531 02:56:26.729603 1 model_config_utils.cc:1599] ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I0531 02:56:26.729606 1 model_config_utils.cc:1599] ModelConfig::ensemble_scheduling::step::model_version
I0531 02:56:26.729609 1 model_config_utils.cc:1599] ModelConfig::input::dims
I0531 02:56:26.729611 1 model_config_utils.cc:1599] ModelConfig::input::reshape::shape
I0531 02:56:26.729616 1 model_config_utils.cc:1599] ModelConfig::instance_group::secondary_devices::device_id
I0531 02:56:26.729620 1 model_config_utils.cc:1599] ModelConfig::model_warmup::inputs::value::dims
I0531 02:56:26.729624 1 model_config_utils.cc:1599] ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I0531 02:56:26.729627 1 model_config_utils.cc:1599] ModelConfig::optimization::cuda::graph_spec::input::value::dim
I0531 02:56:26.729631 1 model_config_utils.cc:1599] ModelConfig::output::dims
I0531 02:56:26.729636 1 model_config_utils.cc:1599] ModelConfig::output::reshape::shape
I0531 02:56:26.729640 1 model_config_utils.cc:1599] ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I0531 02:56:26.729644 1 model_config_utils.cc:1599] ModelConfig::sequence_batching::max_sequence_idle_microseconds
I0531 02:56:26.729648 1 model_config_utils.cc:1599] ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I0531 02:56:26.729651 1 model_config_utils.cc:1599] ModelConfig::sequence_batching::state::dims
I0531 02:56:26.729653 1 model_config_utils.cc:1599] ModelConfig::sequence_batching::state::initial_state::dims
I0531 02:56:26.729656 1 model_config_utils.cc:1599] ModelConfig::version_policy::specific::versions
I0531 02:56:26.729795 1 onnxruntime.cc:2550] TRITONBACKEND_ModelInstanceInitialize: onnx-model (CPU device 0)
I0531 02:56:26.729808 1 backend_model_instance.cc:68] Creating instance onnx-model on CPU using artifact 'model.onnx'
2022-05-31 02:56:26.730251261 [I:onnxruntime:, inference_session.cc:324 operator()] Flush-to-zero and denormal-as-zero are off
2022-05-31 02:56:26.730269362 [I:onnxruntime:, inference_session.cc:331 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
2022-05-31 02:56:26.730277162 [I:onnxruntime:, inference_session.cc:351 ConstructorCommon] Dynamic block base set to 0

Pod.Yaml

- image: cloudopscontainerregistrydevglobal.azurecr.io/tab-detection:505921
  imagePullPolicy: IfNotPresent
  name: tab-detection
  resources:
    limits:
      cpu: "5"
      memory: 5Gi
    requests:
      cpu: "5"
      memory: 5Gi
  securityContext:
    privileged: true
    runAsGroup: 18003
    runAsUser: 18002
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  volumeMounts:
  - mountPath: /dev/shm
    name: dshm
  - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    name: kube-api-access-8768r
    readOnly: true
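
The volumes entry backing the dshm mount is not shown in the snippet above; a common pattern for giving Triton more shared memory in Kubernetes is an in-memory emptyDir (the 1Gi sizeLimit here is only an illustrative value):

volumes:
- name: dshm
  emptyDir:
    medium: Memory
    sizeLimit: 1Gi  # illustrative; size it to the Python backend's shared-memory needs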

gandharv-kapoor avatar May 29 '22 09:05 gandharv-kapoor

Hi, guys!
Are there any updates on this issue? I'm experiencing the same problem after a custom stub build: the Python backend gets stuck at TRITONBACKEND_ModelInstanceInitialize. I'm running on Ubuntu 20.04 inside the native container and have tested several Triton server and Python versions, but in all cases it gets stuck at this step. However, when I do the same steps on macOS inside the native container, the stub loads successfully. I've tried different docker run parameters, but the result remains the same. Tested Triton versions: 21.09, 21.12, 22.03; Python versions: 3.7, 3.8, 3.9. Is there a solution or workaround for this one? Attaching the last log lines below:

I0617 16:14:31.943520 44 python.cc:1620] Using Python execution env /opt/tritonserver/backends/python/models/MyModel/miniconda3.tar.gz
I0617 16:14:31.943930 44 python.cc:1905] TRITONBACKEND_ModelInstanceInitialize: MyModel_0 (CPU device 0)
I0617 16:14:31.943966 44 backend_model_instance.cc:68] Creating instance MyModel_0 on CPU using artifact ''
I0617 16:14:43.217319 46 python.cc:1208] Starting Python backend stub: source /tmp/python_env_v5Zwly/0/bin/activate && exec env LD_LIBRARY_PATH=/tmp/python_env_v5Zwly/0/lib:$LD_LIBRARY_PATH /opt/tritonserver/backends/python/models/MyModel/triton_python_backend_stub /opt/tritonserver/backends/python/models/MyModel/1/model.py triton_python_backend_shm_region_1 67108864 67108864 44 /opt/tritonserver/backends/python 56 MyModel_0

randomguy2022 avatar Jun 23 '22 10:06 randomguy2022

@gandharv-kapoor Your issue looks like a different problem, since I don't see any logs from the Python backend; it looks like you are using an ONNX model. Could you please open a separate issue if it has not been resolved yet?

@randomguy2022 Are you compiling the same version of the stub process as the Triton version you are running? For example, if you are using the 22.05 branch of Triton, you need to make sure you are compiling the 22.05 branch of the Python backend as well.
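
For reference, building a matching stub usually follows the python_backend build instructions; a rough sketch for r22.03 (adjust the branch tags to your Triton release, and make sure the Python version used to build the stub matches the one packed in your execution environment):

git clone https://github.com/triton-inference-server/python_backend -b r22.03
cd python_backend
mkdir build && cd build
cmake -DTRITON_ENABLE_GPU=OFF \
      -DTRITON_BACKEND_REPO_TAG=r22.03 \
      -DTRITON_COMMON_REPO_TAG=r22.03 \
      -DTRITON_CORE_REPO_TAG=r22.03 \
      -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install ..
make triton-python-backend-stub

The resulting triton_python_backend_stub is then placed in the model directory, matching the path shown in the log above (/opt/tritonserver/backends/python/models/MyModel/triton_python_backend_stub).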

Tabrizian avatar Jun 24 '22 14:06 Tabrizian

@Tabrizian I've tested with version 22.03 for all packages, but the problem remains the same.

randomguy2022 avatar Jun 29 '22 15:06 randomguy2022

@randomguy2022 @kthui has fixed a number of issues related to Python backend hangs here and here. They will be included in the 23.02 release. If those fixes still don't resolve the hang, please open a new issue.

Tabrizian avatar Jan 27 '23 16:01 Tabrizian