
Triton 22.03-py3 stuck at TRITONBACKEND_ModelInstanceInitialize on older Ubuntu 18.04

Open gandharv-kapoor opened this issue 3 years ago • 5 comments

Description: Triton 22.03-py3 gets stuck at TRITONBACKEND_ModelInstanceInitialize on older Ubuntu 18.04. This occurred when I was testing the Triton ONNX backend on Azure VMs, which still don't support the latest Ubuntu. Switching to Triton version 20.04-py3 resolved my issue.



=============================
Mon, May 30 2022 7:56:26 pm | == Triton Inference Server ==
Mon, May 30 2022 7:56:26 pm | =============================
Mon, May 30 2022 7:56:26 pm |  
Mon, May 30 2022 7:56:26 pm | NVIDIA Release 22.05 (build 38317651)
Mon, May 30 2022 7:56:26 pm | Triton Server Version 2.22.0
Mon, May 30 2022 7:56:26 pm |  
Mon, May 30 2022 7:56:26 pm | Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Mon, May 30 2022 7:56:26 pm |  
Mon, May 30 2022 7:56:26 pm | Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Mon, May 30 2022 7:56:26 pm |  
Mon, May 30 2022 7:56:26 pm | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
Mon, May 30 2022 7:56:26 pm | By pulling and using the container, you accept the terms and conditions of this license:
Mon, May 30 2022 7:56:26 pm | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
Mon, May 30 2022 7:56:26 pm |  
Mon, May 30 2022 7:56:26 pm | WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Mon, May 30 2022 7:56:26 pm | Use the NVIDIA Container Toolkit to start this container with GPU support; see
Mon, May 30 2022 7:56:26 pm | https://docs.nvidia.com/datacenter/cloud-native/ .
Mon, May 30 2022 7:56:26 pm |  
Mon, May 30 2022 7:56:26 pm | W0531 02:56:26.616090 1 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.616205 1 cuda_memory_manager.cc:115] CUDA memory pool disabled
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.616952 1 model_config_utils.cc:645] Server side auto-completed config: name: "onnx-model"
Mon, May 30 2022 7:56:26 pm | platform: "onnxruntime_onnx"
Mon, May 30 2022 7:56:26 pm | input {
Mon, May 30 2022 7:56:26 pm | name: "images"
Mon, May 30 2022 7:56:26 pm | data_type: TYPE_FP32
Mon, May 30 2022 7:56:26 pm | dims: -1
Mon, May 30 2022 7:56:26 pm | dims: -1
Mon, May 30 2022 7:56:26 pm | dims: -1
Mon, May 30 2022 7:56:26 pm | }
Mon, May 30 2022 7:56:26 pm | output {
Mon, May 30 2022 7:56:26 pm | name: "boxes"
Mon, May 30 2022 7:56:26 pm | data_type: TYPE_FP32
Mon, May 30 2022 7:56:26 pm | dims: -1
Mon, May 30 2022 7:56:26 pm | dims: 4
Mon, May 30 2022 7:56:26 pm | }
Mon, May 30 2022 7:56:26 pm | output {
Mon, May 30 2022 7:56:26 pm | name: "labels"
Mon, May 30 2022 7:56:26 pm | data_type: TYPE_INT64
Mon, May 30 2022 7:56:26 pm | dims: -1
Mon, May 30 2022 7:56:26 pm | }
Mon, May 30 2022 7:56:26 pm | output {
Mon, May 30 2022 7:56:26 pm | name: "scores"
Mon, May 30 2022 7:56:26 pm | data_type: TYPE_FP32
Mon, May 30 2022 7:56:26 pm | dims: -1
Mon, May 30 2022 7:56:26 pm | }
Mon, May 30 2022 7:56:26 pm | output {
Mon, May 30 2022 7:56:26 pm | name: "caption_locations"
Mon, May 30 2022 7:56:26 pm | data_type: TYPE_INT64
Mon, May 30 2022 7:56:26 pm | dims: -1
Mon, May 30 2022 7:56:26 pm | }
Mon, May 30 2022 7:56:26 pm | default_model_filename: "model.onnx"
Mon, May 30 2022 7:56:26 pm | backend: "onnxruntime"
Mon, May 30 2022 7:56:26 pm |  
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.617010 1 model_repository_manager.cc:898] AsyncLoad() 'onnx-model'
Mon, May 30 2022 7:56:26 pm | W0531 02:56:26.617092 1 model_repository_manager.cc:315] ignore version directory 'docs' which fails to convert to integral number
Mon, May 30 2022 7:56:26 pm | W0531 02:56:26.617111 1 model_repository_manager.cc:315] ignore version directory 'test_dataset' which fails to convert to integral number
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.617123 1 model_repository_manager.cc:1136] TriggerNextAction() 'onnx-model' version 1: 1
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.617130 1 model_repository_manager.cc:1172] Load() 'onnx-model' version 1
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.617133 1 model_repository_manager.cc:1191] loading: onnx-model:1
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.717313 1 model_repository_manager.cc:1249] CreateModel() 'onnx-model' version 1
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.717399 1 backend_model.cc:292] Adding default backend config setting: default-max-batch-size,4
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.717419 1 shared_library.cc:108] OpenLibraryHandle: /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.718380 1 onnxruntime.cc:2426] TRITONBACKEND_Initialize: onnxruntime
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.718400 1 onnxruntime.cc:2436] Triton TRITONBACKEND API version: 1.9
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.718406 1 onnxruntime.cc:2442] 'onnxruntime' TRITONBACKEND API version: 1.9
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.718410 1 onnxruntime.cc:2472] backend configuration:
Mon, May 30 2022 7:56:26 pm | {"cmdline":{"auto-complete-config":"false","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.728513 1 onnxruntime.cc:2507] TRITONBACKEND_ModelInitialize: onnx-model (version 1)
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729581 1 model_config_utils.cc:1597] ModelConfig 64-bit fields:
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729597 1 model_config_utils.cc:1599] ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729600 1 model_config_utils.cc:1599] ModelConfig::dynamic_batching::max_queue_delay_microseconds
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729603 1 model_config_utils.cc:1599] ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729606 1 model_config_utils.cc:1599] ModelConfig::ensemble_scheduling::step::model_version
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729609 1 model_config_utils.cc:1599] ModelConfig::input::dims
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729611 1 model_config_utils.cc:1599] ModelConfig::input::reshape::shape
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729616 1 model_config_utils.cc:1599] ModelConfig::instance_group::secondary_devices::device_id
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729620 1 model_config_utils.cc:1599] ModelConfig::model_warmup::inputs::value::dims
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729624 1 model_config_utils.cc:1599] ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729627 1 model_config_utils.cc:1599] ModelConfig::optimization::cuda::graph_spec::input::value::dim
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729631 1 model_config_utils.cc:1599] ModelConfig::output::dims
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729636 1 model_config_utils.cc:1599] ModelConfig::output::reshape::shape
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729640 1 model_config_utils.cc:1599] ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729644 1 model_config_utils.cc:1599] ModelConfig::sequence_batching::max_sequence_idle_microseconds
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729648 1 model_config_utils.cc:1599] ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729651 1 model_config_utils.cc:1599] ModelConfig::sequence_batching::state::dims
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729653 1 model_config_utils.cc:1599] ModelConfig::sequence_batching::state::initial_state::dims
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729656 1 model_config_utils.cc:1599] ModelConfig::version_policy::specific::versions
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729795 1 onnxruntime.cc:2550] TRITONBACKEND_ModelInstanceInitialize: onnx-model (CPU device 0)
Mon, May 30 2022 7:56:26 pm | I0531 02:56:26.729808 1 backend_model_instance.cc:68] Creating instance onnx-model on CPU using artifact 'model.onnx'
Mon, May 30 2022 7:56:26 pm | 2022-05-31 02:56:26.730251261 [I:onnxruntime:, inference_session.cc:324 operator()] Flush-to-zero and denormal-as-zero are off
Mon, May 30 2022 7:56:26 pm | 2022-05-31 02:56:26.730269362 [I:onnxruntime:, inference_session.cc:331 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
Mon, May 30 2022 7:56:26 pm | 2022-05-31 02:56:26.730277162 [I:onnxruntime:, inference_session.cc:351 ConstructorCommon] Dynamic block base set to 0

pod.yaml:

- image: cloudopscontainerregistrydevglobal.azurecr.io/tab-detection:505921
    imagePullPolicy: IfNotPresent
    name: tab-detection
    resources:
      limits:
        cpu: "5"
        memory: 5Gi
      requests:
        cpu: "5"
        memory: 5Gi
    securityContext:
      privileged: true
      runAsGroup: 18003
      runAsUser: 18002
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-8768r
      readOnly: true

Triton Information: Triton 22.03-py3

To Reproduce: Use the latest Triton release on Ubuntu 18.04.

Expected behavior: A proper failure in the logs, plus clear guidelines on what is supported and on plans for backward compatibility.

Going forward, I would like to understand what support will look like for different versions of Ubuntu, especially in cases where important things like security patches are needed.

gandharv-kapoor avatar Jun 09 '22 09:06 gandharv-kapoor

I don't see the reported error in your logs, which appear to be from 22.05. Your post lists 22.03 and 20.04.

In any case, the support matrix is available here: https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel_22-05.html#rel_22-05. Ubuntu 20.04 has been supported since Triton 20.12. We test against the supported stack and we rarely respin older containers (except for critical Triton patches).

dyastremsky avatar Jun 13 '22 16:06 dyastremsky

Are you building Triton from source? The newer Triton Docker images are built against Ubuntu 20.04, so they are not likely to work directly on Ubuntu 18.04. Please check out this documentation for building against other platforms.

GuanLuo avatar Jun 15 '22 00:06 GuanLuo

We are using nvcr.io/nvidia/tritonserver:22.03-py3. The question is: will there be continued support for Ubuntu 18.04? If not, is the recommendation to build our own image based on the "unsupported platform" section?

mikelam92 avatar Jun 15 '22 20:06 mikelam92

We no longer build and test against Ubuntu 18.04, so you will need to build from source for it. You may want to try build.py first to check whether it builds successfully, but the chance is low, as some of the packages are named differently on Ubuntu 18.04. If that is the case, then yes, the recommendation is to build from source following the "unsupported platform" section.
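For reference, a build-from-source attempt on the host typically looks like the sketch below. This is an illustrative invocation only: the branch name, flag set, and backend choice are assumptions based on the r22.03-era build documentation, and you would adjust them for your setup.

```shell
# Sketch: attempt a non-containerized Triton build on the host itself.
# Assumption: branch r22.03 and these build.py flags match the release you need.
git clone -b r22.03 https://github.com/triton-inference-server/server.git
cd server
python3 ./build.py \
    --no-container-build \
    --build-dir="$(pwd)/build" \
    --enable-logging --enable-stats \
    --endpoint=http --endpoint=grpc \
    --backend=onnxruntime
```

If this fails on 18.04 because of differently named system packages, the "unsupported platform" path (patching package names and dependencies yourself) is the fallback.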

I am also curious whether the Azure VM can run a Docker container. If it can, then you should be able to run the Docker image directly regardless of the Ubuntu version on the host system.
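This is worth stressing: the container ships its own Ubuntu 20.04 userspace and only shares the host's kernel, so the host distro version should not matter in the usual case. A quick sketch for checking what the host actually provides to the container:

```shell
# The container brings its own userspace; only the host kernel is shared.
uname -r                                  # kernel version the container will run on
. /etc/os-release && echo "$PRETTY_NAME"  # host distro (not visible inside the container)
```

If a container-based deployment still hangs, the kernel version (and features like newer futex/clone3 syscalls used by a newer glibc inside the image) is the more likely suspect than the host's Ubuntu release.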

GuanLuo avatar Jun 16 '22 22:06 GuanLuo

> I don't see the reported error in your logs, which seem to be of 22.05. Your post lists 22.03 and 20.04.
>
> In any case, the support matrix is available here: https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel_22-05.html#rel_22-05. Ubuntu 20.04 has been supported since Triton 20.12. We test against the supported stack and we rarely respin older containers (except for critical Triton patches).

The logs might be from 22.05, but the case is the same with any version newer than 20.04-py3 on Ubuntu 18.04. Also, I mentioned that there was no error in this scenario; on taking a backtrace, we realized it was stuck in the pthread library. Going forward, I would like to understand what support will look like for different versions of Ubuntu, especially in cases where important things like security patches are needed.
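For anyone hitting a similar silent hang: short of attaching gdb, each thread's kernel wait channel in /proc can hint at where the process is blocked (futex_* entries typically mean the thread is waiting on a pthread mutex or condition variable, consistent with the backtrace described above). A minimal sketch, assuming tritonserver runs as PID 1 inside the container:

```shell
# Hypothetical PID: tritonserver is usually PID 1 inside the container.
pid=1
# Print each thread ID with its kernel wait channel from /proc.
for t in /proc/"$pid"/task/*; do
  printf '%s: %s\n' "$(basename "$t")" "$(cat "$t"/wchan 2>/dev/null)"
done
```

This needs no extra tooling in the image; for full userspace stacks you would still attach gdb (`gdb -p <pid> -batch -ex "thread apply all bt"`) if it is installed.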

gandharv-kapoor avatar Jun 17 '22 00:06 gandharv-kapoor

We do security scans and modify/upgrade components as necessary. As future major versions of Ubuntu become stable, secure, and widely used, we'll likely move to them. Those moves don't happen often, as you can see from the matrix (18.04 supported for over a year, followed by 20.04 support for 1.5 years so far). You can always find what's supported in each release in the support matrix.

dyastremsky avatar Feb 23 '23 17:02 dyastremsky