Tritonserver hangs on launch with python backend
Description
I am trying to use Triton Inference Server with CPU-only models. The server launches fine when the model repository contains only ONNX models, but the moment I include a Python backend model it hangs on launch indefinitely.
I am using an Apple M2 Mac.
It is worth noting that the model runs when I use the SageMaker Triton Server image on a SageMaker multi-model endpoint.
Triton Information
Version: 23.02, although I have also tried 24.04.
Are you using the Triton container or did you build it yourself?
Container. Specifically nvcr.io/nvidia/tritonserver:23.02-py3.
To Reproduce
- Pull nvcr.io/nvidia/tritonserver:23.02-py3
- Create a model repository with a Python backend model.
- Launch the Docker container without starting the server:
docker run -it -p8000:8000 -p8001:8001 -p8002:8002 -v/Users/jamesbower/Projects/triton-local/model_repository:/models nvcr.io/nvidia/tritonserver:23.02-py3 /bin/bash
- Install the Python packages as system packages:
pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu
pip install --no-cache-dir numpy
- cd to where the models directory is located.
- Run:
tritonserver --model-repository models/
Output
The following is displayed:
W0524 08:06:50.694589 82 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: no CUDA-capable device is detected
I0524 08:06:50.694772 82 cuda_memory_manager.cc:115] CUDA memory pool disabled
I0524 08:06:50.705575 82 model_lifecycle.cc:459] loading: baai_quant_onnx:1
I0524 08:06:50.706361 82 model_lifecycle.cc:459] loading: forced_alignment:1
I0524 08:06:50.707123 82 model_lifecycle.cc:459] loading: titanet_small_onnx:1
I0524 08:06:50.707594 82 onnxruntime.cc:2459] TRITONBACKEND_Initialize: onnxruntime
I0524 08:06:50.707624 82 onnxruntime.cc:2469] Triton TRITONBACKEND API version: 1.11
I0524 08:06:50.707628 82 onnxruntime.cc:2475] 'onnxruntime' TRITONBACKEND API version: 1.11
I0524 08:06:50.707632 82 onnxruntime.cc:2505] backend configuration:
{"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I0524 08:06:50.718814 82 onnxruntime.cc:2563] TRITONBACKEND_ModelInitialize: baai_quant_onnx (version 1)
I0524 08:06:50.719420 82 onnxruntime.cc:666] skipping model configuration auto-complete for 'baai_quant_onnx': inputs and outputs already specified
It just hangs here indefinitely.
Expected behavior
The Triton server launches completely, such that curl -v localhost:8000/v2/health/ready receives a 200 status response.
Model Repository Setup
The structure of the model repository is:
model_repository/
|
|--baai_quant_onnx
| |--1
| | |--model.onnx
| | |--labels.json
| | |--wav2vec2_asr_base_960h.pt
| |--config.pbtxt
|
|--titanet_small_onnx
| |--1
| | |--model.onnx
| |--config.pbtxt
|
|--forced_alignment
| |--1
| | |--model.py
| |--config.pbtxt
I am not using a conda-packed execution environment, since I install the required packages in the container after launching it. I have also tried with a conda-packed environment, which is the method I used with the SageMaker Triton Server image.
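For reference, when a conda-packed environment is used, it is pointed to from the model's config.pbtxt via the EXECUTION_ENV_PATH parameter. A minimal sketch (assuming the packed environment is saved as triton_env.tar.gz inside the model directory):
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/triton_env.tar.gz"}
}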
The model.py file is:
import triton_python_backend_utils as pb_utils
import numpy as np
import json
import torch
import re
import os
from dataclasses import dataclass
class TritonPythonModel:
def initialize(self, args):
self.word_output_type = pb_utils.triton_string_to_numpy(
pb_utils.get_output_config_by_name(json.loads(args["model_config"]), "word")["data_type"]
)
self.start_time_output_type = pb_utils.triton_string_to_numpy(
pb_utils.get_output_config_by_name(json.loads(args["model_config"]), "start_time")["data_type"]
)
self.end_time_output_type = pb_utils.triton_string_to_numpy(
pb_utils.get_output_config_by_name(json.loads(args["model_config"]), "end_time")["data_type"]
)
model_repository = args["model_repository"]
wav2vec2_path = os.path.join(model_repository,"1","wav2vec2_asr_base_960h.pt")
labels_path = os.path.join(model_repository,"1","labels.json")
self.wav2vecmodel = torch.jit.load(wav2vec2_path).eval()
with open(labels_path, "r") as f:
self.labels = tuple(json.load(f))
def execute(self, requests):
responses = []
for request in requests:
transcription = pb_utils.get_input_tensor_by_name(request, "transcription").as_numpy().squeeze(1).astype(str)
audio = pb_utils.get_input_tensor_by_name(request, "audio").as_numpy()
# Optional preprocessing code for inputs in standard Python...
transcription = transcription.tolist()[0]
transcript = convert_to_transcript(transcription)
with torch.inference_mode():
waveform = torch.tensor(audio, dtype=torch.float32)
emissions = calculate_emissions(self.wav2vecmodel, waveform)
emissions = emissions[0].cpu().detach().numpy()
waveform = waveform.cpu().detach().numpy()
dictionary = {c: i for i, c in enumerate(self.labels)}
tokens = [dictionary[c] for c in transcript]
ratio = waveform.shape[1] / emissions.shape[0]
trellis = get_trellis_numba(emissions, np.array(tokens))
path = backtrack_numba(trellis, emissions, tokens)
path = [Point(*p) for p in path]
segments = merge_repeats(path, transcript, ratio)
word_segments = merge_words(segments)
words = []
start_times = []
end_times = []
for segment in word_segments:
words.append(segment.label)
start_times.append(segment.start_time)
end_times.append(segment.end_time)
words = np.array([words]).astype(self.word_output_type)
start_times = np.array([start_times]).astype(self.start_time_output_type)
end_times = np.array([end_times]).astype(self.end_time_output_type)
output_tensor_words = pb_utils.Tensor("word", words)
output_tensor_start_times = pb_utils.Tensor("start_time", start_times)
output_tensor_end_times = pb_utils.Tensor("end_time", end_times)
response = pb_utils.InferenceResponse(
output_tensors=[output_tensor_words, output_tensor_start_times, output_tensor_end_times]
)
responses.append(response)
return responses
# Any cleanup code to be used when the model is unloaded. Not completely sure of the degree to which this is required currently.
def finalize(self):
print("Finalizing model...")
def convert_to_transcript(text: str):
text = text.upper().strip()
text = re.sub(r"[^A-Z0-9\s]", "", text)
text = "|" + re.sub(r"\s+", "|", text) + "|"
return text
def calculate_emissions(wav2vecmodel, waveform) -> torch.Tensor:
emissions, _ = wav2vecmodel(waveform)
emissions = torch.log_softmax(emissions, dim=-1)
return emissions
def get_trellis_numba(emission, tokens, blank_id=0):
num_frame = emission.shape[0]
num_tokens = len(tokens)
trellis = np.zeros((num_frame, num_tokens))
trellis[1:,0] = np.cumsum(emission[1:, blank_id])
trellis[0, 1:] = -np.inf
trellis[-num_tokens + 1:, 0] = np.inf
for t in range(num_frame - 1):
trellis[t + 1, 1:] = np.maximum(
# Score for staying at the same token
trellis[t, 1:] + emission[t, blank_id],
# Score for changing to the next token
trellis[t, :-1] + emission[t, tokens[1:]],
)
return trellis
def backtrack_numba(trellis, emission, tokens, blank_id=0):
t, j = trellis.shape[0] - 1, trellis.shape[1] - 1
path = [(j, t, np.exp(emission[t, blank_id]))]
while j > 0:
assert t > 0 # Should not happen but just in case
# 1. Figure out if the current position was stay or change
# Frame-wise score of stay vs change
p_stay = emission[t - 1, blank_id]
p_change = emission[t - 1, tokens[j]]
# Context-aware score for stay vs change
stayed = trellis[t - 1, j] + p_stay
changed = trellis[t - 1, j - 1] + p_change
# Update position
t -= 1
if changed > stayed:
j -= 1
# Store the path with frame-wise probability.
prob = np.exp(p_change if changed > stayed else p_stay)
path.append((j, t, prob))
# Now j == 0, which means it reached the SoS (Start of Sequence).
# Fill up the rest for the sake of visualization
while t > 0:
prob = np.exp(emission[t - 1, blank_id])
path.append((j, t - 1, prob))
t -= 1
return path[::-1]
@dataclass
class Point:
token_index: int
time_index: int
score: float
@dataclass
class Segment:
label: str
start: int
end: int
score: float
start_time: float = 0.0
end_time: float = 0.0
def __repr__(self):
return f"{self.label}\t({self.score:4.2f}): [{self.start:5d}, {self.end:5d}), {self.start_time:.2f}s"
@property
def length(self):
return self.end - self.start
def json(self):
return {
"label": self.label,
"start_time": self.start_time,
"end_time": self.end_time,
}
def merge_repeats(path, transcript, ratio):
i1, i2 = 0, 0
segments = []
while i1 < len(path):
while i2 < len(path) and path[i1].token_index == path[i2].token_index:
i2 += 1
score = sum(path[k].score for k in range(i1, i2)) / (i2 - i1)
segments.append(
Segment(
transcript[path[i1].token_index],
path[i1].time_index,
path[i2 - 1].time_index + 1,
score,
ratio * path[i1].time_index / 16_000,
(ratio * path[i2 - 1].time_index + 1) / 16_000
)
)
i1 = i2
return segments
def merge_words(segments, separator="|"):
words = []
i1, i2 = 0, 0
while i1 < len(segments):
if i2 >= len(segments) or segments[i2].label == separator:
if i1 != i2:
segs = segments[i1:i2]
word = "".join([seg.label for seg in segs])
score = sum(seg.score * seg.length for seg in segs) / sum(seg.length for seg in segs)
words.append(Segment(word, segments[i1].start, segments[i2 - 1].end, score, segments[i1].start_time,
segments[i2 - 1].end_time))
i1 = i2 + 1
i2 = i1
else:
i2 += 1
return words
The config.pbtxt is:
name: "forced_alignment"
backend: "python"
max_batch_size: 1
input [
{
name: "transcription"
data_type: TYPE_STRING
dims: [ 1 ]
},
{
name: "audio"
data_type: TYPE_FP32
dims: [ -1 ]
}
]
output [
{
name: "word"
data_type: TYPE_STRING
dims: [ -1 ]
},
{
name: "start_time"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "end_time"
data_type: TYPE_FP32
dims: [ -1 ]
}
]
instance_group {
count: 1
kind: KIND_CPU
}
Execution env is not set as I install the required packages in the container.
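For reference, here is a minimal client-side sketch of how the forced_alignment inputs and outputs above could be exercised once the server is ready (assuming the tritonclient package is installed; the transcription text and audio array below are placeholders):
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Shapes include the leading batch dimension required by max_batch_size: 1.
transcription = np.array([["hello world"]], dtype=object)
audio = np.zeros((1, 16000), dtype=np.float32)

inputs = [
    httpclient.InferInput("transcription", list(transcription.shape), "BYTES"),
    httpclient.InferInput("audio", list(audio.shape), "FP32"),
]
inputs[0].set_data_from_numpy(transcription)
inputs[1].set_data_from_numpy(audio)

outputs = [
    httpclient.InferRequestedOutput("word"),
    httpclient.InferRequestedOutput("start_time"),
    httpclient.InferRequestedOutput("end_time"),
]

result = client.infer("forced_alignment", inputs, outputs=outputs)
print(result.as_numpy("word"))
print(result.as_numpy("start_time"))
print(result.as_numpy("end_time"))
The BYTES datatype here corresponds to TYPE_STRING in the config above.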
Hi @JamesBowerXanda, Triton doesn't officially support Mac, but I assume it would work if you are only running CPU-only models. I couldn't reproduce the hang on a Linux machine. Since I don't have the wav2vec2_asr_base_960h.pt and labels.json files, I replaced wav2vec2_asr_base_960h.pt with another model.pt and removed the line for labels.json; Triton did not hang on my side. Could you run the server with --log-verbose=1 and see if there's any error reported in the log?
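For example, with the same launch command as above:
tritonserver --model-repository models/ --log-verbose=1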
I also noticed that the paths for those two files might be incorrect:
model_repository = args["model_repository"]
wav2vec2_path = os.path.join(model_repository,"1","wav2vec2_asr_base_960h.pt")
labels_path = os.path.join(model_repository,"1","labels.json")
args["model_repository"] will return model_repository/forced_alignment, while the wav2vec2_asr_base_960h.pt and labels.json files are under model_repository/baai_quant_onnx/1/.
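If the two files are intended to stay under baai_quant_onnx/1/ as in the tree above, one way to build the paths is to step up from the model's own directory. A sketch (the helper name is hypothetical, and this may be unrelated to the hang):
import os

def resolve_asset_paths(args):
    # Hypothetical helper: args is the dict passed to TritonPythonModel.initialize().
    # args["model_repository"] points at .../model_repository/forced_alignment,
    # so step up one level to reach the sibling baai_quant_onnx/1 directory.
    repo_root = os.path.dirname(args["model_repository"])
    asset_dir = os.path.join(repo_root, "baai_quant_onnx", "1")
    return (
        os.path.join(asset_dir, "wav2vec2_asr_base_960h.pt"),
        os.path.join(asset_dir, "labels.json"),
    )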
Closing due to lack of activity. Please re-open the issue if you would like to follow up with this issue.