Casting NumPy string array to pb_utils.Tensor disproportionately increases latency
Description Casting a NumPy string array to pb_utils.Tensor using the Python backend causes a disproportionate increase in latency (~300x).
Triton Information nvcr.io/nvidia/tritonserver:23.05-py3. This also still happens in 24.03.
To Reproduce
When using the model and config below, I get a latency of 9873 usec with perf_analyzer. Uncommenting the line pb_utils.Tensor("annotation", arr_s) causes the latency to increase to 2888440 usec. Creating the NumPy array doesn't seem to matter; only casting it to a tensor causes the slowdown.
model.py
import triton_python_backend_utils as pb_utils
import numpy as np
import json


class TritonPythonModel:
    def initialize(self, args):
        self.model_config = json.loads(args["model_config"])
        output0_config = pb_utils.get_output_config_by_name(
            self.model_config, "annotation"
        )
        self.output_dtype = pb_utils.triton_string_to_numpy(output0_config["data_type"])

    def execute(self, requests):
        responses = []
        for request in requests:
            batchsize = (
                pb_utils.get_input_tensor_by_name(request, "input0").as_numpy().shape[0]
            )
            arr_s = np.empty((batchsize, 174), dtype=np.dtype("S5"))
            arr_f = np.empty((batchsize, 174), dtype=np.dtype("float64"))
            # Uncommenting this line increases latency ~300x:
            # pb_utils.Tensor("annotation", arr_s)
            t = pb_utils.Tensor("annotation", arr_f)
            responses.append(pb_utils.InferenceResponse(output_tensors=[t]))
        return responses

    def finalize(self):
        pass
config.pbtxt
max_batch_size: 1000
input [
  {
    name: "input0"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
output [
  {
    name: "annotation"
    data_type: TYPE_FP64
    dims: [ 174 ]
  }
]
Expected behavior Returning a string array shouldn't take ~300x as long as returning a float array.
Hi @LLautenbacher, thanks for raising this issue with such detail.
@Tabrizian @krishung5 may be able to chime in here.
Is it possible this commented line is causing an extra copy? Also, can you elaborate on this datatype np.dtype("S5")? Is it required, and do you see different behavior if you use something like np.object_ instead?
Thank you for looking into this!
The specific string datatype is not relevant; U, S, and O dtypes all show this behaviour.
I think I figured out what causes this. The de/serialization using dynamic byte lengths in deserialize_bytes_tensor and serialize_byte_tensor is much slower than using np.frombuffer and ndarray.tobytes.
In my use case of many short (<5 character) strings in a large array (shape (1000, 174)), this takes ~8000x as much time when you consider both encoding and decoding.
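For illustration, here is a rough standalone sketch of the kind of gap I mean. This is not Triton's actual code; the length-prefix loop only mimics the dynamic-length approach of serialize_byte_tensor, and the exact numbers are machine-dependent.

import struct
import time

import numpy as np

arr = np.full((1000, 174), b"abcde", dtype="S5")

def serialize_length_prefixed(a):
    # one length prefix + copy per element, mimicking dynamic-length BYTES serialization
    return b"".join(struct.pack("<I", len(x)) + x for x in a.ravel())

start = time.perf_counter()
serialize_length_prefixed(arr)
t_dynamic = time.perf_counter() - start

start = time.perf_counter()
blob = arr.tobytes()  # single bulk copy for the fixed-width dtype
np.frombuffer(blob, dtype="S5").reshape(arr.shape)
t_fixed = time.perf_counter() - start

print(f"length-prefixed: {t_dynamic * 1e3:.3f} ms, fixed-width: {t_fixed * 1e3:.3f} ms")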
Is it possible to support fixed length string datatypes in Triton? That would massively speed up inference for me!
Unfortunately, Triton has its own serialization/deserialization for TYPE_BYTES tensors, which is likely why you're observing the slowdown.
Is it possible to use TYPE_UINT8 if you just want to transfer arbitrary byte blobs? This shouldn't have the deserialization/serialization overheads associated with TYPE_BYTES.
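For illustration, a minimal sketch of that workaround, assuming the output were declared as TYPE_UINT8 with dims [ 870 ] (174 * 5) instead of the config above; the shapes and the NumPy-view round trip are assumptions, not a drop-in change to the model:

import numpy as np

batch, width, strlen = 1000, 174, 5
arr_s = np.full((batch, width), b"abcde", dtype=f"S{strlen}")

# server side: reinterpret the fixed-width strings as raw bytes before
# wrapping the array in pb_utils.Tensor as a TYPE_UINT8 output
raw_u8 = arr_s.view(np.uint8)        # shape (batch, width * strlen)

# client side: undo the view to recover the strings
decoded = raw_u8.view(f"S{strlen}")  # shape (batch, width)
assert (decoded == arr_s).all()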
That would theoretically be possible, but for our use case, the overhead on the client side makes this infeasible. One of the main draws of Triton for us is how easy it is to interface with, irrespective of the client. If we had to tell every interfacing client which arrays in which models are actually string types, it would lose a lot of its appeal.