
Casting a NumPy string array to pb_utils.Tensor disproportionately increases latency

Open LLautenbacher opened this issue 1 year ago • 5 comments

Description: Casting a NumPy string array to pb_utils.Tensor in the Python backend causes a disproportionate increase in latency (~300x).

Triton Information: nvcr.io/nvidia/tritonserver:23.05-py3. This also still happens in 24.03.

To Reproduce: With the model and config below I get a latency of 9873 usec with perf_analyzer. Uncommenting the line pb_utils.Tensor("annotation", arr_s) increases the latency to 2888440 usec. Creating the NumPy array doesn't seem to matter; only casting it to a tensor causes the slowdown.

model.py

import triton_python_backend_utils as pb_utils
import numpy as np
import json


class TritonPythonModel:
    def initialize(self, args):
        self.model_config = json.loads(args["model_config"])
        output0_config = pb_utils.get_output_config_by_name(
            self.model_config, "annotation"
        )
        self.output_dtype = pb_utils.triton_string_to_numpy(output0_config["data_type"])

    def execute(self, requests):
        responses = []
        for request in requests:
            batchsize = (
                pb_utils.get_input_tensor_by_name(request, "input0").as_numpy().shape[0]
            )
            # Fixed-width 5-byte string array and a float64 array of the same shape.
            arr_s = np.empty((batchsize, 256), dtype=np.dtype("S5"))
            arr_f = np.empty((batchsize, 256), dtype=np.dtype("float64"))
            # Uncommenting the next line triggers the ~300x slowdown:
            # pb_utils.Tensor("annotation", arr_s)
            t = pb_utils.Tensor("annotation", arr_f)
            responses.append(pb_utils.InferenceResponse(output_tensors=[t]))
        return responses

    def finalize(self):
        pass

max_batch_size: 1000
input [
  {
    name: "input0"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
output [
  {
    name: "annotation"
    data_type: TYPE_FP64
    dims: [ 174 ]
  }
]

Expected behavior: Returning a string array shouldn't take ~300x as long as returning a float array.

LLautenbacher avatar Apr 24 '24 15:04 LLautenbacher

Hi @LLautenbacher, thanks for raising this issue with such detail.

@Tabrizian @krishung5 may be able to chime in here.

Is it possible this commented line is causing an extra copy? Also, can you elaborate on this datatype np.dtype("S5")? Is it required, and do you see different behavior if you use something like np.object_ instead?

rmccorm4 avatar May 01 '24 00:05 rmccorm4

Thank you for looking into this!

The specific string datatype is not relevant; U, S, and O all show this behaviour.

LLautenbacher avatar May 01 '24 14:05 LLautenbacher

I think I figured out what causes this. Serialization/deserialization with dynamic byte lengths in serialize_byte_tensor and deserialize_bytes_tensor is much slower than using np.ndarray.tobytes and np.frombuffer. In my use case of many short (<5 character) strings (array shape (1000, 174)), this takes ~8000x as much time when you consider both encoding and decoding.
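
Roughly what I mean, as a standalone sketch (the 4-byte length prefix per element is only meant to mimic the layout serialize_byte_tensor produces; exact timings will of course vary):

# Illustrative comparison of per-element length-prefixed (de)serialization
# versus a single fixed-width copy for many short strings.
import struct
import time

import numpy as np

arr = np.full((1000, 174), b"abcde", dtype="S5")

# Dynamic-length path: a Python-level loop with one struct.pack per element.
start = time.perf_counter()
dynamic = b"".join(struct.pack("<I", len(s)) + s for s in arr.flat)
print("length-prefixed serialize:  ", time.perf_counter() - start)

start = time.perf_counter()
out, offset = [], 0
while offset < len(dynamic):
    (n,) = struct.unpack_from("<I", dynamic, offset)
    out.append(dynamic[offset + 4 : offset + 4 + n])
    offset += 4 + n
print("length-prefixed deserialize:", time.perf_counter() - start)

# Fixed-width path: a single contiguous copy each way.
start = time.perf_counter()
fixed = arr.tobytes()
print("tobytes serialize:          ", time.perf_counter() - start)

start = time.perf_counter()
back = np.frombuffer(fixed, dtype="S5").reshape(arr.shape)
print("frombuffer deserialize:     ", time.perf_counter() - start)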

Is it possible to support fixed-length string datatypes in Triton? That would massively speed up inference for me!

LLautenbacher avatar May 09 '24 17:05 LLautenbacher

Unfortunately, Triton has its own serialization/deserialization for TYPE_BYTES tensors, which is likely why you're observing the slowdown.

Is it possible to use TYPE_UINT8 if you just want to transfer arbitrary byte blobs? That shouldn't have the serialization/deserialization overhead associated with TYPE_BYTES.
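
As a rough sketch (assuming the strings stay fixed-width S5 and the config is changed so the "annotation" output is TYPE_UINT8 with variable dims), the execute() body from above could pack them like this:

# Sketch only: expose the fixed-width S5 strings as raw bytes via a TYPE_UINT8 output.
arr_s = np.empty((batchsize, 256), dtype=np.dtype("S5"))
# Viewing S5 as uint8 reinterprets the buffer with no copy, turning
# shape (batchsize, 256) into (batchsize, 256 * 5).
arr_u8 = arr_s.view(np.uint8)
t = pb_utils.Tensor("annotation", arr_u8)
responses.append(pb_utils.InferenceResponse(output_tensors=[t]))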

Tabrizian avatar May 09 '24 17:05 Tabrizian

That would theoretically be possible, but for our use case the overhead on the client side makes it infeasible. One of the main draws of Triton for us is how easy it is to interface with, irrespective of the client. If every client needs to know which arrays of which models are actually string types, it would lose a lot of that appeal.
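
For example, with the UINT8 workaround every client would need an extra, model-specific decode step along these lines (a sketch using the Python HTTP client; the model name and the fixed width of 5 bytes are assumptions):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
inp = httpclient.InferInput("input0", [1000, 1], "INT32")
inp.set_data_from_numpy(np.zeros((1000, 1), dtype=np.int32))
result = client.infer(model_name="my_model", inputs=[inp])

# With TYPE_BYTES the client gets a ready-to-use string array from as_numpy().
# With the TYPE_UINT8 workaround it has to know the fixed width and
# reinterpret the raw bytes itself:
raw = result.as_numpy("annotation")   # uint8, shape (1000, 256 * 5)
strings = raw.view("S5")              # back to shape (1000, 256)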

LLautenbacher avatar May 09 '24 17:05 LLautenbacher