
Why is only the 1st 'batch' inferred?

Open · gliufetch opened this issue 1 year ago · 2 comments

I have an ensemble model.

Model 1 (Python backend) outputs 66 cropped images. I manually resized/padded them into 3 batches with shapes (30, 3, 48, 320), (30, 3, 48, 976), and (6, 3, 48, 1280). (I don't want to pad every image to width 1280, because most of the time the cropped images are much narrower than that.)
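The bucketing/padding step looks roughly like the sketch below (illustrative only, not my exact code; the bucket widths and helper names are just for the example):

```python
import numpy as np

# Illustrative bucket widths; the real thresholds depend on the crop statistics.
BUCKET_WIDTHS = [320, 976, 1280]

def pad_to_width(img, width):
    """Right-pad a (3, 48, W) crop with zeros up to the target width."""
    pad = width - img.shape[-1]
    return np.pad(img, ((0, 0), (0, 0), (0, pad)))

def make_batches(crops):
    """Group crops by the smallest bucket width that fits, then stack each group."""
    buckets = {w: [] for w in BUCKET_WIDTHS}
    for crop in crops:  # each crop has shape (3, 48, W) with W <= 1280
        width = next(w for w in BUCKET_WIDTHS if crop.shape[-1] <= w)
        buckets[width].append(pad_to_width(crop, width))
    return [np.stack(imgs).astype(np.float32) for imgs in buckets.values() if imgs]
```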

config:

```
output [
  {
    name: "detection_postprocessing_output"
    data_type: TYPE_FP32
    dims: [ -1, 3, 48, -1 ]
  }
]
```

Code (the output from model_1 is a list of pb_utils.Tensor objects rather than a single pb_utils.Tensor, since I cannot put all the images into one tensor without padding them to the same shape):

```python
out_tensor_0 = [
    pb_utils.Tensor("detection_postprocessing_output", batch.astype(np.float32))
    for batch in image_batches
]
print(f"Model 1: Number of outputs: {len(out_tensor_0)}, "
      f"Output shapes: {[tensor.as_numpy().shape for tensor in out_tensor_0]}")
inference_response = pb_utils.InferenceResponse(output_tensors=out_tensor_0)
responses.append(inference_response)
```


Model 2 is an ONNX model.

config:

name: "text_recognition" platform: "onnxruntime_onnx" max_batch_size : 0 input [ { name: "x" data_type: TYPE_FP32 dims: [-1, 3, 48, -1] } ] output [ { name: "softmax_2.tmp_0" data_type: TYPE_FP32 dims: [-1, -1, 97 ] } ]

The inference only has output for the 1st batch. Why is that? Is there a way to do dynamic batching for a list of images of different shapes without padding them to the same length? (Dynamic batching works if all the inputs are padded to the same size of 1280, but that makes the images much larger.)

If the image shapes (the outputs from model_1) are

(3, 48, 256), (3, 48, 448), (3, 48, 256), (3, 48, 224), (3, 48, 784), (3, 48, 64), (3, 48, 240), (3, 48, 192), (3, 48, 224), (3, 48, 448), (3, 48, 608), (3, 48, 384), (3, 48, 1088), (3, 48, 336), (3, 48, 912), (3, 48, 1296), (3, 48, 1360), (3, 48, 384), (3, 48, 384), (3, 48, 768), (3, 48, 576), (3, 48, 208), (3, 48, 48), (3, 48, 736),

could dynamic batching automatically batch the ones that have the same shape and run the inference on each group?

gliufetch · Jun 17 '24 14:06

> The inference only has output for the 1st batch. Why is that? Is there a way to do dynamic batching for a list of images of different shapes without padding them to the same length?

For the Python and ONNX backends, you could use ragged batching: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/ragged_batching.html

Please note that for the ONNX backend, ragged batching may require model support to understand the additional inputs needed for detecting batch boundaries.
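As a minimal sketch along the lines of that guide (RAGGED_INPUT, INDEX, and the dims are the guide's illustrative values, not your model's), the config could look like:

```
max_batch_size: 16
input [
  {
    name: "RAGGED_INPUT"
    data_type: TYPE_FP32
    dims: [ -1 ]
    allow_ragged_batch: true
  }
]
batch_input [
  {
    kind: BATCH_ACCUMULATED_ELEMENT_COUNT
    target_name: "INDEX"
    data_type: TYPE_FP32
    source_input: "RAGGED_INPUT"
  }
]
```

With BATCH_ACCUMULATED_ELEMENT_COUNT, Triton adds an INDEX tensor containing the cumulative element count per batched request, which the model can use to find each request's slice in the concatenated ragged input.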

Tabrizian · Sep 06 '24 14:09

> The inference only has output for the 1st batch. Why is that? Is there a way to do dynamic batching for a list of images of different shapes without padding them to the same length?

> For the Python and ONNX backends, you could use ragged batching: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/ragged_batching.html

> Please note that for the ONNX backend, ragged batching may require model support to understand the additional inputs needed for detecting batch boundaries.

Can you provide an example of how to process this in the Python backend? I don't know how I should handle the request. Normally I use "pb_utils.get_input_tensor_by_name" to get the input. Can I do the same to get the "INDEX" tensor when the config uses ragged batching? My model can accept 1D input. I have been researching for hours and haven't seen any example for Python. Do I process each part of the ragged array by index in a for loop? How does that make the workflow better?
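What I imagine it might look like, as a rough sketch, assuming the concatenated ragged data and the INDEX batch input both arrive on the request and can be fetched with pb_utils.get_input_tensor_by_name (I have not verified this; the names follow the config sketch above, and OUTPUT is a placeholder):

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Assumption: flattened 1D data for all batched items, concatenated
            # back to back, is delivered as the ragged input of this request.
            ragged = pb_utils.get_input_tensor_by_name(
                request, "RAGGED_INPUT").as_numpy().ravel()
            # BATCH_ACCUMULATED_ELEMENT_COUNT: one cumulative element count per item.
            index = pb_utils.get_input_tensor_by_name(
                request, "INDEX").as_numpy().astype(np.int64).ravel()

            per_item = []
            start = 0
            for end in index:  # consecutive cumulative counts delimit each item
                item = ragged[start:end]
                per_item.append(item.sum())  # placeholder per-item computation
                start = end

            out = pb_utils.Tensor("OUTPUT", np.array(per_item, dtype=np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```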

Alonelymess · Sep 12 '24 07:09

> The inference only has output for the 1st batch. Why is that? Is there a way to do dynamic batching for a list of images of different shapes without padding them to the same length?

> For the Python and ONNX backends, you could use ragged batching: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/ragged_batching.html
> Please note that for the ONNX backend, ragged batching may require model support to understand the additional inputs needed for detecting batch boundaries.

> Can you provide an example of how to process this in the Python backend? I don't know how I should handle the request. Normally I use "pb_utils.get_input_tensor_by_name" to get the input. Can I do the same to get the "INDEX" tensor when the config uses ragged batching? My model can accept 1D input. I have been researching for hours and haven't seen any example for Python. Do I process each part of the ragged array by index in a for loop? How does that make the workflow better?

Same question. Do you know how to get the batch index (INDEX) in the Python backend?

xiaochus · Jan 13 '25 11:01