
[Bug] Model 'ensemble' receives inputs originated from different decoupled models

Open michaelnny opened this issue 1 year ago • 6 comments

Description In an ensemble pipeline for the TensorRT-LLM backend, when we try to propagate data from the preprocessing model to the postprocessing model, we get this error: Model 'ensemble' receives inputs originated from different decoupled models

Here's a summary of the problem:

  • We use TensorRT-LLM to build the Llama3 model engine
  • This issue only occurs when decoupled mode is enabled
  • We need decoupled mode enabled in order to use the streaming feature (see the client sketch below)
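
For context on the last point: decoupled mode lets the backend return multiple responses per request, which the client then consumes over a gRPC stream. A minimal client-side sketch of that streaming usage is below; the input tensor name "text_input" and the request contents are placeholders, not our actual client code.

# Minimal gRPC streaming client sketch; "text_input" is a placeholder name.
import queue

import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def callback(result, error):
    # Every streamed response (or error) from the decoupled pipeline lands here.
    responses.put(error if error is not None else result)

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=callback)

# Placeholder request: a single string input for the ensemble.
text = np.array([["What is the capital of France?"]], dtype=object)
inp = grpcclient.InferInput("text_input", list(text.shape), "BYTES")
inp.set_data_from_numpy(text)

client.async_stream_infer(model_name="ensemble", inputs=[inp])

# Drain one response for illustration, then close the stream.
print(responses.get())
client.stop_stream()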

Triton Information NVIDIA Release 24.04 (build 90085495) Triton Server Version 2.45.0

Using the Triton image in a container

To Reproduce

Step 1: Enable decoupled mode inside tensorrt_llm/config.pbtxt:

name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 64

model_transaction_policy {
  decoupled: true
}

Step 2: Add a new input field to the postprocessing model inside postprocessing/config.pbtxt:

name: "postprocessing"
backend: "python"
max_batch_size: 64
input [
  {
    name: "INPUT_TOKENS"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  ...
]

Step 3: Try to inherit data from the preprocessing step inside ensemble/config.pbtxt:

ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      ...
      output_map {
        key: "INPUT_ID"
        value: "_INPUT_ID"
      }
      ...
    },
    {
      model_name: "tensorrt_llm"
      model_version: -1
      ...
    },
    {
      model_name: "postprocessing"
      model_version: -1
      input_map {
        key: "INPUT_TOKENS" # add a new field which was propagated from "preprocessing"
        value: "_INPUT_ID"
      }
      ...
    }
  ]
}

Step 4: Start the Triton server; we get the following error, which causes the server to shut down.

E0525 08:29:56.598979 93 model_repository_manager.cc:579] Invalid argument: in ensemble ensemble, step of model 'ensemble' receives inputs originated from different decoupled models

Expected behavior We should be able to propagate data within the ensemble pipeline without disabling decoupled mode, since we need the streaming feature.

michaelnny avatar May 25 '24 08:05 michaelnny

I have the exact same problem: trying to propagate the input to the postprocessing model in the ensemble pipeline and getting the same error. Have you found a solution yet?

adrian-tsang-elucid avatar Aug 30 '24 19:08 adrian-tsang-elucid

@adrtsang no solution or workaround; I've moved on and am using vLLM now, which seems to have much better community support.

michaelnny avatar Sep 01 '24 06:09 michaelnny

I am using an ensemble pipeline that serves a TensorRT model. The input to the pre-processing model is an image volume, which also needs to be propagated to the post-processing model. Both the pre- and post-processing models have decoupled: true set in their model_transaction_policy. Here's the config.pbtxt for the ensemble pipeline:

name: "ensemble_model"
platform: "ensemble"
max_batch_size: 0
input [
  {
    name: "input_image"
    data_type: TYPE_FP16
    dims: [ -1, -1, -1 ]
  }
]
output [
  {
    name: "postprocessed_image"
    data_type: TYPE_FP16
    dims: [ -1, -1, -1 ]
  }
]

ensemble_scheduling {
  step [
    {
      model_name: "whs_preprocess_model"
      model_version: -1
      input_map {
        key: "whs_preprocess_model_input_image"
        value: "input_image"
      }
      output_map {
        key: "whs_preprocess_model_patch"
        value: "whs_image_patch"
      }
      output_map {
        key: "whs_preprocess_model_pad_dimension"
        value: "whs_res_arr_pad_dimension"
      }
      output_map {
        key: "whs_preprocess_model_padding"
        value: "whs_slicer_to_padding"
      }
      output_map {
        key: "whs_preprocess_model_slicer"
        value: "whs_slicer"
      }
      output_map {
        key: "whs_preprocess_model_background"
        value: "whs_background_indices"
      }
    },
    {
      model_name: "whs_model"
      model_version: -1
      input_map {
        key: "input"
        value: "whs_image_patch"
      }
      output_map {
        key: "output"
        value: "whs_model_prediction"
      }
    },
    {
      model_name: "whs_postprocess_model"
      model_version: -1
      input_map {
        key: "whs_postprocess_input_prediction"
        value: "whs_model_prediction"
      }
      input_map {
        key: "whs_postprocess_model_pad_dimension"
        value: "whs_pad_dimension"
      }
      input_map {
        key: "whs_postprocess_model_padding"
        value: "whs_slicer_to_padding"
      }
      input_map {
        key: "whs_postprocess_model_slicer"
        value: "whs_slicer"
      }
      input_map {
        key: "whs_postprocess_model_background"
        value: "whs_background_indices"
      }
      input_map {
        key: "whs_postprocess_input_image"
        value: "input_image"
      }
      output_map {
        key: "whs_postprocessed_output"
        value: "postprocessed_image"
      }
    }
  ]
}

When loading this ensemble pipeline in the Triton server (nvcr.io/nvidia/tritonserver:24.07-py3), it gives an error:

E0903 17:36:36.480489 1 model_repository_manager.cc:614] "Invalid argument: in ensemble ensemble_model, step of model 'ensemble_model' receives inputs originated from different decoupled models"

How can I propagate the input image to both pre- and post-processing models?

adrian-tsang-elucid avatar Sep 03 '24 17:09 adrian-tsang-elucid

If I remember correctly, it does not work if we enable decoupled mode in the ensemble model pipeline; not sure if this is by design.

You can try to disable it and see if it works, but doing so will also disable streaming.

michaelnny avatar Sep 04 '24 04:09 michaelnny

I had the same "receives inputs originated from different decoupled models" issue and was able to resolve it. In my case, the issue was that I had mapped the same ensemble tensor as an input to several different models, which doesn't work correctly when one of them is decoupled.

Here is a simplified config.pbtxt example that was causing this error for me. Note:

  • Only the second model is decoupled in this case
  • All models need to work with the INPUT_JSON request

name: "minimal_example"
platform: "ensemble"
input [
  {
    name: "INPUT_JSON"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "OUTPUT_IDS"
    data_type: TYPE_INT16
    dims: [ -1 ]
  }
]
ensemble_scheduling {
  step [
    # Not decoupled
    {
      model_name: "fetch_signals"
      model_version: 1
      input_map {
        key: "SIGNAL_FETCH_INPUT_JSON"
        value: "INPUT_JSON"  # Input request JSON
      }
      output_map {
        key: "SIGNAL_FETCH_OUTPUT_SIGNAL"
        value: "SIGNAL_FETCH_OUTPUT_SIGNAL"
      }
    },
    # "inference" stage is decoupled
    { 
      model_name: "inference"
      model_version: 1
      input_map {
        key: "INFERENCE_INPUT_SIGNAL"
        value: "SIGNAL_FETCH_OUTPUT_SIGNAL"
      }
      input_map {
        key: "INFERENCE_INPUT_JSON"
        value: "INPUT_JSON"  # Input request JSON
      }
      output_map {
        key: "INFERENCE_OUTPUT"
        value: "INFERENCE_OUTPUT"
      }
    },
    # Not decoupled
    {
      model_name: "last_stage"
      model_version: 1
      input_map {
        key: "LAST_STAGE_INPUT"
        value: "INFERENCE_OUTPUT"
      }
      input_map {
        key: "LAST_STAGE_INPUT_JSON"
        value: "INPUT_JSON"  # Input request JSON
      }
      output_map {
        key: "LAST_STAGE_OUTPUT_IDS"
        value: "OUTPUT_IDS"
      }
    }
  ]
}

The issue here is that the last_stage takes both INFERENCE_OUTPUT and INPUT_JSON as inputs. This works fine in non-decoupled mode as we can just pass along INPUT_JSON from the overall ensemble input.

However, in decoupled mode, it seems like last_stage (not itself decoupled, but following the decoupled inference stage) needs all of its inputs to be provided by the previous decoupled inference stage. Otherwise, last_stage receives its inputs from different sources: the decoupled inference stage, which provides INFERENCE_OUTPUT, and the ensemble request itself, which provides INPUT_JSON.

This seems like a bug in Triton Inference Server. I worked around this by essentially having the second decoupled inference stage output all the tensors needed by the next non-decoupled stage. So the modified config.pbtxt makes the following changes:

# Other fields unchanged
ensemble_scheduling {
  step [
    # First stage "fetch_signals" is unchanged
    {...},
    # Decoupled "inference" stage re-outputs request JSON
    { 
      model_name: "inference"
      # All fields unchanged but add:
      output_map {
        # inference stage model.py will output "INFERENCE_INPUT_JSON"
        # which will be the same value as "INPUT_JSON"
        key: "INFERENCE_INPUT_JSON" 
        # Use a different output name than "INPUT_JSON" to fix the bug
        value: "INFERENCE_REQUEST_INPUT_JSON"
      }
    },
    # Not decoupled "last_stage" is changed to modify the input:
    {
      model_name: "last_stage"
      # All other fields unchanged but modify:
      input_map {
        key: "LAST_STAGE_INPUT_JSON"
        value: "INFERENCE_REQUEST_INPUT_JSON"  # Modified name
      }
    }
  ]
}

Essentially, the fix is for the decoupled stage to re-output INPUT_JSON as INFERENCE_INPUT_JSON, which the ensemble maps to INFERENCE_REQUEST_INPUT_JSON and the next stage then consumes as LAST_STAGE_INPUT_JSON.

This fix ensures all the inputs to the last_stage are produced by the previous decoupled inference stage and can be consumed at the same time.

In addition to the ensemble config.pbtxt change above, I also had to change the config.pbtxt for the inference stage to output INFERENCE_INPUT_JSON as well, and change its model.py code to fetch the input tensor and re-output it as INFERENCE_INPUT_JSON.
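
For concreteness, a minimal sketch of what that model.py change can look like, assuming the standard Python backend API. Only the echoing of INFERENCE_INPUT_JSON is the actual point; the INFERENCE_OUTPUT values and shapes are placeholders I made up for illustration.

# Hypothetical sketch of the decoupled "inference" stage's model.py.
# It echoes the request JSON back out under the tensor name
# "INFERENCE_INPUT_JSON", which must also be declared as an output in
# that stage's config.pbtxt.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()

            # Request JSON that the ensemble mapped in as INFERENCE_INPUT_JSON.
            in_json = pb_utils.get_input_tensor_by_name(request, "INFERENCE_INPUT_JSON")

            # Placeholder standing in for the real inference result.
            out_ids = pb_utils.Tensor("INFERENCE_OUTPUT",
                                      np.zeros((4,), dtype=np.int16))

            # Re-output the JSON unchanged so the downstream non-decoupled
            # stage receives all of its inputs from this decoupled stage.
            echoed_json = pb_utils.Tensor("INFERENCE_INPUT_JSON", in_json.as_numpy())

            sender.send(pb_utils.InferenceResponse(
                output_tensors=[out_ids, echoed_json]))
            # Close the response stream for this request (decoupled mode).
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        return None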

Hope this helps you demystify the cause of this bug in your case.

lakshbhasin avatar Sep 05 '24 18:09 lakshbhasin

Hi @lakshbhasin, I was able to implement the workaround above and got further in my ensemble pipeline, thank you! However, I now run into an issue where the outputs from the pre-processing model are not passed properly to the next stage (inference) in the pipeline. I get the following error when I run my client code:

tritonclient.utils.InferenceServerException: in ensemble 'whs_ensemble_model', [request id: 1] input byte size mismatch for input 'whs_inference_input_background' for model 'whs_inference_model'. Expected 6658560, got 0

I return the outputs in the pre-processing model.py script using inference_response = pb_utils.InferenceResponse(output_tensors=[output1, output2, output3]). Any idea why zero bytes of input data are passed to the next stage in the pipeline, causing this exception?
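
For reference, the relevant part of my pre-processing model.py is roughly like the sketch below (heavily simplified; the tensor shapes and dtypes are placeholders, and the response-sender handling is shown only as the general decoupled-mode pattern, not my exact code):

# Heavily simplified sketch of the decoupled pre-processing execute();
# the tensor shapes and dtypes below are placeholders, not the real values.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()

            # Placeholder pre-processing results; the real model computes these.
            output1 = pb_utils.Tensor("whs_preprocess_model_patch",
                                      np.zeros((1, 96, 96, 96), dtype=np.float16))
            output2 = pb_utils.Tensor("whs_preprocess_model_slicer",
                                      np.zeros((6,), dtype=np.int32))
            output3 = pb_utils.Tensor("whs_preprocess_model_background",
                                      np.zeros((128,), dtype=np.int64))

            inference_response = pb_utils.InferenceResponse(
                output_tensors=[output1, output2, output3])
            sender.send(inference_response)
            # The final flag closes the response stream for this request.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        return None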

adrian-tsang-elucid avatar Sep 16 '24 15:09 adrian-tsang-elucid