TensorRT-LLM
Question about Orchestrator mode
The Executor API introduces Leader and Orchestrator modes.
Leader mode works via MPI. How is Orchestrator mode implemented? Does it use MPI itself? Which mode is preferable for performance: Leader or Orchestrator?
Is it possible to have one process containing multiple Executors running in Orchestrator mode?
We introduced orchestrator mode to simplify the deployment of multiple TRT-LLM model instances. For deploying a single TRT-LLM model instance, we recommend leader mode since orchestrator mode requires additional communications between the orchestrator process and the lead worker process.
Both leader and orchestrator modes currently use MPI. Leader mode relies on the user launching TP*PP MPI ranks with mpirun to serve a model with tensor parallelism TP and pipeline parallelism PP. With orchestrator mode, MPI_Comm_spawn is used to spawn the required processes to serve the model. See https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/cpp/executor#multi-gpu-run for example usage of leader and orchestrator modes. A minimal leader-mode configuration sketch is shown below.
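For illustration, here is a minimal leader-mode sketch, assuming the same runtimeOpts helper and tle namespace alias used in the orchestrator snippet further below (variable names are illustrative, not taken from the example); with TP=2 the program would be launched with something like mpirun -n 2.
// Assumes: namespace tle = tensorrt_llm::executor;
// Leader mode sketch: every MPI rank launched by mpirun runs this same code
auto leaderExecutorConfig = tle::ExecutorConfig(runtimeOpts.beamWidth);
auto leaderParallelConfig = tle::ParallelConfig(tle::CommunicationType::kMPI, tle::CommunicationMode::kLEADER);
leaderExecutorConfig.setParallelConfig(leaderParallelConfig);
auto leaderExecutor = tle::Executor(runtimeOpts.trtEnginePath, tle::ModelType::kDECODER_ONLY, leaderExecutorConfig);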
With orchestrator mode, it's possible for a single process to create multiple tensorrt_llm::executor::Executor instances, each using their own engine, and their own GPUs. For example, you could modify the executorExampleAdvanced example as follows to create multiple executor instances with orchestrator mode.
...
// Create orchestratorConfig
auto orchestratorConfig = tle::OrchestratorConfig(true, runtimeOpts.workerExecutablePath);
// First executor instance using GPUs 0,1
auto executorConfig = tle::ExecutorConfig(runtimeOpts.beamWidth);
auto parallelConfig = tle::ParallelConfig(tle::CommunicationType::kMPI, tle::CommunicationMode::kORCHESTRATOR,
    std::vector<tle::SizeType32>({0, 1}), std::nullopt, orchestratorConfig);
executorConfig.setParallelConfig(parallelConfig);
auto executor = tle::Executor(runtimeOpts.trtEnginePath, tle::ModelType::kDECODER_ONLY, executorConfig);
// Second executor instance using GPUs 2,3
auto executorConfig2 = tle::ExecutorConfig(runtimeOpts.beamWidth);
auto parallelConfig2 = tle::ParallelConfig(tle::CommunicationType::kMPI, tle::CommunicationMode::kORCHESTRATOR,
    std::vector<tle::SizeType32>({2, 3}), std::nullopt, orchestratorConfig);
executorConfig2.setParallelConfig(parallelConfig2);
auto executor2 = tle::Executor(runtimeOpts.trtEnginePath, tle::ModelType::kDECODER_ONLY, executorConfig2);
// Create the requests
auto requestIds = enqueueRequests(runtimeOpts, executor);
auto requestIds2 = enqueueRequests(runtimeOpts, executor2);
// Wait for responses and store output tokens
auto outputTokens = waitForResponses(runtimeOpts, requestIds, executor);
auto outputTokens2 = waitForResponses(runtimeOpts, requestIds2, executor2);
// Write output tokens csv file
writeOutputTokens("output_tokens.csv", requestIds, outputTokens, runtimeOpts.beamWidth);
writeOutputTokens("output_tokens2.csv", requestIds2, outputTokens2, runtimeOpts.beamWidth);
The executorWorker executable is provided with the pip wheel and can be found, for example, at:
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/bin/executorWorker
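For example, the orchestrator could be pointed at that binary directly; the exact path depends on your Python version and install prefix, so treat the one shown as illustrative.
// Point the orchestrator at the executorWorker binary shipped with the wheel (path is illustrative)
auto orchestratorConfig = tle::OrchestratorConfig(
    true, "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/bin/executorWorker");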
Thanks for the answer
@pcastonguay Hello, I've run into a problem with orchestrator mode. I am using Triton Inference Server with the TensorRT-LLM backend to deploy two LLaMA models on a single GPU (A40) in orchestrator mode, and everything works fine. However, when I start multiple services on different GPUs (one service per GPU, with all GPUs on the same server), GPU utilization drops. Profiling with nsys, I found that the GPUs are waiting for requests. Can you tell me the reason?