
Question about Orchestrator mode

Open akhoroshev opened this issue 1 year ago • 3 comments

The Executor API introduces Leader and Orchestrator modes.

Leader mode works via MPI. How is Orchestrator mode implemented? Does it use MPI itself? Which mode is preferable for performance: Leader or Orchestrator?

akhoroshev avatar May 15 '24 10:05 akhoroshev

Is it possible to have multiple Executors in Orchestrator mode within a single process?

akhoroshev avatar May 15 '24 10:05 akhoroshev

We introduced orchestrator mode to simplify the deployment of multiple TRT-LLM model instances. For deploying a single TRT-LLM model instance, we recommend leader mode, since orchestrator mode requires additional communication between the orchestrator process and the lead worker process.

Both leader and orchestrator modes currently use MPI. Leader mode relies on the user launching TP*PP MPI ranks with mpirun to serve a model with tensor parallelism TP and pipeline parallelism PP. With orchestrator mode, MPI_Comm_spawn is used to spawn the required processes to serve the model. See https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/cpp/executor#multi-gpu-run for example usage of leader and orchestrator modes.
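
For comparison, here is a minimal sketch of what leader-mode configuration looks like on the C++ Executor API (assuming the tle namespace alias and runtimeOpts names from the example below; every MPI rank launched by mpirun runs this same code):

    // Leader mode: each MPI rank constructs the Executor itself; no orchestrator process is involved
    auto executorConfig = tle::ExecutorConfig(runtimeOpts.beamWidth);
    auto parallelConfig = tle::ParallelConfig(tle::CommunicationType::kMPI, tle::CommunicationMode::kLEADER);
    executorConfig.setParallelConfig(parallelConfig);
    auto executor = tle::Executor(runtimeOpts.trtEnginePath, tle::ModelType::kDECODER_ONLY, executorConfig);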

With orchestrator mode, it's possible for a single process to create multiple tensorrt_llm::executor::Executor instances, each using its own engine and its own GPUs. For example, you could modify the executorExampleAdvanced example as follows to create multiple executor instances with orchestrator mode.

   ...
    // Create orchestratorConfig
    auto orchestratorConfig = tle::OrchestratorConfig(true, runtimeOpts.workerExecutablePath);

    // First executor instance using GPUs 0,1
    auto executorConfig = tle::ExecutorConfig(runtimeOpts.beamWidth);
    auto parallelConfig = tle::ParallelConfig(tle::CommunicationType::kMPI, tle::CommunicationMode::kORCHESTRATOR,
        std::vector<tle::SizeType32>({0, 1}), std::nullopt, orchestratorConfig);
    executorConfig.setParallelConfig(parallelConfig);
    auto executor = tle::Executor(runtimeOpts.trtEnginePath, tle::ModelType::kDECODER_ONLY, executorConfig);

    // Second executor instance using GPUs 2,3
    auto executorConfig2 = tle::ExecutorConfig(runtimeOpts.beamWidth);
    auto parallelConfig2 = tle::ParallelConfig(tle::CommunicationType::kMPI, tle::CommunicationMode::kORCHESTRATOR,
        std::vector<tle::SizeType32>({2, 3}), std::nullopt, orchestratorConfig);
    executorConfig2.setParallelConfig(parallelConfig2);
    auto executor2 = tle::Executor(runtimeOpts.trtEnginePath, tle::ModelType::kDECODER_ONLY, executorConfig2);

    // Create the requests
    auto requestIds = enqueueRequests(runtimeOpts, executor);
    auto requestIds2 = enqueueRequests(runtimeOpts, executor2);

    // Wait for responses and store output tokens
    auto outputTokens = waitForResponses(runtimeOpts, requestIds, executor);
    auto outputTokens2 = waitForResponses(runtimeOpts, requestIds2, executor2);

    // Write output tokens csv file
    writeOutputTokens("output_tokens.csv", requestIds, outputTokens, runtimeOpts.beamWidth);
    writeOutputTokens("output_tokens2.csv", requestIds2, outputTokens2, runtimeOpts.beamWidth);

The executorWorker executable ships with the pip wheel; for example, it can be found at:

/usr/local/lib/python3.10/dist-packages/tensorrt_llm/bin/executorWorker
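
As a sketch (assuming that path matches your installation), this worker path is what gets passed into the OrchestratorConfig shown above:

    // Point the orchestrator at the worker binary shipped with the wheel
    auto orchestratorConfig = tle::OrchestratorConfig(
        true, "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/bin/executorWorker");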

pcastonguay avatar May 21 '24 13:05 pcastonguay

Thanks for the answer

akhoroshev avatar May 22 '24 20:05 akhoroshev

@pcastonguay Hello, I have run into a problem with orchestrator mode. I am using Triton Inference Server with TensorRT-LLM as the backend to deploy two LLaMA models on a single GPU (A40) in orchestrator mode, and everything works fine. However, when I start multiple services on different GPUs (one service per GPU, with all GPUs on the same server), GPU utilization drops. Using nsys profile, I found that the GPUs are waiting for requests. Can you tell me the reason?

GaryGao99 avatar Feb 27 '25 10:02 GaryGao99