
Results 52 comments of pcastonguay

@pfldy2850 could you share your build.py command? Are you using the Triton tensorrt_llm backend? If so, could you also share the config.pbtxt for the `tensorrt_llm` model? You should have a...

When using the `GUARANTEED_NO_EVICT` scheduling policy (https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/perf_best_practices.md#batch-scheduler-policy) the scheduler will only schedule a request if the KV cache has enough blocks to drive that request to completion (it assumes the...
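For illustration only, here is a minimal Python sketch of that admission check, assuming a paged KV cache with a fixed `tokens_per_block`; the function and parameter names are hypothetical and this is not the actual TensorRT-LLM scheduler code.

```
import math

# Hypothetical sketch of a GUARANTEED_NO_EVICT-style admission check.
# tokens_per_block, free_blocks and the worst-case token count are
# illustrative assumptions, not the real TensorRT-LLM implementation.
def can_schedule(prompt_len: int, max_new_tokens: int,
                 free_blocks: int, tokens_per_block: int = 64) -> bool:
    # Worst case: the KV cache must hold every prompt token plus every token
    # the request could still generate, so it never has to be evicted.
    max_total_tokens = prompt_len + max_new_tokens
    blocks_needed = math.ceil(max_total_tokens / tokens_per_block)
    return blocks_needed <= free_blocks
```

For example, `can_schedule(1000, 200, free_blocks=20)` admits the request because 1200 tokens need ceil(1200 / 64) = 19 blocks, which fits in the 20 free blocks.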

Hi, thanks for reporting this issue. I haven't been able to reproduce on latest `main` on 2xA100. What `--max_batch_size` value did you use (it's not specified in the build...

I also just tested on 2xA30 and cannot reproduce using latest `main` following the instructions shared above.

```
mpirun -n 2 --allow-run-as-root ./gptManagerBenchmark --engine_dir ../../../examples/llama/tmp/llama/13B/trt_engines/fp16/2-gpu/ --dataset ../../../benchmarks/cpp/token-norm-dist.json --kv_cache_free_gpu_mem_fraction 0.85 --enable_kv_cache_reuse...
```

We introduced orchestrator mode to simplify the deployment of multiple TRT-LLM model instances. For deploying a single TRT-LLM model instance, we recommend leader mode since orchestrator mode requires additional communications...

hi @HalteroXHunter, would it be possible for you to try with the 24.01 branch? Also, could you launch the Triton server with the `--log` option and share the `triton_log.txt` log file?...

To get logits back from TRT-LLM, you would need to build your engine with the `--gather_all_token_logits` option. See the `Optional outputs` section here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_runtime.md If using the Triton backend, you would also...
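As a rough sketch (not taken from the linked docs), a gRPC client request for logits could look like the following; the model name `tensorrt_llm` and the tensor names (`input_ids`, `input_lengths`, `request_output_len`, `context_logits`) are assumptions that should be checked against your config.pbtxt, since they vary across backend versions.

```
import numpy as np
import tritonclient.grpc as grpcclient

# Hedged sketch: tensor and model names are assumptions based on a typical
# tensorrt_llm model config; verify them against your own config.pbtxt.
client = grpcclient.InferenceServerClient("localhost:8001")

input_ids = np.array([[1, 2, 3, 4]], dtype=np.int32)   # already-tokenized prompt
input_lengths = np.array([[4]], dtype=np.int32)
output_len = np.array([[16]], dtype=np.int32)

inputs = []
for name, data in [("input_ids", input_ids),
                   ("input_lengths", input_lengths),
                   ("request_output_len", output_len)]:
    t = grpcclient.InferInput(name, list(data.shape), "INT32")
    t.set_data_from_numpy(data)
    inputs.append(t)

# The logits output only exists if the engine was built with
# --gather_all_token_logits and the output is declared in config.pbtxt.
outputs = [grpcclient.InferRequestedOutput("context_logits")]

result = client.infer("tensorrt_llm", inputs, outputs=outputs)
print(result.as_numpy("context_logits").shape)
```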

Currently, the C++ Triton backend only accepts batch size 1 requests. We use in-flight batching to create larger batches from those batch size 1 requests. We don't have a timeline...

Based on the error you shared, the `deviceId` specified in your config.pbtxt is incorrect. What's the output of `nvidia-smi` inside the container? I'm assuming the deviceId is 0 inside the...
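For example, a quick way to confirm which device indices are actually visible inside the container (assuming `nvidia-smi` is available there):

```
import subprocess

# List the GPUs visible inside the container; if only one GPU is exposed,
# the config.pbtxt would normally reference device 0.
print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)
```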

You could just send multiple requests, each request containing a single sentence.
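As a hedged illustration of that, the sketch below fires several independent batch-size-1 requests with the Python Triton client and lets in-flight batching group them on the server; the model name `ensemble` and the tensor names (`text_input`, `max_tokens`, `text_output`) are assumptions to adjust to your model repository.

```
import numpy as np
import tritonclient.http as httpclient

# Hedged sketch: each request carries a single sentence; the server's
# in-flight batcher is what forms larger batches out of them.
# Model/tensor names below are assumptions; match them to your repository.
client = httpclient.InferenceServerClient("localhost:8000", concurrency=4)
sentences = ["What is TensorRT-LLM?", "Explain in-flight batching.", "Hello!"]

pending = []
for text in sentences:
    text_in = httpclient.InferInput("text_input", [1, 1], "BYTES")
    text_in.set_data_from_numpy(np.array([[text]], dtype=object))
    max_tokens = httpclient.InferInput("max_tokens", [1, 1], "INT32")
    max_tokens.set_data_from_numpy(np.array([[64]], dtype=np.int32))
    # async_infer returns immediately, so the requests overlap on the server.
    pending.append(client.async_infer("ensemble", [text_in, max_tokens]))

for req in pending:
    print(req.get_result().as_numpy("text_output"))
```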