pcastonguay
@pfldy2850 could you share your build.py command? Are you using the Triton tensorrt_llm backend? If so, could you also share the config.pbtxt for the `tensorrt_llm` model? You should have a...
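For reference, a build command for the LLaMA example typically looks like the sketch below (flags and paths are illustrative, not your actual command; in-flight batching with the Triton backend needs the paged KV cache and the attention plugin):

```bash
# Hypothetical examples/llama/build.py invocation; adjust to your setup.
python build.py --model_dir ./llama-13b-hf \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --remove_input_padding \
    --paged_kv_cache \
    --world_size 2 \
    --output_dir ./tmp/llama/13B/trt_engines/fp16/2-gpu/
```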
When using the `GUARANTEED_NO_EVICT` scheduling policy (https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/perf_best_practices.md#batch-scheduler-policy), the scheduler will only schedule a request if the KV cache has enough free blocks to drive that request to completion (it assumes the...
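With the Triton backend, the policy is selected in the `tensorrt_llm` model's config.pbtxt. A minimal sketch, assuming the parameter name used by the tensorrtllm_backend templates:

```
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "guaranteed_no_evict"
  }
}
```

The other accepted value is `max_utilization`, which schedules more aggressively at the risk of pausing (evicting and later resuming) requests when KV cache blocks run out.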
Hi, thanks for reporting this issue. I haven't been able to reproduce it on latest `main` on 2xA100. What `--max_batch_size` value did you use (it's not specified in the build...
I also just tested on 2xA30 and cannot reproduce using latest `main`, following the instructions shared above.

```
mpirun -n 2 --allow-run-as-root ./gptManagerBenchmark \
    --engine_dir ../../../examples/llama/tmp/llama/13B/trt_engines/fp16/2-gpu/ \
    --dataset ../../../benchmarks/cpp/token-norm-dist.json \
    --kv_cache_free_gpu_mem_fraction 0.85 \
    --enable_kv_cache_reuse...
```
We introduced orchestrator mode to simplify the deployment of multiple TRT-LLM model instances. For deploying a single TRT-LLM model instance, we recommend leader mode, since orchestrator mode requires additional communications...
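As a rough sketch of the two launch styles (the script and `--world_size`/`--model_repo` flags are from the tensorrtllm_backend repo; `--multi-model` is my assumption for how recent versions enable orchestrator mode, so check `--help` on your version):

```bash
# Leader mode: one Triton rank per GPU, spawned via MPI by the launch script.
python3 scripts/launch_triton_server.py --world_size 2 --model_repo /path/to/model_repo

# Orchestrator mode: a single frontend process orchestrates the model instances.
python3 scripts/launch_triton_server.py --multi-model --model_repo /path/to/model_repo
```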
Hi @HalteroXHunter, would it be possible for you to try with the 24.01 branch? Also, could you launch the Triton server with the `--log` option and share the `triton_log.txt` log file?...
To get logits back from TRT-LLM, you would need to build your engine with the `--gather_all_token_logits` option. See the `Optional outputs` section here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_runtime.md If using the Triton backend, you would also...
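A sketch of how the pieces fit together (the build flag is from the docs above; the Triton tensor names follow the inflight batcher templates and should be treated as assumptions for your version):

```bash
# Build the engine so per-token logits are gathered and can be returned.
# Other flags are illustrative; use your usual build configuration.
python build.py --model_dir ./model --dtype float16 \
    --gather_all_token_logits \
    --output_dir ./engines
```

On the Triton side, the optional outputs are then requested per inference, e.g. by setting the `return_context_logits` / `return_generation_logits` input tensors and reading the `context_logits` / `generation_logits` outputs.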
Currently, the C++ Triton backend only accepts batch size 1 requests. We use in-flight batching to create larger batches from those batch size 1 requests. We don't have a timeline...
Based on the error you shared, the `deviceId` specified in your config.pbtxt is incorrect. What's the output of `nvidia-smi` inside the container? I'm assuming the deviceId is 0 inside the...
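For reference, a sketch of the relevant snippet in the tensorrtllm_backend config.pbtxt templates (I'm assuming the `gpu_device_ids` parameter is what maps to the `deviceId` in the error):

```
parameters: {
  key: "gpu_device_ids"
  value: {
    string_value: "0"
  }
}
```

The IDs here must match the devices actually visible inside the container (i.e., the ordering `nvidia-smi` reports there, which can differ from the host's).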
You could just send multiple requests, each request containing a single sentence.
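A minimal sketch with the Python Triton client (the model and tensor names assume the stock `ensemble` model from the inflight batcher templates):

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

sentences = ["First sentence to process.", "Second sentence to process."]
for sentence in sentences:
    # Each request carries a single sentence (batch size 1); the in-flight
    # batcher on the server combines concurrent requests into larger batches.
    text_input = httpclient.InferInput("text_input", [1, 1], "BYTES")
    text_input.set_data_from_numpy(np.array([[sentence.encode()]], dtype=object))

    max_tokens = httpclient.InferInput("max_tokens", [1, 1], "INT32")
    max_tokens.set_data_from_numpy(np.array([[64]], dtype=np.int32))

    result = client.infer("ensemble", [text_input, max_tokens])
    print(result.as_numpy("text_output"))
```

To actually benefit from in-flight batching, send the requests concurrently (e.g. with the async gRPC client) rather than sequentially as in this loop.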