pcastonguay
@pfldy2850 could you share your build.py command? Are you using the Triton tensorrt_llm backend? If so, could you also share the config.pbtxt for the `tensorrt_llm` model? You should have a...
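For reference, a build command for the LLaMA example typically looks like the sketch below (flags and paths are illustrative, not your actual command; in-flight batching with the Triton backend needs the paged KV cache and the attention plugin):

```bash
# Hypothetical examples/llama/build.py invocation; adjust to your setup.
python build.py --model_dir ./llama-13b-hf \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --remove_input_padding \
    --paged_kv_cache \
    --world_size 2 \
    --output_dir ./tmp/llama/13B/trt_engines/fp16/2-gpu/
```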
When using the `GUARANTEED_NO_EVICT` scheduling policy (https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/perf_best_practices.md#batch-scheduler-policy), the scheduler will only schedule a request if the KV cache has enough free blocks to drive that request to completion (it assumes the...
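With the Triton backend, the policy is selected in the `tensorrt_llm` model's config.pbtxt. A minimal sketch, assuming the parameter name used by the tensorrtllm_backend templates:

```
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "guaranteed_no_evict"
  }
}
```

The other accepted value is `max_utilization`, which schedules more aggressively at the risk of pausing (evicting and later resuming) requests when KV cache blocks run out.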
Hi, thanks for reporting this issue. I haven't been able to reproduce it on latest `main` on 2xA100. What `--max_batch_size` value did you use (it's not specified in the build...
I also just tested on 2xA30 and cannot reproduce using latest `main`, following the instructions shared above.

```
mpirun -n 2 --allow-run-as-root ./gptManagerBenchmark \
    --engine_dir ../../../examples/llama/tmp/llama/13B/trt_engines/fp16/2-gpu/ \
    --dataset ../../../benchmarks/cpp/token-norm-dist.json \
    --kv_cache_free_gpu_mem_fraction 0.85 \
    --enable_kv_cache_reuse...
```
We introduced orchestrator mode to simplify the deployment of multiple TRT-LLM model instances. For deploying a single TRT-LLM model instance, we recommend leader mode, since orchestrator mode requires additional communications...
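As a rough sketch of the two launch styles (the script and `--world_size`/`--model_repo` flags are from the tensorrtllm_backend repo; `--multi-model` is my assumption for how recent versions enable orchestrator mode, so check `--help` on your version):

```bash
# Leader mode: one Triton rank per GPU, spawned via MPI by the launch script.
python3 scripts/launch_triton_server.py --world_size 2 --model_repo /path/to/model_repo

# Orchestrator mode: a single frontend process orchestrates the model instances.
python3 scripts/launch_triton_server.py --multi-model --model_repo /path/to/model_repo
```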
Hi @HalteroXHunter, would it be possible for you to try with the 24.01 branch? Also, could you launch the Triton server with the `--log` option and share the `triton_log.txt` log file?...
To get logits back from TRT-LLM, you would need to build your engine with the `--gather_all_token_logits` option. See the `Optional outputs` section here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_runtime.md If using the Triton backend, you would also...
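A sketch of how the pieces fit together (the build flag is from the docs above; the Triton tensor names follow the inflight batcher templates and should be treated as assumptions for your version):

```bash
# Build the engine so per-token logits are gathered and can be returned.
# Other flags are illustrative; use your usual build configuration.
python build.py --model_dir ./model --dtype float16 \
    --gather_all_token_logits \
    --output_dir ./engines
```

On the Triton side, the optional outputs are then requested per inference, e.g. by setting the `return_context_logits` / `return_generation_logits` input tensors and reading the `context_logits` / `generation_logits` outputs.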
Currently, the C++ Triton backend only accepts batch size 1 requests. We use in-flight batching to create larger batches from those batch size 1 requests. We don't have a timeline...
Based on the error you shared, the `deviceId` specified in your config.pbtxt is incorrect. What's the output of `nvidia-smi` inside the container? I'm assuming the deviceId is 0 inside the...
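For reference, a sketch of the relevant snippet in the tensorrtllm_backend config.pbtxt templates (I'm assuming the `gpu_device_ids` parameter is what maps to the `deviceId` in the error):

```
parameters: {
  key: "gpu_device_ids"
  value: {
    string_value: "0"
  }
}
```

The IDs here must match the devices actually visible inside the container (i.e., the ordering `nvidia-smi` reports there, which can differ from the host's).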
You could just send multiple requests, each request containing a single sentence.
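A minimal sketch with the Python Triton client (the model and tensor names assume the stock `ensemble` model from the inflight batcher templates):

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

sentences = ["First sentence to process.", "Second sentence to process."]
for sentence in sentences:
    # Each request carries a single sentence (batch size 1); the in-flight
    # batcher on the server combines concurrent requests into larger batches.
    text_input = httpclient.InferInput("text_input", [1, 1], "BYTES")
    text_input.set_data_from_numpy(np.array([[sentence.encode()]], dtype=object))

    max_tokens = httpclient.InferInput("max_tokens", [1, 1], "INT32")
    max_tokens.set_data_from_numpy(np.array([[64]], dtype=np.int32))

    result = client.infer("ensemble", [text_input, max_tokens])
    print(result.as_numpy("text_output"))
```

To actually benefit from in-flight batching, send the requests concurrently (e.g. with the async gRPC client) rather than sequentially as in this loop.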