QI JUN
There is a similar issue: https://github.com/NVIDIA/TensorRT-LLM/issues/1422. We are actively working on it.
Hi @1096125073, different batch sizes may dispatch to different kernels, so the results can differ. This is a known issue.
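For intuition, here is a minimal NumPy sketch (not TensorRT-LLM code) of why a different kernel alone can change results: floating-point addition is not associative, so two kernels that accumulate the same partial sums in a different order generally do not produce bit-identical outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

# Two accumulation orders over the same numbers, analogous to two GEMM
# kernels tiling the reduction differently for different batch sizes.
s_a = x.sum()           # NumPy's default reduction order
s_b = np.sort(x).sum()  # same values, different operand order
print(s_a, s_b, s_a == s_b)  # usually close, but typically not bit-identical
```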
@1096125073 Yes, I get your point: you repeat the same input prompt 4 times to form a batch, but the outputs differ from those at batch size 1. Unfortunately, it's a...
Hi @1096125073, I tried the llama2 model:

```bash
python convert_checkpoint.py --model_dir=/llm-models/llama-models-v2/llama-v2-7b-hf/ \
    --output_dir=./ckpt --dtype bfloat16
trtllm-build --checkpoint_dir=./ckpt --output_dir=./engine \
    --gemm_plugin bfloat16 --max_output_len=256 --max_batch_size=4
python ../run.py --engine_dir=./engine --max_output_len=10 \
    --tokenizer_dir /llm-models/llama-models-v2/llama-v2-7b-hf/ \
    --input_text 'How...
```
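For reference, a hedged sketch of how one might compare batch size 1 against batch size 4 with the commands above; it assumes `run.py --input_text` accepts multiple prompts to form a single batch, and the prompt string is a hypothetical placeholder, not the truncated one from my run:

```bash
# Hypothetical placeholder prompt; substitute your own.
PROMPT='What is the capital of France?'
# Batch size 1.
python ../run.py --engine_dir=./engine --max_output_len=10 \
    --tokenizer_dir /llm-models/llama-models-v2/llama-v2-7b-hf/ \
    --input_text "$PROMPT" | tee bs1.log
# Batch size 4: the same prompt repeated (assumes multiple --input_text
# values are batched together in one forward pass).
python ../run.py --engine_dir=./engine --max_output_len=10 \
    --tokenizer_dir /llm-models/llama-models-v2/llama-v2-7b-hf/ \
    --input_text "$PROMPT" "$PROMPT" "$PROMPT" "$PROMPT" | tee bs4.log
# If the issue reproduces, the generations in bs4.log differ from bs1.log
# even though every prompt in the batch is identical.
```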
@1096125073 Could you please try the main branch? It seems you are using version 0.9.0.