Kefeng-Duan
Hi @sdecoder, could you try using --load_model_on_cpu?
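For reference, a rough sketch of how that flag is usually passed to the conversion script (assuming the LLaMA example; the model/output paths and dtype are placeholders, not from this thread):

```bash
# Sketch only: keep the HF weights in host RAM during checkpoint conversion.
# model_dir / output_dir / dtype are placeholders.
python examples/llama/convert_checkpoint.py \
    --model_dir ./hf_model \
    --output_dir ./tllm_ckpt \
    --dtype float16 \
    --load_model_on_cpu
```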
@sdecoder Do you mean the weights are too big to be stored on the GPU (26GB > 24GB), so you need to offload some (or all) weights to CPU?...
How about referring to this one? https://github.com/NVIDIA/TensorRT-LLM/issues/1968#issuecomment-2252750163
@BooHwang Sorry, could you try --streamingllm enable when building the engine? See https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md#run-llama-with-streamingllm
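A minimal build sketch based on that README section (checkpoint/output paths and the gemm plugin dtype are placeholders):

```bash
# Sketch only: enable StreamingLLM at engine build time.
trtllm-build --checkpoint_dir ./tllm_ckpt \
             --output_dir ./engine \
             --gemm_plugin float16 \
             --streamingllm enable
```

The linked README section also covers the runtime-side window/sink settings needed when actually running with StreamingLLM.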
Hi @ayush1399, it looks like a version mismatch issue. Could you:
1. update to the latest commit
2. install the latest PyPI wheel
3. clean and rebuild trtllm
4. rebuild the engine
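A rough sketch of those steps, assuming a source checkout of TensorRT-LLM (paths are placeholders; use either the PyPI wheel or the source build, not both):

```bash
# 1. update to the latest commit
git pull origin main
# 2. install the latest PyPI wheel ...
pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
# 3. ... or clean and rebuild from source (TensorRT path is a placeholder)
python scripts/build_wheel.py --clean --trt_root /usr/local/tensorrt
pip install --force-reinstall build/tensorrt_llm-*.whl
# 4. rebuild the engine from the re-converted checkpoint
trtllm-build --checkpoint_dir ./tllm_ckpt --output_dir ./engine
```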
Hi @xiangxinhello, could you provide your /tmp/Qwen/7B/config.json file?
@nv-guomingz for visibility
Hi @zhaocc1106, could you update to the latest trtllm version?
@zhaocc1106 Could you double-check that you have successfully rebuilt and reinstalled v0.11.0? I think we have removed the '--use_custom_all_reduce' knob from the build flow, so you would get an error...
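One way to sanity-check which build is actually being picked up (commands are illustrative, not from this thread):

```bash
# Confirm the installed/imported package matches the rebuilt v0.11.0 wheel.
pip show tensorrt_llm | grep -i version
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```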
@zhaocc1106 Could you try enabling --context_fmha?
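A minimal sketch of adding that knob to the existing build command (keep the other flags as in the original build; paths are placeholders):

```bash
# Sketch only: turn on the fused context-phase attention kernel.
trtllm-build --checkpoint_dir ./tllm_ckpt \
             --output_dir ./engine \
             --context_fmha enable
```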