TensorRT-LLM
Support use_cache=False
Background:
In my case, I only need the LLM to generate one token.
I think setting use_cache=True has no effect on speed here, but it clearly increases memory usage.
So I set use_cache=False here:
https://github.com/NVIDIA/TensorRT-LLM/blob/850b6fa1e710d25769f2b560d897d2bd424a645e/tensorrt_llm/builder.py#L681
but the code reports an error.
So I was wondering if you have any plans to support use_cache=False?
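For context, a rough back-of-the-envelope estimate shows how much memory the KV cache can cost even when only one token is generated; the model dimensions below are illustrative placeholders, not taken from this thread:

```python
# Rough KV cache footprint: two tensors (K and V) per layer, each of shape
# [batch_size, num_kv_heads, max_seq_len, head_dim], allocated up front.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   max_seq_len, batch_size, bytes_per_elem=2):
    return (2 * num_layers * num_kv_heads * head_dim
            * max_seq_len * batch_size * bytes_per_elem)

# Illustrative 7B-class dimensions (placeholders, not a specific checkpoint):
# 32 layers, 32 KV heads, head_dim 128, fp16 cache, batch 8, 8K context.
print(kv_cache_bytes(32, 32, 128, max_seq_len=8192, batch_size=8) / 2**30, "GiB")
# -> 32.0 GiB reserved even though only one new token is ever produced.
```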
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
@QiJune @byshiue Is there any development plan for this issue?
Hi @Hukongtao , I am working on this. Please let me know your model so that I can verify against it.
qwen
Got it!
Hello, can Qwen2 support use_cache=False?
Hi @Hukongtao , we are actively working on this, but it won't make the next release because its impact went beyond our expectation. Please expect it to be officially included in the 0.12 release.
Looking forward to your release.
Hi, I'm also looking to disable the KV cache completely as my use-case requires only the first token generation.
The only workaround so far has been to set max_tokens_in_paged_kv_cache during inference to the minimum possible number (64).
Is there any ballpark number for how much this could potentially improve the TTFT? I'm using mosaicml/mpt-7b-8k - please do verify against it!
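For anyone trying the same workaround, here is a minimal sketch with the Python runtime. It assumes your installed TensorRT-LLM version's ModelRunnerCpp.from_dir accepts max_tokens_in_paged_kv_cache as a keyword (check the signature of your release); the engine path and token ids are placeholders:

```python
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

# Shrink the paged KV cache pool to the smallest value the runtime accepts
# (64 tokens, as mentioned above), since only the first token is generated.
runner = ModelRunnerCpp.from_dir(
    engine_dir="/path/to/engine",        # placeholder engine directory
    max_tokens_in_paged_kv_cache=64,
)

batch_input_ids = [torch.tensor([1, 2, 3, 4], dtype=torch.int32)]  # placeholder prompt
outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=1,    # single-token (first-token-only) generation
    end_id=2,            # placeholder; use your tokenizer's eos id
    pad_id=0,            # placeholder; use your tokenizer's pad id
)
print(outputs)
```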
Hi Saad: This feature is a work in progress. I can't see any smooth workaround for disabling the KV cache at the moment. It is targeting the 0.12 release.
Is this feature supported? https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.12.0
@Hukongtao Do you still have the question? If not, we will close it soon.