
Support use_cache=False

Open Hukongtao opened this issue 1 year ago • 8 comments

Background: In my case, I only need the LLM to generate a single token. Setting use_cache=True therefore gives no speed benefit, but it noticeably increases memory usage. I tried setting use_cache=False here: https://github.com/NVIDIA/TensorRT-LLM/blob/850b6fa1e710d25769f2b560d897d2bd424a645e/tensorrt_llm/builder.py#L681 but the code reports an error. Do you have any plans to support use_cache=False?

Hukongtao avatar Apr 08 '24 13:04 Hukongtao
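
For context, a rough sketch of the build flow the reporter is describing. The model class, checkpoint path, and BuildConfig fields are assumptions about the TensorRT-LLM Python API around that commit and may differ between releases; the point is only that the KV cache is wired into the engine at build time, and there is currently no supported flag to leave it out.

```python
# Hedged sketch only -- model class, checkpoint path, and config fields are
# illustrative assumptions, not a verified recipe for any specific release.
from tensorrt_llm import BuildConfig, build
from tensorrt_llm.models import QWenForCausalLM  # the reporter's model family

model = QWenForCausalLM.from_checkpoint("./qwen_trtllm_ckpt")  # hypothetical path
config = BuildConfig(
    max_batch_size=1,
    max_input_len=2048,  # prompt length budget
)

# builder.py prepares the network inputs with use_cache=True (the line linked
# above); flipping that to use_cache=False is what this issue asks to support,
# and doing so by hand currently fails during engine build.
engine = build(model, config)
engine.save("./qwen_engine")
```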

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar May 16 '24 01:05 github-actions[bot]

@QiJune @byshiue Is there any development plan for this issue?

Hukongtao avatar May 16 '24 11:05 Hukongtao

Hi @Hukongtao, I am working on this. Please let me know your model so I can verify against it.

joyang-nv avatar Jun 11 '24 09:06 joyang-nv

qwen

Hukongtao avatar Jun 13 '24 07:06 Hukongtao

Got it!

joyang-nv avatar Jun 13 '24 07:06 joyang-nv

Hello, does Qwen2 support use_cache=False yet?

Hukongtao avatar Jul 10 '24 13:07 Hukongtao

Hi @Hukongtao, we are actively working on this, but it won't make the next release because its impact is larger than we expected. Please expect it to be officially included in the 0.12 release.

joyang-nv avatar Jul 11 '24 07:07 joyang-nv

Looking forward to your release.

Hukongtao avatar Jul 11 '24 09:07 Hukongtao

Hi, I'm also looking to disable the KV cache completely, as my use case only requires generating the first token.

The only workaround so far has been to set max_tokens_in_paged_kv_cache to the minimum possible value (64) at inference time.

Is there a ballpark figure for how much this could improve TTFT (time to first token)? I'm using mosaicml/mpt-7b-8k - please do verify against it!

SaadKaleem avatar Jul 21 '24 22:07 SaadKaleem
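
A minimal sketch of the workaround mentioned above, assuming the Python runtime's ModelRunnerCpp interface; the engine directory and token ids are hypothetical, and keyword names may vary between releases.

```python
# Hedged sketch: cap the paged KV cache at its smallest size instead of
# disabling it, since a true use_cache=False build is not supported yet.
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

runner = ModelRunnerCpp.from_dir(
    engine_dir="./mpt-7b-8k_engine",   # hypothetical engine directory
    max_tokens_in_paged_kv_cache=64,   # smallest cache budget the runtime accepts
)

# Only the first generated token is needed for this use case; in real code the
# input ids (and end_id/pad_id) would come from the model's tokenizer.
outputs = runner.generate(
    batch_input_ids=[torch.tensor([1, 2, 3], dtype=torch.int32)],  # dummy token ids
    max_new_tokens=1,
)
print(outputs.shape)  # [batch, beams, seq_len]
```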

Hi Saad, this feature is a work in progress. I can't see any smooth workaround for disabling the KV cache at the moment. It is targeting the 0.12 release.

joyang-nv avatar Jul 22 '24 02:07 joyang-nv

Is this feature supported in the v0.12.0 release? https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.12.0

Hukongtao avatar Sep 03 '24 05:09 Hukongtao

@Hukongtao Do you still have this question? If not, we will close this issue soon.

hello-11 avatar Nov 14 '24 08:11 hello-11