
Support use_cache=False

Open Hukongtao opened this issue 1 year ago • 8 comments

Background: In my case, I only need the LLM to generate a single token. Setting use_cache=True therefore gives no speed benefit, but it noticeably increases memory usage. I tried setting use_cache=False here: https://github.com/NVIDIA/TensorRT-LLM/blob/850b6fa1e710d25769f2b560d897d2bd424a645e/tensorrt_llm/builder.py#L681 but the code reports an error. Do you have any plans to support use_cache=False?

Hukongtao avatar Apr 08 '24 13:04 Hukongtao
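
For context, a rough sketch of the build flow the reporter is describing. The model class, checkpoint path, and BuildConfig fields are assumptions about the TensorRT-LLM Python API around that commit and may differ between releases; the point is only that the KV cache is wired into the engine at build time, and there is currently no supported flag to leave it out.

```python
# Hedged sketch only -- model class, checkpoint path, and config fields are
# illustrative assumptions, not a verified recipe for any specific release.
from tensorrt_llm import BuildConfig, build
from tensorrt_llm.models import QWenForCausalLM  # the reporter's model family

model = QWenForCausalLM.from_checkpoint("./qwen_trtllm_ckpt")  # hypothetical path
config = BuildConfig(
    max_batch_size=1,
    max_input_len=2048,  # prompt length budget
)

# builder.py prepares the network inputs with use_cache=True (the line linked
# above); flipping that to use_cache=False is what this issue asks to support,
# and doing so by hand currently fails during engine build.
engine = build(model, config)
engine.save("./qwen_engine")
```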

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar May 16 '24 01:05 github-actions[bot]

@QiJune @byshiue Is there any development plan for this issue?

Hukongtao avatar May 16 '24 11:05 Hukongtao

Hi @Hukongtao, I am working on this. Please let me know your model so I can verify against it.

joyang-nv avatar Jun 11 '24 09:06 joyang-nv

qwen

Hukongtao avatar Jun 13 '24 07:06 Hukongtao

Got it!

joyang-nv avatar Jun 13 '24 07:06 joyang-nv

Hello, does Qwen2 support use_cache=False yet?

Hukongtao avatar Jul 10 '24 13:07 Hukongtao

Hi @Hukongtao, we are actively working on this, but it won't make the next release because its impact is larger than we expected. Please expect it to be officially included in the 0.12 release.

joyang-nv avatar Jul 11 '24 07:07 joyang-nv

Looking forward to your release.

Hukongtao avatar Jul 11 '24 09:07 Hukongtao

Hi, I'm also looking to disable the KV cache completely, as my use case only requires generating the first token.

The only workaround so far has been to set max_tokens_in_paged_kv_cache to the minimum possible value (64) at inference time.

Is there a ballpark figure for how much this could improve TTFT (time to first token)? I'm using mosaicml/mpt-7b-8k - please do verify against it!

SaadKaleem avatar Jul 21 '24 22:07 SaadKaleem
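
A minimal sketch of the workaround mentioned above, assuming the Python runtime's ModelRunnerCpp interface; the engine directory and token ids are hypothetical, and keyword names may vary between releases.

```python
# Hedged sketch: cap the paged KV cache at its smallest size instead of
# disabling it, since a true use_cache=False build is not supported yet.
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

runner = ModelRunnerCpp.from_dir(
    engine_dir="./mpt-7b-8k_engine",   # hypothetical engine directory
    max_tokens_in_paged_kv_cache=64,   # smallest cache budget the runtime accepts
)

# Only the first generated token is needed for this use case; in real code the
# input ids (and end_id/pad_id) would come from the model's tokenizer.
outputs = runner.generate(
    batch_input_ids=[torch.tensor([1, 2, 3], dtype=torch.int32)],  # dummy token ids
    max_new_tokens=1,
)
print(outputs.shape)  # [batch, beams, seq_len]
```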

Hi Saad, this feature is a work in progress. I can't see any smooth workaround for disabling the KV cache at the moment. It is targeting the 0.12 release.

joyang-nv avatar Jul 22 '24 02:07 joyang-nv

Is this feature supported in the v0.12.0 release? https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.12.0

Hukongtao avatar Sep 03 '24 05:09 Hukongtao

@Hukongtao Do you still have this question? If not, we will close this issue soon.

hello-11 avatar Nov 14 '24 08:11 hello-11