Qwen
[BUG] Issues with int8-kvcache in two scenarios: ① multi-GPU inference ② long input / short output
Is there an existing issue / discussion for this?
- [X] I have searched the existing issues / discussions
Is there an existing answer for this in FAQ?
- [X] I have searched FAQ
Current Behavior
Hello. Following the repo README, we disabled flash_attention, enabled the quantized kvcache, and verified with the 1.8B model.
Model loading
The model is loaded as follows. The run log shows that building the cpp extension module cache_autogptq_cuda_256 succeeded, and the experimental results confirm the quantized kvcache takes effect, but there are some other issues.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "hf_model/qwen/Qwen-1_8B-Chat",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    use_flash_attn=False,
    use_cache_quantization=True,
    use_cache_kernel=True,
)
Experimental Results
We recorded GPU memory usage with and without int8 kvcache under identical experimental settings; results follow.
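For reference, the P95 / average latency figures below can be collected with a timing harness along these lines (a minimal sketch; the callable passed to `benchmark` is a stand-in for the actual `model.generate` call, which our real benchmark script wraps but is not shown here):

```python
import time
import statistics

def benchmark(fn, runs=10):
    """Time fn() over several runs; report P95 / mean / stdev in ms."""
    latencies = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()  # in the real experiment: a closure around model.generate(...)
        latencies.append((time.perf_counter() - t0) * 1000.0)
    latencies.sort()
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    return p95, statistics.mean(latencies), statistics.stdev(latencies)

# Dummy workload for illustration only
p95, mean, sd = benchmark(lambda: sum(range(100_000)), runs=20)
print(f"P95 latency (ms) - {p95}; Average latency (ms) - {mean:.2f} +\\- {sd:.2f}")
```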
(1) Single GPU, normal case: input 1 token, output 2048 tokens
Observations:
- With int8 kvcache enabled, the memory consumed during inference is roughly halved (578 vs. 1192 MiB above the post-load baseline)
- Inference speed dropped to 58.4% of the baseline (= 44138.43399472535 / 75601.29672177136)
# GPU memory after model load, before any inference
[0] NVIDIA A100-SXM4-40GB | 32°C, 0 % | 4007 / 40536 MB |
# without int8 kvcache
[0] NVIDIA A100-SXM4-40GB | 35°C, 37 % | 5199 / 40536 MB |
P95 latency (ms) - 44138.43399472535; Average latency (ms) - 44071.29 +\- 55.72
# with int8 kvcache
[0] NVIDIA A100-SXM4-40GB | 36°C, 72 % | 4585 / 40536 MB |
P95 latency (ms) - 75601.29672177136; Average latency (ms) - 75521.62 +\- 65.99
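A back-of-envelope estimate makes the roughly 2x saving in case (1) plausible. Assuming Qwen-1.8B's published config (24 layers, hidden size 2048; these numbers are an assumption here, taken from the public config), the bf16 KV cache for a 2048-token sequence is about 384 MiB versus ~192 MiB at int8 (plus quantization scales); the observed deltas are larger because they also include activations and allocator overhead:

```python
def kv_cache_bytes(seq_len, n_layers=24, hidden=2048, bytes_per_elem=2):
    """K and V tensors: 2 * layers * seq_len * hidden * element size."""
    return 2 * n_layers * seq_len * hidden * bytes_per_elem

bf16_mib = kv_cache_bytes(2048) / 2**20                     # 384.0 MiB
int8_mib = kv_cache_bytes(2048, bytes_per_elem=1) / 2**20   # 192.0 MiB
print(bf16_mib, int8_mib)
```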
(2) Single GPU, abnormal case: input 10000 tokens, output 1 token
Observations:
- With int8 kvcache enabled, memory usage is higher than before (14081 > 12717 MiB); we suspect kvcache quantization does not take effect for the prompt (prefill) tokens
- Inference speed dropped to 2.8% of the baseline; we suspect extra compute introduced by the kvcache quantization logic at the first-token (prefill) stage
# GPU memory after model load, before any inference
[0] NVIDIA A100-SXM4-40GB | 32°C, 0 % | 4007 / 40536 MB |
# without int8 kvcache
[0] NVIDIA A100-SXM4-40GB | 33°C, 0 % | 12717 / 40536 MB |
P95 latency (ms) - 502.7519695460796; Average latency (ms) - 501.93 +\- 0.79
# with int8 kvcache
[0] NVIDIA A100-SXM4-40GB | 38°C, 100 % | 14081 / 40536 MB |
P95 latency (ms) - 17820.346892252564; Average latency (ms) - 17811.62 +\- 7.26
(3) Multi-GPU, abnormal case: input 1 token, output 2048 tokens
This experiment uses a custom device_map to spread the 1.8B model across 8 x A100 (40G), in order to observe int8-kvcache behavior in a multi-GPU setting.
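Our actual device_map is not reproduced here; a sketch of how such a mapping could be built (hypothetical layout, assuming Qwen's `transformer.h.<i>` block naming from the public model code) would be:

```python
def make_device_map(n_layers=24, n_gpus=8):
    """Spread transformer blocks evenly across GPUs: embeddings on GPU 0,
    final norm and lm_head on the last GPU (hypothetical layout)."""
    per_gpu = n_layers // n_gpus  # 3 blocks per GPU for 24 layers on 8 GPUs
    device_map = {"transformer.wte": 0}
    for i in range(n_layers):
        device_map[f"transformer.h.{i}"] = min(i // per_gpu, n_gpus - 1)
    device_map["transformer.ln_f"] = n_gpus - 1
    device_map["lm_head"] = n_gpus - 1
    return device_map

# Then: AutoModelForCausalLM.from_pretrained(..., device_map=make_device_map())
```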
Observations:
- With int8 kvcache enabled, GPUs 0-3 show some memory reduction, though nowhere near half; GPUs 4-7 show very large memory usage, as if the entire kvcache had been replicated four times, one copy on each of GPUs 4-7
- Inference speed dropped to 59.2% of the baseline
# GPU memory after model load, before any inference
[0] NVIDIA A100-SXM4-40GB | 32°C, 0 % | 1213 / 40536 MB |
[1] NVIDIA A100-SXM4-40GB | 31°C, 0 % | 729 / 40536 MB |
[2] NVIDIA A100-SXM4-40GB | 31°C, 0 % | 729 / 40536 MB |
[3] NVIDIA A100-SXM4-40GB | 31°C, 0 % | 729 / 40536 MB |
[4] NVIDIA A100-SXM4-40GB | 34°C, 0 % | 729 / 40536 MB |
[5] NVIDIA A100-SXM4-40GB | 35°C, 0 % | 729 / 40536 MB |
[6] NVIDIA A100-SXM4-40GB | 34°C, 0 % | 729 / 40536 MB |
[7] NVIDIA A100-SXM4-40GB | 35°C, 0 % | 1413 / 40536 MB |
# without int8 kvcache
[0] NVIDIA A100-SXM4-40GB | 34°C, 5 % | 1437 / 40536 MB |
[1] NVIDIA A100-SXM4-40GB | 33°C, 4 % | 973 / 40536 MB |
[2] NVIDIA A100-SXM4-40GB | 32°C, 4 % | 973 / 40536 MB |
[3] NVIDIA A100-SXM4-40GB | 32°C, 4 % | 973 / 40536 MB |
[4] NVIDIA A100-SXM4-40GB | 36°C, 4 % | 973 / 40536 MB |
[5] NVIDIA A100-SXM4-40GB | 37°C, 4 % | 973 / 40536 MB |
[6] NVIDIA A100-SXM4-40GB | 36°C, 4 % | 973 / 40536 MB |
[7] NVIDIA A100-SXM4-40GB | 37°C, 7 % | 1701 / 40536 MB |
P95 latency (ms) - 56682.5378138572; Average latency (ms) - 56608.21 +\- 80.61
# with int8 kvcache
[0] NVIDIA A100-SXM4-40GB | 32°C, 0 % | 1381 / 40536 MB |
[1] NVIDIA A100-SXM4-40GB | 31°C, 0 % | 897 / 40536 MB |
[2] NVIDIA A100-SXM4-40GB | 31°C, 0 % | 897 / 40536 MB |
[3] NVIDIA A100-SXM4-40GB | 31°C, 0 % | 897 / 40536 MB |
[4] NVIDIA A100-SXM4-40GB | 35°C, 10 % | 6529 / 40536 MB |
[5] NVIDIA A100-SXM4-40GB | 35°C, 6 % | 6561 / 40536 MB |
[6] NVIDIA A100-SXM4-40GB | 35°C, 1 % | 6561 / 40536 MB |
[7] NVIDIA A100-SXM4-40GB | 36°C, 0 % | 7289 / 40536 MB |
P95 latency (ms) - 95769.74254138768; Average latency (ms) - 95504.70 +\- 219.48
(4) Multi-GPU, abnormal case: input 10k/20k tokens, output 1 token
Observations:
- 10k case: with int8 kvcache, memory usage is almost 3x the baseline (= 7719 / 2505 MiB)
- 20k case: it crashes with a CUDA error
# (1) input 10k, output 1 -- without int8 kvcache
[0] NVIDIA A100-SXM4-40GB | 32°C, 0 % | 5697 / 40536 MB |
[1] NVIDIA A100-SXM4-40GB | 31°C, 0 % | 2505 / 40536 MB |
[2] NVIDIA A100-SXM4-40GB | 31°C, 0 % | 2505 / 40536 MB |
[3] NVIDIA A100-SXM4-40GB | 31°C, 0 % | 2505 / 40536 MB |
[4] NVIDIA A100-SXM4-40GB | 35°C, 0 % | 2505 / 40536 MB |
[5] NVIDIA A100-SXM4-40GB | 35°C, 0 % | 2505 / 40536 MB |
[6] NVIDIA A100-SXM4-40GB | 35°C, 0 % | 2505 / 40536 MB |
[7] NVIDIA A100-SXM4-40GB | 36°C, 0 % | 6259 / 40536 MB |
P95 latency (ms) - 632.1799475699663; Average latency (ms) - 618.25 +\- 11.57
# with int8 kvcache
[0] NVIDIA A100-SXM4-40GB | 32°C, 0 % | 8203 / 40536 MB |
[1] NVIDIA A100-SXM4-40GB | 32°C, 0 % | 7719 / 40536 MB |
[2] NVIDIA A100-SXM4-40GB | 32°C, 0 % | 7719 / 40536 MB |
[3] NVIDIA A100-SXM4-40GB | 31°C, 0 % | 7719 / 40536 MB |
[4] NVIDIA A100-SXM4-40GB | 35°C, 0 % | 7719 / 40536 MB |
[5] NVIDIA A100-SXM4-40GB | 36°C, 0 % | 7719 / 40536 MB |
[6] NVIDIA A100-SXM4-40GB | 36°C, 0 % | 7719 / 40536 MB |
[7] NVIDIA A100-SXM4-40GB | 36°C, 0 % | 8385 / 40536 MB |
P95 latency (ms) - 18021.790966019034; Average latency (ms) - 18015.14 +\- 7.24
# (2) input 20k, output 1 -- without int8 kvcache
[0] NVIDIA A100-SXM4-40GB | 32°C, 0 % | 11493 / 40536 MB |
[1] NVIDIA A100-SXM4-40GB | 32°C, 0 % | 5191 / 40536 MB |
[2] NVIDIA A100-SXM4-40GB | 32°C, 0 % | 5191 / 40536 MB |
[3] NVIDIA A100-SXM4-40GB | 32°C, 0 % | 5191 / 40536 MB |
[4] NVIDIA A100-SXM4-40GB | 35°C, 0 % | 5191 / 40536 MB |
[5] NVIDIA A100-SXM4-40GB | 36°C, 0 % | 5191 / 40536 MB |
[6] NVIDIA A100-SXM4-40GB | 36°C, 0 % | 5191 / 40536 MB |
[7] NVIDIA A100-SXM4-40GB | 36°C, 0 % | 12055 / 40536 MB |
P95 latency (ms) - 1745.8682876080275; Average latency (ms) - 1631.48 +\- 95.03
# with int8 kvcache
[0] NVIDIA A100-SXM4-40GB | 32°C, 0 % | 27677 / 40536 MB |
...
[7] NVIDIA A100-SXM4-40GB | 36°C, 0 % | 1489 / 40536 MB |
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Expected Behavior
- Multi-GPU int8 kvcache should allocate memory correctly, enabling longer input/output combinations, e.g. 10k/2k
- Single-GPU int8 kvcache should also show memory savings for long-input / short-output combinations (e.g. 10k/1)
- Since int8 kvcache reduces memory traffic, we would hope for a win on both inference speed and memory, or at least no slowdown
Steps To Reproduce
No response
Environment
- OS: Ubuntu 22.04.2
- Python: 3.10.6
- Transformers: 4.31.0
- PyTorch: 2.1.0
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1
Anything else?
No response
Regarding the KV cache part, @ZhangJianwei0311 please take a look and leave comments.
This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.