ChatGLM-6B: Why is inference slower after quantizing ChatGLM?


Open harleyszhang opened this issue 2 years ago • 1 comments


Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

Inference is slower after the model is quantized.

Expected Behavior

INT8 quantization should make inference faster.

Steps To Reproduce

Just run the model directly.

Environment

- OS: Ubuntu 20.04
- Python:3.8
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

Apr 20 '23 07:04 harleyszhang

I ran experiments and reached a similar conclusion.

Apr 20 '23 17:04 xinsblog

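> I ran experiments and reached a similar conclusion.

It looks as if the quantization module is not actually taking effect. With the officially provided weights I get the same speed as FP16, but no speedup.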

Apr 21 '23 03:04 harleyszhang

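That is because quantization only quantizes the parameters; the computation is still done in FP16. Doing the computation in INT8 would cause a large loss of accuracy.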
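To illustrate the point above, here is a minimal pure-Python sketch of weight-only quantization. It is not ChatGLM's actual kernel: the function names (`quantize_row`, `linear_weight_only_int8`) and the symmetric per-row scheme are assumptions chosen for illustration. The weights are stored as INT8 codes plus a scale, but every forward call first converts them back to floating point and only then does the multiply-add, so the arithmetic is no cheaper than the unquantized path and carries an extra dequantization step.

```python
def quantize_row(row):
    """Symmetric per-row INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(w / scale) for w in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map INT8 codes back to floating-point weights."""
    return [c * scale for c in codes]


def linear_float(weight, x):
    """Reference matrix-vector product with the original float weights."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def linear_weight_only_int8(qweight, scales, x):
    """Weight-only quantized layer: dequantize first, then compute in float."""
    out = []
    for codes, scale in zip(qweight, scales):
        row = dequantize_row(codes, scale)  # extra work on every forward call
        out.append(sum(w * xi for w, xi in zip(row, x)))  # same float math as before
    return out


# Toy weights and input (made-up values for illustration only).
weight = [[0.12, -0.50, 0.33], [0.90, 0.05, -0.27]]
x = [1.0, 2.0, -1.0]

quantized = [quantize_row(row) for row in weight]
qweight = [codes for codes, _ in quantized]
scales = [scale for _, scale in quantized]

ref = linear_float(weight, x)
approx = linear_weight_only_int8(qweight, scales, x)
max_err = max(abs(a - b) for a, b in zip(ref, approx))
print("float output:      ", ref)
print("weight-only output:", approx)
print("max abs error:     ", max_err)
```

The stored weights shrink to one byte each, which is where the memory saving comes from, while the per-call cost is the original float computation plus the dequantization.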

Apr 21 '23 06:04 duzx16

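> That is because quantization only quantizes the parameters; the computation is still done in FP16. Doing the computation in INT8 would cause a large loss of accuracy.

If I understand correctly, this means the corresponding INT8 quantized inference layers still need to be implemented. The fasterformer framework implements quantized inference layers for many layer types, as shown below.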

Apr 21 '23 07:04 harleyszhang

Is there any conclusion on why performance drops?

May 18 '23 03:05 echoht

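ChatGLM ran into the activation outliers problem during quantization.

So the approach taken by chatglm-int8 is to quantize only the model parameters, while activation values (roughly, the intermediate computations) stay in FP16.

This does save GPU memory, but inference speed decreases.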
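The outlier problem mentioned above can be shown with a small pure-Python sketch. It assumes simple symmetric per-tensor INT8 quantization and uses made-up activation values; it is not how ChatGLM or any particular library quantizes activations. A single large activation stretches the quantization scale, so the ordinary small values are rounded much more coarsely.

```python
def quantize_dequantize_int8(values):
    """Round-trip values through symmetric per-tensor INT8 quantization."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) * scale for v in values]


def mean_abs_error(values):
    """Average absolute error introduced by the INT8 round trip."""
    restored = quantize_dequantize_int8(values)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Hypothetical activations: typical small values, then the same plus one outlier.
normal = [0.11, -0.42, 0.27, 0.93, -0.68, 0.05, -0.14, 0.51]
with_outlier = normal + [60.0]

err_normal = mean_abs_error(normal)
err_outlier = mean_abs_error(with_outlier)

print("mean abs error without outlier:", err_normal)
print("mean abs error with outlier:   ", err_outlier)
```

With the outlier present, the scale is set by the value 60.0, so most of the small activations collapse onto just a few integer levels. This is the kind of accuracy loss that motivates keeping activations in FP16.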

May 22 '23 02:05 bulubulu-Li

Where did the original poster find this figure?

Jun 19 '23 03:06 rxy1212

> Where did the original poster find this figure?

I made it myself.

Jun 19 '23 12:06 harleyszhang

This problem could also be caused by the inference hardware not being optimized for data types such as INT4 and INT8.

Jun 20 '23 02:06 rxy1212

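> ChatGLM ran into the activation outliers problem during quantization.
>
> So the approach taken by chatglm-int8 is to quantize only the model parameters, while activation values (roughly, the intermediate computations) stay in FP16.
>
> This does save GPU memory, but inference speed decreases.

If the intermediate results use FP16 precision, shouldn't inference speed be about the same as with the original FP16 model?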

Jul 17 '23 09:07 mynewstart

https://github.com/TimDettmers/bitsandbytes/issues/6

Sep 04 '23 01:09 datalee
