MeJerry215

Results: 10 comments by MeJerry215

Additional note: for this model, the shapes of both inputs are fixed after two consecutive overwrite-input-shape runs, but onnxsim still has to be run several more times before no further optimization is possible. As it stands, with the current onnxsim the GPT-2 model needs two overwrite-input-shape runs plus six normal simplification runs before the final ONNX graph stops changing. ![image](https://user-images.githubusercontent.com/53092165/210720284-cbc4ec62-c819-4d55-bd99-305e2d40fb65.png) @daquexian
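For reference, a minimal sketch of what I mean by re-running onnxsim until the graph stops changing, assuming a recent onnxsim whose Python `simplify()` accepts `overwrite_input_shapes` (the file name, input names, and shapes below are placeholders):

```python
import onnx
from onnxsim import simplify

model = onnx.load("gpt2.onnx")  # placeholder path

# The "two overwrite-input-shape runs": pin both dynamic input shapes up front.
model, ok = simplify(
    model,
    overwrite_input_shapes={"input_ids": [1, 128], "attention_mask": [1, 128]},
)
assert ok, "simplified model failed onnxsim's check"

# The "six normal sim runs": keep simplifying until the graph reaches a fixed point.
prev = model.SerializeToString()
for _ in range(10):  # upper bound; six passes were enough in practice
    model, ok = simplify(model)
    assert ok
    cur = model.SerializeToString()
    if cur == prev:
        break
    prev = cur

onnx.save(model, "gpt2_sim.onnx")
```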

@AniZpZ SmoothQuant only achieves high accuracy with **per-token dynamic** quantization for activations and per-channel quantization for weights. ![image](https://github.com/vllm-project/vllm/assets/53092165/1c382852-8432-425f-9581-7f50bcd01c77)
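To make the granularity concrete, here is a minimal int8 sketch in plain PyTorch (not tied to any particular SmoothQuant implementation): per-token dynamic scales for activations versus per-channel static scales for weights:

```python
import torch

def quantize_activation_per_token(x: torch.Tensor):
    # x: (num_tokens, hidden). One scale per token, computed at runtime
    # from the current batch ("dynamic" quantization).
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def quantize_weight_per_channel(w: torch.Tensor):
    # w: (out_features, in_features). One scale per output channel,
    # computed once offline ("static" quantization).
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale
```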

@AniZpZ The root cause is that the activations are still too large. You can dump the down_proj's input tensor to confirm. If you disable quantization of the LLaMA model's down_proj, accuracy improves...
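A minimal sketch of how one could dump those activations with a forward pre-hook, assuming a HuggingFace-style LLaMA module layout (`model.model.layers[i].mlp.down_proj`); adjust the paths for your checkpoint:

```python
import torch

def track_down_proj_max(model):
    # Record the max |activation| flowing into each down_proj
    # during a calibration forward pass.
    stats = {}
    def make_hook(name):
        def hook(module, inputs):
            x = inputs[0].detach()
            stats[name] = max(stats.get(name, 0.0), x.abs().max().item())
        return hook
    for i, layer in enumerate(model.model.layers):
        layer.mlp.down_proj.register_forward_pre_hook(
            make_hook(f"layers.{i}.mlp.down_proj")
        )
    return stats
```

Running a few calibration batches and printing `stats` should show the outlier layers directly.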

> I have conducted more experiments that achieve the same results as in the paper.
>
> There is only one problem: per-channel weight quantization is not compatible with the...

So where is the inference code? It's been about two months.

When I export en2fr, inference throws an exception while loading the weights with h5py. If I comment out the layernorm parameters...
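A minimal debugging sketch for this kind of failure, assuming the exported weights are an HDF5 file (`model.h5` is a placeholder): list every dataset so you can check whether the layernorm parameters are present under the names the loader expects:

```python
import h5py

with h5py.File("model.h5", "r") as f:
    def show(name, obj):
        # Print each dataset's full path, shape, and dtype.
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(show)
```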

Hey man, I don't know why the weights he provides have to be indexed with a -1 offset when used. Judging from the export script, his embedding table is one size larger. After I fixed the logic at that spot, something else still errors out, and I don't know why... ![image](https://user-images.githubusercontent.com/53092165/205231267-76fcdf46-fc59-4df3-a577-d4d870ec4778.png) The one below. So the PB exported from his own BERT weights still won't run... I'm quite lost. ![image](https://user-images.githubusercontent.com/53092165/205231322-e4375e12-de2c-483c-b9d2-9a72034242ae.png)
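Purely to illustrate the off-by-one being described (the names and shapes are hypothetical, not from the actual export script): the exported embedding table has one extra row, so checkpoint token ids only line up if you subtract 1 at lookup time:

```python
import numpy as np

vocab_size, hidden = 30522, 768           # illustrative BERT-base sizes
emb = np.zeros((vocab_size + 1, hidden))  # exported table: one row too many

def lookup(token_ids):
    # the "-1" workaround mentioned above
    return emb[np.asarray(token_ids) - 1]
```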

Can you use SmoothQuant to quantize LLaMA without an accuracy drop? I tried to quantize LLaMA-7B, but the accuracy also drops a lot. @fmo-mt ![image](https://github.com/mit-han-lab/smoothquant/assets/53092165/3f540fac-a117-44ab-8ca1-f868ac8b38c7)

> > Can you use SmoothQuant to quantize LLaMA without an accuracy drop? I tried to quantize LLaMA-7B, but the accuracy also drops a lot. @fmo-mt
> >
> > ![image](https://user-images.githubusercontent.com/53092165/247493622-3f540fac-a117-44ab-8ca1-f868ac8b38c7.png)
>
> ...