
Recent Qwen1.5-14B-Chat-GPTQ-Int4 quantization details

Huarong opened this issue 10 months ago · 5 comments

You updated the weights of Qwen1.5-14B-Chat-GPTQ-Int4 about 12 days ago, changing intermediate_size from 13696 to 14336.

It seems the Int4 version was not quantized directly from Qwen1.5-14B-Chat, since the intermediate_size of Qwen1.5-14B-Chat is 13696.

Could you explain what you optimized? Why is the intermediate_size of the GPTQ-Int4 version not 13696? And how can we apply the same quantization method?

https://huggingface.co/Qwen/Qwen1.5-14B-Chat-GPTQ-Int4/tree/main

Huarong · Apr 07 '24 02:04

There are 3 consecutive commits, and they address an issue with tensor parallelism at inference time. The intermediate size was changed from 13696 to 14336 so that the MLP layers can be sharded across two or more devices.

The resulting model files should be directly usable. May I ask what problems you have encountered?
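A possible rationale for padding 13696 to exactly 14336 (my own inference, not confirmed by the maintainers): with GPTQ group size 128, each tensor-parallel shard must hold a whole number of quantization groups, and 13696 / 2 = 6848 is not a multiple of 128, whereas 14336 = 8 × 128 × 14 works for any tensor-parallel degree up to 8. A minimal sketch of that rounding, assuming group_size=128 and a maximum of 8 shards:

```python
import math

def padded_intermediate_size(size: int, group_size: int = 128, max_tp: int = 8) -> int:
    # Round up so every tensor-parallel shard (up to max_tp of them)
    # contains a whole number of GPTQ quantization groups.
    multiple = max_tp * group_size
    return math.ceil(size / multiple) * multiple

print(padded_intermediate_size(13696))  # 14336
```

Under this assumption the extra 640 columns would simply be zero-padding, which leaves the MLP output unchanged.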

jklj077 · Apr 08 '24 04:04

> There are 3 consecutive commits, and they address an issue with tensor parallelism at inference time. The intermediate size was changed from 13696 to 14336 so that the MLP layers can be sharded across two or more devices.
>
> The resulting model files should be directly usable. May I ask what problems you have encountered?

Thanks for your quick reply. We encountered the following problems:

We trained qwen1.5-14b-base with continued pre-training, SFT, and DPO in bf16, and then quantized it to Int4 with auto-gptq. The inference engine is vLLM.
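For reference, quantizing with auto-gptq writes a quantize_config.json next to the weights; an illustrative example of its fields (these values are typical defaults, not necessarily the ones used in this pipeline):

```json
{
  "bits": 4,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": false,
  "sym": true,
  "true_sequential": true
}
```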

We get correct results from the bf16 model. But when running inference on our trained 14B-GPTQ-Int4 model, NaN may occur on prompts where the output probability is very high, and the output becomes a long run of `!` characters.

If you have anything to share that could help fix this, we would greatly appreciate it.

Huarong · Apr 08 '24 11:04

If I understand correctly, the intermediate size in your model is still 13696: there is no need to change it if tensor parallelism is not needed.

`!` is token ID 0, which is what you get when all of the outputs are meaningless, and it strongly suggests either broken model checkpoints or incompatible CUDA kernels in vLLM. Please check whether running with auto-gptq leads to the same problem, to rule out the former. For the latter, which versions of vLLM are you using? v0.3.3 is known to be good.
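Why meaningless output decodes to `!`: if a logits row is degenerate (for example, NaNs flushed to a constant), every token ties and argmax falls back to the first index, 0, which decodes to `!` in the Qwen tokenizer. A small pure-Python illustration; the vocabulary size here is an assumption:

```python
# Degenerate logits: every token ties, so argmax returns the first index.
vocab_size = 152064  # assumed Qwen1.5-14B vocab size
logits = [0.0] * vocab_size
token_id = max(range(vocab_size), key=logits.__getitem__)
print(token_id)  # 0 -> decodes to "!" per the maintainer's comment
```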

jklj077 · Apr 08 '24 12:04

> If I understand correctly, the intermediate size in your model is still 13696: there is no need to change it if tensor parallelism is not needed.
>
> `!` is token ID 0, which is what you get when all of the outputs are meaningless, and it strongly suggests either broken model checkpoints or incompatible CUDA kernels in vLLM. Please check whether running with auto-gptq leads to the same problem, to rule out the former. For the latter, which versions of vLLM are you using? v0.3.3 is known to be good.

@jklj077

  1. Yes. 13696 is OK for us, since we do not need tensor parallelism.
  2. The output is not entirely meaningless: it may begin with some meaningful tokens, and the `!` runs mainly follow digits like 1 or 2.
  3. auto-gptq may not be the problem, since the results are correct if we run inference with transformers instead of vLLM.
  4. Versions:
  • vllm: v0.3.3, v0.4.0, and v0.4.0.post1 all have the problem. We used the pre-built wheels for Python 3.8 and CUDA 11.8 from the repo.
  • auto-gptq: v0.7.1
  • cuda: v11.8
  • torch: 2.1.2

Huarong · Apr 09 '24 06:04