Recent Qwen1.5-14B-Chat-GPTQ-Int4 quantization details
You updated the weights of Qwen1.5-14B-Chat-GPTQ-Int4 and changed intermediate_size from 14436 to 14336 about 12 days ago.
It seems the Int4 version is not quantized directly from Qwen1.5-14B-Chat, since the intermediate_size of Qwen1.5-14B-Chat is 13696.
Can you explain what you have optimized? Why is the intermediate_size of the GPTQ-Int4 version not 13696? And how can we apply the quantization method you used?
https://huggingface.co/Qwen/Qwen1.5-14B-Chat-GPTQ-Int4/tree/main
There are 3 consecutive commits, and they address an issue with tensor parallelism in inference. The intermediate size was changed from 13696 to 14336 so that the MLP layers can be dispatched to two or more devices.
The resulting model files should be instantly usable. May I ask what problems you have encountered?
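Roughly speaking, a change like this can be made by zero-padding the MLP intermediate dimension before quantization, so that each tensor-parallel shard receives a well-aligned slice of the quantized matrices. The sketch below is only an illustration under assumed Qwen2-style parameter names (gate_proj/up_proj/down_proj, no MLP bias); it is not the exact conversion script used for this repository.

```python
import torch

def pad_intermediate(state_dict, old_size=13696, new_size=14336):
    """Zero-pad the MLP intermediate dimension of a Qwen2-style SwiGLU block.

    Extra gate/up rows produce zero activations (silu(0) * 0 == 0) and the
    extra down_proj columns only ever multiply those zeros, so the padded
    model computes exactly the same outputs as before.
    """
    pad = new_size - old_size
    for name, w in list(state_dict.items()):
        if name.endswith(("gate_proj.weight", "up_proj.weight")):
            # weight shape [intermediate, hidden]: append zero rows
            state_dict[name] = torch.cat([w, w.new_zeros(pad, w.shape[1])], dim=0)
        elif name.endswith("down_proj.weight"):
            # weight shape [hidden, intermediate]: append zero columns
            state_dict[name] = torch.cat([w, w.new_zeros(w.shape[0], pad)], dim=1)
    return state_dict
```

The config's intermediate_size then has to be updated to the padded value before running GPTQ quantization on the padded bf16 weights.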
Thanks for your quick reply. We encountered the following problems:
We trained qwen1.5-14b-base with pre-training, SFT, and DPO in bf16, and then quantized it to an Int4 model with auto-gptq. The inference engine is vllm.
We get correct results from the bf16 model. But when running inference with our trained 14b-gptq-int4 model, NaN may occur for prompts where the output probability is very high, and the output becomes a long run of `!`.
If you have anything to share that could help fix this problem, we would greatly appreciate it.
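For reference, our quantization step was essentially the standard auto-gptq flow; below is a minimal sketch with placeholder paths and example config values, not our exact script:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_dir = "/path/to/bf16-model"   # placeholder
quant_dir = "/path/to/int4-model"  # placeholder

# example config; the actual bits/group_size settings may differ
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
tokenizer = AutoTokenizer.from_pretrained(base_dir)
model = AutoGPTQForCausalLM.from_pretrained(base_dir, quantize_config)

# a real run uses a few hundred in-domain calibration samples, not one string
examples = [tokenizer("placeholder calibration text", return_tensors="pt")]
model.quantize(examples)
model.save_quantized(quant_dir, use_safetensors=True)
tokenizer.save_pretrained(quant_dir)
```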
If I understand correctly, the intermediate size in your model is still 13696: there is no need to change it if tensor parallelism is not needed.
`!` is token id 0, which is what you get when all the outputs are meaningless, and it strongly suggests either broken model checkpoints or incompatible CUDA kernels in vllm. Please check whether running with auto-gptq leads to the same problem, to rule out the first. For the latter, what version of vllm are you using? v0.3.3 is known to be good.
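For example, a quick check along these lines (model path and prompt are placeholders) loads the same checkpoint through transformers + auto-gptq without any vLLM kernels; if it generates sensible text while vLLM prints `!`, the checkpoint itself is likely fine:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/your-14b-gptq-int4"  # placeholder
prompt = "1 + 1 ="                          # placeholder

tok = AutoTokenizer.from_pretrained(model_path)
# the GPTQ checkpoint is handled via auto-gptq/optimum; no vLLM kernels involved
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```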
@jklj077
- Yes. 13696 is fine for us, since we do not need tensor parallelism.
- The output is not entirely meaningless: it may begin with some meaningful tokens, and `!` mainly follows digits like 1 or 2.
- auto-gptq may not be the problem, because the results are correct when we run inference with transformers instead of vllm (see the vLLM loading sketch after this message).
- Versions:
  - vllm: v0.3.3, v0.4.0, and v0.4.0.post1 all have the problem. We used the pre-built wheels for Python 3.8 and CUDA 11.8 from the repo.
  - auto-gptq: v0.7.1
  - cuda: v11.8
  - torch: 2.1.2
Please also see this comment: https://github.com/QwenLM/Qwen1.5/issues/271#issuecomment-2066214326
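For completeness, the vLLM side of our comparison is just the stock GPTQ loading path, roughly as below (placeholder path and prompt; this is where we see the runs of `!`):

```python
from vllm import LLM, SamplingParams

model_path = "/path/to/your-14b-gptq-int4"  # placeholder, same checkpoint as above
llm = LLM(model=model_path, quantization="gptq", dtype="float16")

params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["1 + 1 ="], params)
print(outputs[0].outputs[0].text)  # degenerates into runs of "!" for some prompts
```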