llm-awq
Bad result when running AWQ without GPU
Hi folks, I ran into a weird issue while reproducing the results from the paper. I can get the results below with a GPU visible, but cannot reproduce them with CPU only. I set the dtype to torch.float to avoid losing precision to float16.
It is not an inference-device issue; the difference comes from the awq_results generated with vs. without a GPU. Is there a workaround to handle it? Any suggestions would be helpful, thanks!
To disable the GPU: `export CUDA_VISIBLE_DEVICES=''`
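For reference, here is a quick check (my own addition, not part of the original repro) to confirm the GPU is actually hidden from torch:

```python
import os

# Hide all CUDA devices before torch initializes its CUDA context.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch

# With no visible devices, torch should fall back to CPU only.
print(torch.cuda.is_available())  # expected: False
```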
| opt-125m | FP32 | group_size | INT4 RTN asym on CPU | AWQ on CPU | AWQ on GPU |
|---|---|---|---|---|---|
| wikitext | 31.95 | G32 | 33.83 | 48.52 | 33.01 |
| wikitext | 31.95 | G128 | 35.96 | 39.53 | 33.96 |
- Perform AWQ search and save the search results (we already did it for you):

```bash
python -m awq.entry --model_path facebook/opt-125m \
    --w_bit 4 --q_group_size 128 \
    --run_awq --dump_awq awq_cache/opt-6.7b-w4-g128.pt
```
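When comparing the CPU- and GPU-generated search results, it can help to inspect the dumped file directly. A minimal sketch, assuming the cache is an ordinary torch checkpoint (the key layout is an assumption on my part, not a documented format):

```python
import torch

# Load the dumped AWQ search results for inspection.
awq_results = torch.load("awq_cache/opt-6.7b-w4-g128.pt", map_location="cpu")
print(type(awq_results))

# If it is a dict (assumed), list its top-level entries so the
# CPU and GPU runs can be diffed entry by entry.
if isinstance(awq_results, dict):
    for key, value in awq_results.items():
        print(key, type(value))
```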
- Evaluate the AWQ-quantized model on WikiText-2 (simulated pseudo-quantization):

```bash
python -m awq.entry --model_path facebook/opt-125m \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/opt-6.7b-w4-g128.pt \
    --q_backend fake
```
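For context, `--q_backend fake` simulates INT4 quantization by quantizing and immediately dequantizing the weights in float. Below is a minimal group-wise asymmetric quantize-dequantize sketch of that idea (my own illustration, not the repo's implementation):

```python
import torch

def pseudo_quantize(w: torch.Tensor, n_bit: int = 4, group_size: int = 128) -> torch.Tensor:
    """Group-wise asymmetric quantize-dequantize (illustration only)."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    w = w.reshape(out_features, in_features // group_size, group_size)
    # Per-group min/max define the asymmetric quantization range.
    w_max = w.amax(dim=-1, keepdim=True)
    w_min = w.amin(dim=-1, keepdim=True)
    q_max = 2 ** n_bit - 1
    scale = (w_max - w_min).clamp(min=1e-5) / q_max
    zero = (-w_min / scale).round()
    # Quantize to integers in [0, q_max], then dequantize back to float.
    w_q = (torch.clamp((w / scale).round() + zero, 0, q_max) - zero) * scale
    return w_q.reshape(out_features, in_features)

# Example: quantize a random weight matrix and measure the error.
w = torch.randn(768, 768)
err = (w - pseudo_quantize(w)).abs().mean()
print(f"mean abs quantization error: {err:.5f}")
```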
Hi @xin3he,
Thank you for bringing up this issue!
In our tests, running the model on a CPU with GPU-generated awq search results gives the expected results. The problem only occurs when using CPU-generated awq search results.
We strongly recommend generating the awq search results on a GPU; running the model on a CPU with those search results afterwards works without issues. Additionally, performing the awq search on a CPU is significantly slower than on a GPU, so we do not advise it.
Thanks again for your interest in our work!
Do you have a guess as to why this happens? Why does the search need to run on a GPU?
> Do you have a guess as to why this happens? Why does the search need to run on a GPU?
It may be due to precision differences between GPU and CPU, or to torch kernels executing differently on the two devices.
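One way to see such divergence (a minimal sketch I put together, not from the thread): even a plain float32 matmul can differ between CPU and GPU because the kernels use different reduction orders, and on Ampere+ GPUs TF32 matmul (whose default on/off state depends on the PyTorch version) can change results further. Small per-op differences like these could plausibly accumulate over the iterative awq scale search.

```python
import torch

# Compare the same float32 matmul on CPU and GPU. Differences come
# from different reduction orders in the kernels and, on Ampere+
# GPUs, possibly from TF32 matmul depending on the PyTorch version.
torch.manual_seed(0)
a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)
cpu_out = a @ b

if torch.cuda.is_available():
    gpu_out = (a.cuda() @ b.cuda()).cpu()
    # Typically a small but nonzero value.
    print("max abs diff:", (cpu_out - gpu_out).abs().max().item())
```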