
Bad result when running AWQ without GPU

Open xin3he opened this issue 2 years ago • 4 comments

Hi folks, I ran into a weird issue while reproducing the results shown in the paper. I can get the results below with a GPU visible, but I cannot reproduce them with CPU only. I set the dtype to torch.float to avoid losing precision from float16.

It's not an inference-device issue; the difference comes from the awq_results generated with vs. without a GPU. Is there any workaround to handle it? Any suggestions would be helpful, thanks!

To disable the GPU: `export CUDA_VISIBLE_DEVICES=''`


| opt-125m | FP32 | group_size | INT4 RTN asym on CPU | AWQ on CPU | AWQ on GPU |
|----------|------|------------|----------------------|------------|------------|
| wikitext | 31.95 | G32  | 33.83 | 48.52 | 33.01 |
| wikitext | 31.95 | G128 | 35.96 | 39.53 | 33.96 |

xin3he avatar Jul 04 '23 10:07 xin3he

1. Perform AWQ search and save search results (we already did it for you):

   ```shell
   python -m awq.entry --model_path facebook/opt-125m \
       --w_bit 4 --q_group_size 128 \
       --run_awq --dump_awq awq_cache/opt-6.7b-w4-g128.pt
   ```

2. Evaluate the AWQ quantized model on WikiText-2 (simulated pseudo-quantization):

   ```shell
   python -m awq.entry --model_path facebook/opt-125m \
       --tasks wikitext \
       --w_bit 4 --q_group_size 128 \
       --load_awq awq_cache/opt-6.7b-w4-g128.pt \
       --q_backend fake
   ```

xin3he avatar Jul 05 '23 01:07 xin3he

Hi @xin3he,

Thank you for bringing up this issue!

After testing, we found that running the model on a CPU with GPU-generated AWQ search results produces the expected results. The problem only occurs when using CPU-generated AWQ search results.

We strongly recommend generating the AWQ search results on a GPU; running the model on a CPU with those results afterwards won't cause any issues. Additionally, performing the AWQ search on a CPU is significantly slower than on a GPU, so we do not advise it.

Thanks again for your interest in our work!

Sakits avatar Jul 06 '23 19:07 Sakits

Do you have a guess as to why this happens? Why does the search need to run on a GPU?

ofirzaf avatar Jul 26 '23 08:07 ofirzaf

> Do you have a guess as to why this happens? Why does the search need to run on a GPU?

It may be due to precision differences between the GPU and CPU, or to differences in how torch executes the same operations on each device.

casper-hansen avatar Jul 27 '23 11:07 casper-hansen