
Latency Improvement

Open kyrie2to11 opened this issue 1 year ago • 1 comment

Question

Hi there, thanks for your great work!

I'm a beginner in quantization. I ran the example usage scripts on llama-2-7b following the README.md and noticed that evaluation with fake quantization finishes in 00:40, while evaluation of the real quantized model takes 01:21. I'm curious what might be causing this, or whether this is normal; intuitively, I expected the real quantized version to be faster. The scripts and logs are below. Any hints would be appreciated. Thanks!
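To check my understanding of --q_backend fake: as far as I can tell, the weights are quantized and immediately dequantized back to fp16, so the forward pass still runs through ordinary fp16 GEMM kernels; only the weight values change, not the kernels. Roughly like the sketch below (my own code, not necessarily the repo's exact implementation; it assumes w.numel() is divisible by group_size):

import torch

def pseudo_quantize(w: torch.Tensor, n_bit: int = 4, group_size: int = 128) -> torch.Tensor:
    # Quantize-then-dequantize per group with a zero point; the result is an
    # fp16 tensor again, so evaluation uses the regular fp16 matmul path.
    orig_shape = w.shape
    w = w.reshape(-1, group_size)
    max_int = 2 ** n_bit - 1
    w_max = w.amax(dim=1, keepdim=True)
    w_min = w.amin(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / max_int
    zero = (-w_min / scale).round()
    w_q = torch.clamp((w / scale).round() + zero, 0, max_int)  # 4-bit integer grid
    return ((w_q - zero) * scale).reshape(orig_shape)          # back to fp16 values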

test scripts

# evaluate the AWQ quantized model (simulated pseudo quantization)
$env/python -m awq.entry --model_path /mnt/afs/user/liyawei/projects/llama/$MODEL \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/$MODEL-w4-g128.pt \
    --q_backend fake

# load and evaluate the real quantized model (smaller gpu memory usage)
$env/python -m awq.entry --model_path /mnt/afs/user/jarvis/projects/llama/$MODEL \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_quant quant_cache/$MODEL-w4-g128-awq.pt
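In case it helps with debugging, this is the kind of micro-benchmark I could run to compare a single fp16 nn.Linear against its real-quantized W4 counterpart in isolation (a helper of my own, not part of awq.entry):

import torch

@torch.no_grad()
def time_layer(layer: torch.nn.Module, x: torch.Tensor, iters: int = 100) -> float:
    # Average forward latency in ms, measured with CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(10):  # warmup so one-time setup doesn't skew the timing
        layer(x)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        layer(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters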

output logs

pt-7eelyeph-worker-0 logs: PATH = /mnt/afs/user/jarvis/.scc/bin:/usr/local/lib/miniconda3/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
pt-7eelyeph-worker-0 logs: LD_LIBRARY_PATH = /usr/local/cuda-11.8/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
pt-7eelyeph-worker-0 logs: dir = /mnt/afs/user/jarvis/projects/llm-awq_test
pt-7eelyeph-worker-0 logs: env = /mnt/afs/user/jarvis/.conda/envs/lion_awq/bin
pt-7eelyeph-worker-0 logs: [2024-01-12 15:41:30,358] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
pt-7eelyeph-worker-0 logs: Quantization config: {'zero_point': True, 'q_group_size': 128}
pt-7eelyeph-worker-0 logs: * Building model /mnt/afs/user/jarvis/projects/llama/llama-2-7b-hf
Loading checkpoint shards: 100%|██████████| 3/3 [00:37<00:00, 12.41s/it]
pt-7eelyeph-worker-0 logs: Loading pre-computed AWQ results from awq_cache/llama-2-7b-hf-w4-g128.pt
pseudo weight quantization...: 100%|██████████| 32/32 [00:04<00:00,  6.96it/s]
Downloading readme: 100%|██████████| 10.5k/10.5k [00:00<00:00, 52.3MB/s]
Downloading data: 100%|██████████| 733k/733k [00:00<00:00, 830kB/s]
Downloading data: 100%|██████████| 6.36M/6.36M [00:02<00:00, 2.77MB/s]
Downloading data: 100%|██████████| 657k/657k [00:00<00:00, 950kB/s]
Downloading data files: 100%|██████████| 3/3 [00:03<00:00,  1.29s/it]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 3112.27it/s]
Generating test split: 100%|██████████| 4358/4358 [00:00<00:00, 132771.44 examples/s]
Generating train split: 100%|██████████| 36718/36718 [00:00<00:00, 1145600.07 examples/s]
Generating validation split: 100%|██████████| 3760/3760 [00:00<00:00, 1030487.65 examples/s]
evaluating...: 100%|██████████| 166/166 [00:40<00:00,  4.08it/s]
pt-7eelyeph-worker-0 logs: Current GPU memory usage: 12.84 GB
pt-7eelyeph-worker-0 logs: Peak GPU memory usage: 14.91 GB
pt-7eelyeph-worker-0 logs: 5.610528945922852
pt-7eelyeph-worker-0 logs: [2024-01-12 15:43:41,536] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
pt-7eelyeph-worker-0 logs: Quantization config: {'zero_point': True, 'q_group_size': 128}
pt-7eelyeph-worker-0 logs: * Building model /mnt/afs/user/liyawei/projects/llama/llama-2-7b-hf
pt-7eelyeph-worker-0 logs: Loading pre-computed quantized weights...
real weight quantization...(init only): 100%|██████████| 32/32 [00:00<00:00, 996.29it/s]
evaluating...: 100%|██████████| 166/166 [01:21<00:00,  2.03it/s]
pt-7eelyeph-worker-0 logs: Current GPU memory usage: 3.96 GB
pt-7eelyeph-worker-0 logs: Peak GPU memory usage: 6.03 GB
pt-7eelyeph-worker-0 logs: 5.610483646392822
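For reference, I assume the two memory lines above come from the standard PyTorch counters, something like the following; this is my guess, I haven't verified it against the entry script:

import torch

print(f"Current GPU memory usage: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"Peak GPU memory usage: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")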

kyrie2to11 · Jan 12 '24 08:01