GPTQ-for-LLaMa
wbits=16 Conversion Gives Error
When I try to run the quantization pipeline for 16-bit precision:

```
CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 16 --true-sequential --act-order --save llama7b-16bit.pt
```

it raises a NameError saying that `quantizers` is not defined:
```
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [01:23<00:00, 41.70s/it]
Found cached dataset json (/home/sawradip/.cache/huggingface/datasets/allenai___json/allenai--c4-6fbe877195f42de5/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Found cached dataset json (/home/sawradip/.cache/huggingface/datasets/allenai___json/allenai--c4-efc3d4f4606f44bd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Traceback (most recent call last):
  File "/mnt/c/Users/Sawradip/Desktop/practice_code/practice_llm/GPTQ-for-LLaMa/llama.py", line 480, in <module>
    llama_pack(model, quantizers, args.wbits, args.groupsize)
NameError: name 'quantizers' is not defined
```
The llama.py file only defines `quantizers` for wbits < 16:
```python
if not args.load and args.wbits < 16 and not args.nearest:
    tick = time.time()
    quantizers = llama_sequential(model, dataloader, DEV)
    print(time.time() - tick)
```
which is expected, because quantizers are not needed for 16-bit. But I think this error should be handled more gracefully, since wbits=16 is already an accepted value.
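For example, the final pack/save step could reuse the same wbits < 16 guard. This is only a minimal sketch: the `llama_pack` call and its arguments come from the traceback above, but the surrounding `args.save` / `torch.save` handling is an assumption and may not match the actual llama.py code.

```python
# Sketch only: skip packing when wbits == 16, since no quantizers exist then.
if args.save:
    if args.wbits < 16:
        # Pack the quantized layers (call taken from the traceback above).
        llama_pack(model, quantizers, args.wbits, args.groupsize)
    # Saving the state dict still works for the unquantized 16-bit model
    # (assumed save path; the real script may save differently).
    torch.save(model.state_dict(), args.save)
```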
What's your take on that? @qwopqwop200
Isn't 16-bit the original model precision? I guess there is no need to pass the wbits argument for the original model.
Write a Python script to convert from FP32 to FP16 instead; don't use GPTQ.
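For instance, something along these lines, using the Hugging Face transformers API (a hedged sketch; the output paths and filename below are placeholders, not part of this repo):

```python
# Sketch: convert a HF-format LLaMA checkpoint to FP16 without GPTQ.
# Paths/filenames are placeholders, not from the original issue.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "./llama-hf/llama-7b"       # source checkpoint (as in the command above)
dst = "./llama-hf/llama-7b-fp16"  # output directory (placeholder)

# Loading with torch_dtype=torch.float16 casts the weights to half precision.
model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(src)

model.save_pretrained(dst)
tokenizer.save_pretrained(dst)

# Alternatively, if a single .pt file is wanted, the half-precision
# state dict can be saved directly:
torch.save(model.state_dict(), "llama7b-16bit.pt")
```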