neural-compressor
How to evaluate AWQ?
https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md#examples
how to set eval_func?
https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only/run_clm_no_trainer.py
It seems there is no AWQ quantization, just RTN and GPTQ. Also, as the readme.md says, weight-only is fake quantization, so why save the qmodel (user_model.save(args.output_dir))?
Hello @chunniunai220ml, thanks for your interest in Intel(R) Neural Compressor. https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md#examples This document describes the 2.x API. The 2.x example link is https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm
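For reference, a minimal sketch of how an eval_func can be passed to the 2.x quantization.fit API for AWQ weight-only quantization (the op_type_dict keys follow the linked document; user_model, calib_dataloader, and evaluate_accuracy are illustrative placeholders, not names from the example):

from neural_compressor import PostTrainingQuantConfig, quantization

# Weight-only AWQ config, as described in quantization_weight_only.md (2.x API).
conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # match all supported ops
            "weight": {
                "bits": 4,
                "group_size": 128,
                "scheme": "asym",
                "algorithm": "AWQ",
            },
        },
    },
)

def eval_func(model):
    # User-defined evaluation; must return a single float (higher is better).
    # evaluate_accuracy is a hypothetical helper you would implement yourself.
    return evaluate_accuracy(model)

q_model = quantization.fit(
    user_model,                         # the FP32 HuggingFace model
    conf,
    calib_dataloader=calib_dataloader,  # calibration data used by AWQ
    eval_func=eval_func,
)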
Thanks for your reply. I followed the 2.x example link; the bash script is as follows:
python -u run_clm_no_trainer.py \
    --model $model_path \
    --dataset ${DATASET_NAME} \
    --approach weight-only \
    --output_dir ${tuned_checkpoint} \
    --quantize \
    --batch_size ${batch_size} \
    --woq_algo AWQ \
    --calib_iters 128 \
    --woq_group_size 128 \
    --woq_bits 4 \
    --tasks hellaswag \
    --accuracy
https://github.com/intel/neural-compressor/blob/master/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm/run_clm_no_trainer.py#L355 — it seems to just evaluate the original model instead of the qmodel.
If I want to evaluate the qmodel, can I just modify #L355 as follows?
q_model.eval()
eval_args = LMEvalParser(
    model="hf",
    user_model=q_model,  # instead of user_model
    tokenizer=tokenizer,
    batch_size=args.batch_size,
    tasks=args.tasks,
)
As the readme.md says, weight-only quantization is based on fake quantization, so why save the qmodel in #L338? I think the qmodel weights' dtype is not INT4 in storage. Also, run_clm_no_trainer.py only supports CPU; where is the multi-GPU support code?
Sure, the q_model needs to be exported as a compressed model: https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md#export-compressed-model
You can refer to https://github.com/intel/intel-extension-for-transformers/tree/v1.5/examples/huggingface/pytorch/text-generation/quantization (v1.5) to quantize an INT4 model; it has this compressed-model export integrated. It also includes GPU scripts.
3.x API: stay tuned.
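For illustration, a rough sketch of that export step with the 2.x API (the available keyword arguments of export_compressed_model are listed in the linked document; this only shows the overall flow):

import torch

# q_model is the object returned by quantization.fit(...).
# export_compressed_model packs the fake-quantized FP32 weights into a real
# low-bit representation; see the linked doc for the supported kwargs.
compressed_model = q_model.export_compressed_model()
torch.save(compressed_model.state_dict(), "compressed_model.pt")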
Does it work well on an NVIDIA V100? The readme.md seems to describe only the Intel GPU installation.
Besides, when running on CPU, it is strange that the process always gets killed for no apparent reason after processing several blocks.
I suggest you try using the 3.x API; there the q_model is already the exported compressed model.
We will soon update the 3.x example, which supports auto-device detection: https://github.com/intel/neural-compressor/tree/kaihui/woq_3x_eg But we haven't tested the performance on NVIDIA GPUs.
On the dev branch: https://github.com/intel/neural-compressor/tree/kaihui/woq_3x_eg/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only
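For orientation, a rough sketch of the AWQ flow in the 3.x torch API, following the prepare/convert pattern used in that example (exact parameter names and the calibration hook may differ on the dev branch, so please check the linked example; calib_func is a user-supplied calibration loop and user_model an already-loaded FP32 model):

import torch
from neural_compressor.torch.quantization import AWQConfig, prepare, convert

quant_config = AWQConfig(bits=4, group_size=128)
example_inputs = torch.ones([1, 512], dtype=torch.long)  # dummy input ids

# Insert calibration hooks, run calibration data through the model, then convert.
user_model = prepare(model=user_model, quant_config=quant_config, example_inputs=example_inputs)
calib_func(user_model)         # run calibration batches through the model
q_model = convert(user_model)  # q_model holds the compressed weights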
I checked out the kaihui/woq_3x_eg branch and ran:
CUDA_VISIBLE_DEVICES="2" python run_clm_no_trainer.py \
    --model $model_path \
    --woq_algo AWQ \
    --woq_bits 4 \
    --woq_group_size 128 \
    --calib_iters 128 \
    --woq_scheme asym \
    --quantize \
    --batch_size 1 \
    --tasks wikitext \
    --accuracy
With AutoModelForCausalLM.from_pretrained(device='cuda'), in neural-compressor/neural_compressor/torch/algorithms/weight_only/awq.py, line 240, in block_calibration:
model(*args, **kwargs) — the inputs' device is CPU, so this error is reported:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
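(A minimal sketch of one way to keep devices consistent during calibration, assuming the model is loaded and quantized on CPU and only moved to CUDA afterwards for evaluation; model_path and quantize_with_awq are placeholders, not names from the example:)

from transformers import AutoModelForCausalLM

# Load on CPU so the model matches the CPU calibration inputs used by AWQ.
user_model = AutoModelForCausalLM.from_pretrained(model_path)    # stays on CPU
user_model = quantize_with_awq(user_model)   # placeholder for the AWQ calibration/quantization step
user_model = user_model.to("cuda")           # move to GPU only for evaluation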
But there is another bug in eval:
from intel_extension_for_transformers.transformers.llm.evaluation.lm_eval import evaluate, LMEvalParser
File "/*/anaconda3/lib/python3.11/site-packages/intel_extension_for_transformers/transformers/__init__.py", line 19, in
Also, how can I load saved_results/quantmodel.pt to evaluate?
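(On the loading question, a sketch of how a weight-only checkpoint saved with user_model.save(args.output_dir) is typically reloaded with the 2.x utilities; the 3.x branch may expose a different load API, so treat this as illustrative, with orig_model being a freshly instantiated FP32 model:)

from neural_compressor.utils.pytorch import load

# Rebuild the quantized model from the saved checkpoint directory.
q_model = load(args.output_dir, orig_model, weight_only=True)
q_model.eval()
# q_model can then be passed to LMEvalParser via user_model=q_model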
Hi @chunniunai220ml, trying an older version such as 2.6 may solve this issue:
ModuleNotFoundError: No module named 'neural_compressor.conf'
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been stalled for 7 days with no activity.