optimum-habana
Quantization failed
System Info
The examples provided do not work correctly. I think there have been updates in the Intel Neural Compressor toolkit (now 3.0) and in the Habana quantization toolkit, and the documentation is out of date. I will look into fixing this on my own in the meantime.
I did run Neural Compressor 2.4.1 and got some config files from it. I have not grokked the entire Habana stack yet and am just working my way through the different packages to get an idea of how it all fits together as a unified whole.
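For reference, this is roughly how I am checking and pinning the Neural Compressor version while I experiment. This is only a sketch: the pip package name is `neural-compressor`, and pinning it back may conflict with the wheels shipped in the Habana image.

```bash
# Check which Intel Neural Compressor version is currently installed.
pip show neural-compressor

# Sketch only: pin back to the 2.4.x line I generated the config files with.
# This may conflict with other packages preinstalled in the Habana image.
pip install "neural-compressor==2.4.1"
```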
https://github.com/endomorphosis/optimum-habana/tree/main/examples/text-generation
root@c6a6613a6f4c:~/optimum-habana/examples/text-generation# USE_INC=0 QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python run_generation.py --model_name_or_path meta-llama/Meta-Llama-3.1-70B-Instruct --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 2048 --bf16 --batch_size 1 --disk_offload --use_flash_attention --flash_attention_recompute
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
08/11/2024 03:47:15 - INFO - __main__ - Single-device run.
08/11/2024 03:47:32 - WARNING - accelerate.big_modeling - Some parameters are on the meta device device because they were offloaded to the cpu and disk.
QUANT PACKAGE: Loading ./quantization_config/maxabs_quant.json
HQT Git revision = 16.0.526
HQT Configuration = Fp8cfg(cfg={'dump_stats_path': './hqt_output/measure', 'fp8_config': torch.float8_e4m3fn, 'hp_dtype': torch.bfloat16, 'blocklist': {'names': [], 'types': []}, 'allowlist': {'names': [], 'types': []}, 'mode': <QuantMode.QUANTIZE: 1>, 'scale_method': <ScaleMethod.MAXABS_HW: 4>, 'scale_params': {}, 'observer': 'maxabs', 'mod_dict': {'Matmul': 'matmul', 'Linear': 'linear', 'FalconLinear': 'linear', 'KVCache': 'kv_cache', 'Conv2d': 'linear', 'LoRACompatibleLinear': 'linear', 'LoRACompatibleConv': 'linear', 'Softmax': 'softmax', 'ModuleFusedSDPA': 'fused_sdpa', 'LinearLayer': 'linear', 'LinearAllreduce': 'linear', 'ScopedLinearAllReduce': 'linear', 'LmHeadLinearAllreduce': 'linear'}, 'local_rank': None, 'global_rank': None, 'world_size': 1, 'seperate_measure_files': True, 'verbose': False, 'device_type': 4, 'measure_exclude': <MeasureExclude.OUTPUT: 4>, 'method': 'HOOKS', 'dump_stats_base_path': './hqt_output/', 'shape_file': './hqt_output/measure_hooks_shape', 'scale_file': './hqt_output/measure_hooks_maxabs_MAXABS_HW', 'measure_file': './hqt_output/measure_hooks_maxabs'})
Total modules : 961
Traceback (most recent call last):
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 692, in <module>
    main()
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 337, in main
    model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 633, in initialize_model
    setup_model(args, model_dtype, model_kwargs, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 265, in setup_model
    model = setup_quantization(model, args)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 206, in setup_quantization
    habana_quantization_toolkit.prep_model(model)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/prepare_quant/prepare_model.py", line 34, in prep_model
    return _prep_model_with_predefined_config(model, config=config)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/prepare_quant/prepare_model.py", line 14, in _prep_model_with_predefined_config
    prepare_model(model)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_core/__init__.py", line 57, in prepare_model
    return quantize(model, mod_list)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_core/quantize.py", line 62, in quantize
    measurement=load_measurements(model, config.cfg['measure_file'])
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_core/measure.py", line 136, in load_measurements
    d = load_file(fname_np, np.ndarray, fail_on_file_not_exist=config['scale_method'] not in [ScaleMethod.WITHOUT_SCALE, ScaleMethod.UNIT_SCALE])
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_core/common.py", line 106, in load_file
    raise FileNotFoundError(f"Failed to load file {fname}")
FileNotFoundError: Failed to load file ./hqt_output/measure_hooks_maxabs.npz
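The traceback suggests the quantization pass expects calibration statistics under ./hqt_output that I never generated. My understanding (which may be as out of date as the docs) is that a measurement run with a measure-mode config has to happen first, so the toolkit dumps measure_hooks_maxabs.npz and the related shape/scale files, and only then can the maxabs_quant.json run load them. Something along these lines, assuming quantization_config/maxabs_measure.json is still shipped with the example (I have not confirmed the file name against the current tree):

```bash
# Step 1 (assumed): measurement/calibration pass that dumps the stats under ./hqt_output.
USE_INC=0 QUANT_CONFIG=./quantization_config/maxabs_measure.json TQDM_DISABLE=1 \
    python run_generation.py \
    --model_name_or_path meta-llama/Meta-Llama-3.1-70B-Instruct \
    --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache \
    --limit_hpu_graphs --bucket_size=128 --bucket_internal \
    --max_new_tokens 2048 --max_input_tokens 2048 --bf16 --batch_size 1 \
    --disk_offload --use_flash_attention --flash_attention_recompute

# Step 2: the original quantized run, which should now find the dumped measurements.
USE_INC=0 QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 \
    python run_generation.py \
    --model_name_or_path meta-llama/Meta-Llama-3.1-70B-Instruct \
    --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache \
    --limit_hpu_graphs --bucket_size=128 --bucket_internal \
    --max_new_tokens 2048 --max_input_tokens 2048 --bf16 --batch_size 1 \
    --disk_offload --use_flash_attention --flash_attention_recompute
```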
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python run_generation.py --model_name_or_path meta-llama/Meta-Llama-3.1-70B-Instruct --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 2048 --bf16 --batch_size 1 --disk_offload --use_flash_attention --flash_attention_recompute
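A sanity check I ran before filing this, using the file names from the Fp8cfg dump above, confirms that the measurement dump the quant config points at was never produced:

```bash
# The quant config expects measurement dumps under ./hqt_output
# (see 'measure_file' in the Fp8cfg dump). On my machine the directory
# does not contain measure_hooks_maxabs.npz, which matches the traceback.
ls -l ./hqt_output/
```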
Expected behavior
I am trying to run the quantized Llama 3.1 70B Instruct model; the example should load the FP8 quantization config and generate text instead of failing with the FileNotFoundError above.