
Quantization failed


System Info

The provided examples do not work correctly. I think there have been updates to the Intel Neural Compressor toolkit (now at 3.0) and to the Habana quantization toolkit, and the documentation is out of date. I will look into fixing this on my own in the meantime.

I ran the Neural Compressor toolkit 2.4.1 and got some config files from it. I have not grokked the entire Habana stack yet and am just working my way through the different packages to get an idea of how they all fit together as a unified whole.

https://github.com/endomorphosis/optimum-habana/tree/main/examples/text-generation

root@c6a6613a6f4c:~/optimum-habana/examples/text-generation#   USE_INC=0  QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python run_generation.py --model_name_or_path meta-llama/Meta-Llama-3.1-70B-Instruct --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 2048 --bf16 --batch_size 1 --disk_offload --use_flash_attention --flash_attention_recompute

/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
08/11/2024 03:47:15 - INFO - __main__ - Single-device run.
08/11/2024 03:47:32 - WARNING - accelerate.big_modeling - Some parameters are on the meta device device because they were offloaded to the cpu and disk.
QUANT PACKAGE: Loading ./quantization_config/maxabs_quant.json
HQT Git revision =  16.0.526

HQT Configuration =  Fp8cfg(cfg={'dump_stats_path': './hqt_output/measure', 'fp8_config': torch.float8_e4m3fn, 'hp_dtype': torch.bfloat16, 'blocklist': {'names': [], 'types': []}, 'allowlist': {'names': [], 'types': []}, 'mode': <QuantMode.QUANTIZE: 1>, 'scale_method': <ScaleMethod.MAXABS_HW: 4>, 'scale_params': {}, 'observer': 'maxabs', 'mod_dict': {'Matmul': 'matmul', 'Linear': 'linear', 'FalconLinear': 'linear', 'KVCache': 'kv_cache', 'Conv2d': 'linear', 'LoRACompatibleLinear': 'linear', 'LoRACompatibleConv': 'linear', 'Softmax': 'softmax', 'ModuleFusedSDPA': 'fused_sdpa', 'LinearLayer': 'linear', 'LinearAllreduce': 'linear', 'ScopedLinearAllReduce': 'linear', 'LmHeadLinearAllreduce': 'linear'}, 'local_rank': None, 'global_rank': None, 'world_size': 1, 'seperate_measure_files': True, 'verbose': False, 'device_type': 4, 'measure_exclude': <MeasureExclude.OUTPUT: 4>, 'method': 'HOOKS', 'dump_stats_base_path': './hqt_output/', 'shape_file': './hqt_output/measure_hooks_shape', 'scale_file': './hqt_output/measure_hooks_maxabs_MAXABS_HW', 'measure_file': './hqt_output/measure_hooks_maxabs'})

Total modules : 961
Traceback (most recent call last):
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 692, in <module>
    main()
  File "/root/optimum-habana/examples/text-generation/run_generation.py", line 337, in main
    model, assistant_model, tokenizer, generation_config = initialize_model(args, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 633, in initialize_model
    setup_model(args, model_dtype, model_kwargs, logger)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 265, in setup_model
    model = setup_quantization(model, args)
  File "/root/optimum-habana/examples/text-generation/utils.py", line 206, in setup_quantization
    habana_quantization_toolkit.prep_model(model)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/prepare_quant/prepare_model.py", line 34, in prep_model
    return _prep_model_with_predefined_config(model, config=config)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/prepare_quant/prepare_model.py", line 14, in _prep_model_with_predefined_config
    prepare_model(model)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_core/__init__.py", line 57, in prepare_model
    return quantize(model, mod_list)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_core/quantize.py", line 62, in quantize
    measurement=load_measurements(model, config.cfg['measure_file'])
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_core/measure.py", line 136, in load_measurements
    d = load_file(fname_np, np.ndarray, fail_on_file_not_exist=config['scale_method'] not in [ScaleMethod.WITHOUT_SCALE, ScaleMethod.UNIT_SCALE])
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_core/common.py", line 106, in load_file
    raise FileNotFoundError(f"Failed to load file {fname}")
FileNotFoundError: Failed to load file ./hqt_output/measure_hooks_maxabs.npz
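
For reference, the error above says ./hqt_output/measure_hooks_maxabs.npz does not exist; per the Fp8cfg dump in the log, QUANTIZE mode expects measurement files produced by a prior calibration run. A minimal sketch of that calibration pass, assuming the repository still ships a measurement config at ./quantization_config/maxabs_measure.json (the flags below are a subset of the failing command, and exact calibration settings may need adjustment):

# 1) Calibration pass: should write measurement files (including measure_hooks_maxabs.npz) under ./hqt_output/
QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py \
  --model_name_or_path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --use_hpu_graphs --use_kv_cache --bf16 --batch_size 1 \
  --max_input_tokens 128 --max_new_tokens 32
# 2) Quantization pass: re-run the original command with maxabs_quant.json once those files exist
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py ... (same flags as the failing command)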

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python run_generation.py --model_name_or_path meta-llama/Meta-Llama-3.1-70B-Instruct --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 2048 --bf16 --batch_size 1 --disk_offload --use_flash_attention --flash_attention_recompute
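
For context, the Fp8cfg dump in the log suggests ./quantization_config/maxabs_quant.json is roughly equivalent to the JSON below; the measurement counterpart (presumably maxabs_measure.json) should differ mainly in its "mode" field ("MEASURE" instead of "QUANTIZE"). This is a sketch reconstructed from the logged configuration, not the exact file contents, which may vary between releases.

{
  "method": "HOOKS",
  "mode": "QUANTIZE",
  "observer": "maxabs",
  "scale_method": "maxabs_hw",
  "dump_stats_path": "./hqt_output/measure"
}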

Expected behavior

Being able to run the quantized Llama 3.1 70B Instruct model.
