Add support for optimum-habana DeepSeek V3/R1 FP8 quantization
What does this PR do?
Support FP8 static quantization for optimum-habana DeepSeek V3/R1 models using Intel Neural Compressor (INC). The flow is INC's two-step recipe: a calibration (measurement) run followed by a quantized run; see the sketch after the dependency list below.

This feature requires changes in:
- OH PR https://github.com/huggingface/optimum-habana/pull/1907
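
For orientation, the steps below implement INC's two-phase FP8 flow, selected via the `QUANT_CONFIG` environment variable. This is a sketch only; the full commands appear later in this description:

```bash
# Phase 1: calibration run dumps per-op maxabs statistics
QUANT_CONFIG=quantization_config/maxabs_measure.json python3 run_generation.py ...
# Phase 2: quantized run loads those statistics and executes FP8 ops
QUANT_CONFIG=quantization_config/maxabs_quant_mixtral.json python3 run_generation.py ...
```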
Steps for FP8 quantization
```bash
# Install optimum-habana (OH) from the companion PR branch
git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana
git fetch origin pull/1907/head:deepseek_v3_fp8
git checkout deepseek_v3_fp8
pip install -e .
pip install git+https://github.com/HabanaAI/[email protected]
pip install blobfile tiktoken
```
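
Optionally, a quick sanity check that the PR branch of optimum-habana is the one installed (this check is an addition, not part of the original steps):

```bash
# optional: confirm the editable install is active and importable
pip show optimum-habana   # "Location" should point at the local clone
python -c "import optimum.habana"
```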
```bash
# Install the INC PR with OH deepseek_v3 support
git clone https://github.com/intel/neural-compressor.git
cd neural-compressor
git fetch origin pull/2164/head:oh_ds_r1
git checkout oh_ds_r1
# remove any previously installed INC PT package before the editable install
pip uninstall -y neural_compressor_pt
pip install -r requirements.txt
pip install -r requirements_pt.txt
python setup.py develop pt
```
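
A similar hedged check for the INC side (this assumes INC 3.x exposes `FP8Config` under `neural_compressor.torch.quantization`, which is the entry point this flow relies on):

```bash
# optional: confirm INC and its PyTorch FP8 API import cleanly
python -c "import neural_compressor; print(neural_compressor.__version__)"
python -c "from neural_compressor.torch.quantization import FP8Config"
```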
```bash
# Test FP8 quantization with the Moonlight model on 2 cards with expert parallelism
cd ../optimum-habana/examples/text-generation/
# Step 1: calibration (measurement) run
PT_HPU_LAZY_MODE=1 INC_DYNAMIC_MOE_EXPERTS=64 \
QUANT_CONFIG=quantization_config/maxabs_measure.json \
python3 ../gaudi_spawn.py --world_size 2 run_generation.py \
    --model_name_or_path moonshotai/Moonlight-16B-A3B \
    --bf16 --trim_logits --batch_size 1 \
    --use_hpu_graphs --use_kv_cache \
    --prompt "DeepSpeed is a machine learning framework" \
    --parallel_strategy "ep" --trust_remote_code_tokenizer
```
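
The measurement config ships with the optimum-habana text-generation example; the shape below is reproduced from memory and is illustrative rather than authoritative. It configures maxabs observers in MEASURE mode and a dump path for the collected statistics:

```bash
# Illustrative shape of quantization_config/maxabs_measure.json (the real file ships with the example):
# {
#   "method": "HOOKS",
#   "mode": "MEASURE",
#   "observer": "maxabs",
#   "allowlist": {"types": [], "names": []},
#   "blocklist": {"types": [], "names": []},
#   "dump_stats_path": "./hqt_output/measure"
# }
# After the calibration run, the dumped statistics should appear under the dump path:
ls ./hqt_output/
```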
```bash
# Step 2: quantized run
# Note: the FP8 dynamic MoE op segfaults if SLICE_MAX_EXPERT > 32
SLICE_MAX_EXPERT=32 INC_DYNAMIC_MOE_EXPERTS=64 PT_HPU_LAZY_MODE=1 \
QUANT_CONFIG=quantization_config/maxabs_quant_mixtral.json \
python3 ../gaudi_spawn.py --world_size 2 run_generation.py \
    --model_name_or_path moonshotai/Moonlight-16B-A3B \
    --bf16 --trim_logits --batch_size 1 \
    --use_hpu_graphs --use_kv_cache \
    --prompt "DeepSpeed is a machine learning framework" \
    --parallel_strategy "ep" --trust_remote_code_tokenizer
```
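
The quantization config is the QUANTIZE-mode counterpart of the measurement config. The shape below is again illustrative; the field values, and especially the blocklisted module names, are assumptions, so consult the actual file in the example's quantization_config directory:

```bash
# Illustrative shape of quantization_config/maxabs_quant_mixtral.json (values are assumptions):
# {
#   "method": "HOOKS",
#   "mode": "QUANTIZE",
#   "observer": "maxabs",
#   "scale_method": "maxabs_hw",
#   "blocklist": {"types": [], "names": ["lm_head"]},
#   "dump_stats_path": "./hqt_output/measure"
# }
```

Keeping `dump_stats_path` identical in both configs is what lets the quantized run find the statistics produced by the calibration run.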