Add support for optimum-habana DeepSeek V3/R1 FP8 quantization
What does this PR do?
Support FP8 static quantization for optimum-habana DeepSeek V3/R1 models using Intel Neural Compressor (INC). The flow is INC's two-step recipe: a calibration (measurement) run followed by a quantized run; see the sketch after the dependency list below.

This feature requires changes in:
- OH PR https://github.com/huggingface/optimum-habana/pull/1907
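
For orientation, the steps below implement INC's two-phase FP8 flow, selected via the `QUANT_CONFIG` environment variable. This is a sketch only; the full commands appear later in this description:

```bash
# Phase 1: calibration run dumps per-op maxabs statistics
QUANT_CONFIG=quantization_config/maxabs_measure.json python3 run_generation.py ...
# Phase 2: quantized run loads those statistics and executes FP8 ops
QUANT_CONFIG=quantization_config/maxabs_quant_mixtral.json python3 run_generation.py ...
```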
Steps for FP8 quantization
```bash
# Install optimum-habana (OH) from the companion PR branch
git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana
git fetch origin pull/1907/head:deepseek_v3_fp8
git checkout deepseek_v3_fp8
pip install -e .
pip install git+https://github.com/HabanaAI/[email protected]
pip install blobfile tiktoken
```
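
Optionally, a quick sanity check that the PR branch of optimum-habana is the one installed (this check is an addition, not part of the original steps):

```bash
# optional: confirm the editable install is active and importable
pip show optimum-habana   # "Location" should point at the local clone
python -c "import optimum.habana"
```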
```bash
# Install the INC PR with OH deepseek_v3 support
git clone https://github.com/intel/neural-compressor.git
cd neural-compressor
git fetch origin pull/2164/head:oh_ds_r1
git checkout oh_ds_r1
# remove any previously installed INC PT package before the editable install
pip uninstall -y neural_compressor_pt
pip install -r requirements.txt
pip install -r requirements_pt.txt
python setup.py develop pt
```
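
A similar hedged check for the INC side (this assumes INC 3.x exposes `FP8Config` under `neural_compressor.torch.quantization`, which is the entry point this flow relies on):

```bash
# optional: confirm INC and its PyTorch FP8 API import cleanly
python -c "import neural_compressor; print(neural_compressor.__version__)"
python -c "from neural_compressor.torch.quantization import FP8Config"
```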
```bash
# Test FP8 quantization with the Moonlight model on 2 cards with expert parallelism
cd ../optimum-habana/examples/text-generation/
# Step 1: calibration (measurement) run
PT_HPU_LAZY_MODE=1 INC_DYNAMIC_MOE_EXPERTS=64 \
QUANT_CONFIG=quantization_config/maxabs_measure.json \
python3 ../gaudi_spawn.py --world_size 2 run_generation.py \
    --model_name_or_path moonshotai/Moonlight-16B-A3B \
    --bf16 --trim_logits --batch_size 1 \
    --use_hpu_graphs --use_kv_cache \
    --prompt "DeepSpeed is a machine learning framework" \
    --parallel_strategy "ep" --trust_remote_code_tokenizer
```
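
The measurement config ships with the optimum-habana text-generation example; the shape below is reproduced from memory and is illustrative rather than authoritative. It configures maxabs observers in MEASURE mode and a dump path for the collected statistics:

```bash
# Illustrative shape of quantization_config/maxabs_measure.json (the real file ships with the example):
# {
#   "method": "HOOKS",
#   "mode": "MEASURE",
#   "observer": "maxabs",
#   "allowlist": {"types": [], "names": []},
#   "blocklist": {"types": [], "names": []},
#   "dump_stats_path": "./hqt_output/measure"
# }
# After the calibration run, the dumped statistics should appear under the dump path:
ls ./hqt_output/
```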
```bash
# Step 2: quantized run
# Note: the FP8 dynamic MoE op segfaults if SLICE_MAX_EXPERT > 32
SLICE_MAX_EXPERT=32 INC_DYNAMIC_MOE_EXPERTS=64 PT_HPU_LAZY_MODE=1 \
QUANT_CONFIG=quantization_config/maxabs_quant_mixtral.json \
python3 ../gaudi_spawn.py --world_size 2 run_generation.py \
    --model_name_or_path moonshotai/Moonlight-16B-A3B \
    --bf16 --trim_logits --batch_size 1 \
    --use_hpu_graphs --use_kv_cache \
    --prompt "DeepSpeed is a machine learning framework" \
    --parallel_strategy "ep" --trust_remote_code_tokenizer
```
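
The quantization config is the QUANTIZE-mode counterpart of the measurement config. The shape below is again illustrative; the field values, and especially the blocklisted module names, are assumptions, so consult the actual file in the example's quantization_config directory:

```bash
# Illustrative shape of quantization_config/maxabs_quant_mixtral.json (values are assumptions):
# {
#   "method": "HOOKS",
#   "mode": "QUANTIZE",
#   "observer": "maxabs",
#   "scale_method": "maxabs_hw",
#   "blocklist": {"types": [], "names": ["lm_head"]},
#   "dump_stats_path": "./hqt_output/measure"
# }
```

Keeping `dump_stats_path` identical in both configs is what lets the quantized run find the statistics produced by the calibration run.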