Jinyan Chen

Results: 13 comments by Jinyan Chen

Breaking PR https://github.com/huggingface/optimum-habana/pull/836 into smaller pieces, based on PR https://github.com/huggingface/optimum-habana/pull/901.

> @jychen-habana, please test rope_scaling with Mixtral and update the results here. **Run with rope_scaling (add below to config.json):** `"rope_scaling": {"type":"linear","factor":2.0},` **Test case: --max_input_tokens 32000 --bucket_size 1024 --max_new_tokens 512...
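A minimal sketch of that test case, assuming only the flags quoted above and the Mixtral checkpoint used elsewhere in this thread; the rope_scaling entry goes into the checkpoint's config.json first, and any options elided in the quoted command are omitted here.

```
# Sketch only: first add the quoted entry to the local Mixtral config.json:
#   "rope_scaling": {"type": "linear", "factor": 2.0},
# then run the test case (flags beyond those quoted above are left out):
python run_generation.py \
    --model_name_or_path mistralai/Mixtral-8x7B-v0.1 \
    --use_hpu_graphs \
    --use_kv_cache \
    --max_input_tokens 32000 \
    --bucket_size 1024 \
    --max_new_tokens 512 \
    --bf16
```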

@regisss @libinta @mandy-li, please help review and merge this PR. Thanks!

Breaking PR https://github.com/huggingface/optimum-habana/pull/836 into smaller pieces, based on PR https://github.com/huggingface/optimum-habana/pull/898.

**Add test case input_32000 output_512**

**Command (with --limit_hpu_graphs, --reuse_cache, --bucket_internal, --bucket_size 256, --max_new_tokens 512):**

```
QUANT_CONFIG=./quantization_config/maxabs_quant_mixtral.json python run_generation.py --model_name_or_path mistralai/Mixtral-8x7B-v0.1 --use_hpu_graphs --limit_hpu_graphs --use_kv_cache --reuse_cache --bucket_internal --bucket_size 256 --max_new_tokens 512 --bf16...
```

**Add test case input_32000 output_700**

**Command (with --limit_hpu_graphs, --reuse_cache, --bucket_internal, --bucket_size 256, --max_new_tokens 700):**

```
QUANT_CONFIG=./quantization_config/maxabs_quant_mixtral.json python run_generation.py --model_name_or_path mistralai/Mixtral-8x7B-v0.1 --use_hpu_graphs --limit_hpu_graphs --use_kv_cache --reuse_cache --bucket_internal --bucket_size 256 --max_new_tokens 700 --bf16...
```
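For context, QUANT_CONFIG points run_generation.py at an Intel Neural Compressor quantization config. A rough sketch of what maxabs_quant_mixtral.json could contain, modeled on the generic maxabs_quant examples shipped with optimum-habana; the Mixtral-specific contents are an assumption:

```
# Assumption: fields follow the generic maxabs_quant example in
# optimum-habana's text-generation folder; the Mixtral file may differ.
cat > ./quantization_config/maxabs_quant_mixtral.json <<'EOF'
{
    "method": "HOOKS",
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "dump_stats_path": "./hqt_output/measure"
}
EOF
```

In the usual FP8 flow, a measurement run with a corresponding maxabs_measure config is done first, so the statistics referenced by dump_stats_path exist before quantized inference.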

close

> @jychen-habana Is this PR different from #903?

#903 is the latest; I will close this PR.

> I tested this PR with run_generation.py in the 1.16.0 docker. It could fit 30k input tokens, but the generated output was empty. Did you check the output? > > input...

> @jychen-habana, as we synced offline: > > 1. kv_cache_fp8 is the previous way to support fp8 inference, which will be removed soon. All the models' fp8 inference should...