
Quantization for FSDPA

Open · dudilester opened this issue 1 year ago · 1 comment

- Added `use_flash_attention`, `flash_attention_causal_mask` and `flash_attention_recompute` to `run_lm_eval` (see the example invocation below)
- Enforce the recompute flag on fsdpa quantization
- Allow quantization using HQT
- Document FusedScaledDotProductAttention quantization
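As a reference, a minimal sketch of how the new flags could be passed to the `run_lm_eval` script. The three flash-attention flags come from this PR; the remaining arguments (model name, precision, output file) follow the usual `examples/text-generation` conventions and are illustrative placeholders that may differ by version:

```bash
# Hypothetical invocation; only the three flash-attention flags are
# taken from this PR, everything else is a placeholder.
python run_lm_eval.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --bf16 \
  --use_flash_attention \
  --flash_attention_causal_mask \
  --flash_attention_recompute \
  -o results.json
```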

dudilester · May 13 '24 10:05

Added a commit documenting the fsdpa quantization changes. This PR includes the commits from https://github.com/huggingface/optimum-habana/pull/967 plus the doc commit. @libinta - the PR should be labeled synapse_1.16_dependency.

dudilester · May 13 '24 10:05

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Should the regression tests used for Llama fp8 be updated? Here and there, for instance?

@regisss I see that SDPA is not tested in bf16 either; it can be added. Can you or @libinta take care of it?

MrGeva · May 30 '24 14:05