Edoardo Cetin

7 comments by Edoardo Cetin

A minimal example of this erroneous behavior can be reproduced via:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='cuda', torch_dtype=torch.bfloat16...
```
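(The comment above is truncated by the results page. A complete reproduction along these lines might look like the sketch below; the forward call with `output_attentions=True` and the test prompt are assumptions based on the surrounding discussion, not the original script.)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="cuda", torch_dtype=torch.bfloat16
)

# Assumed remainder of the truncated snippet: run a forward pass that
# requests attention weights, the setting where the masking bug surfaces.
inputs = tokenizer("Hello world", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)
print(outputs.attentions[0].shape)
```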

> Great catch.
>
> 1. `causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)` line 1127 needs to be ignored as well.
> 2. We need to add your small example script as a...
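For context, the kind of guard being discussed could look roughly like the sketch below. `AttentionMaskConverter._unmask_unattended` is the helper quoted above from `transformers.modeling_attn_mask_utils`; the surrounding condition is an illustrative assumption, not the exact merged diff.

```python
from transformers.modeling_attn_mask_utils import AttentionMaskConverter

# Sketch: skip the SDPA-specific unmasking when attentions are requested,
# since output_attentions=True falls back to the eager attention path,
# which expects the original causal mask.
if (
    self.config._attn_implementation == "sdpa"
    and not output_attentions
):
    causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
```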

> Feel free to rebase it might be fixed on main / be flaky

Just did :)

@ArthurZucker Let me know if you think this fix is ready for merging, or if you'd like me to add the tests to the same PR!

> Would be nice to just add the test in this PR 😉

Alright - I added `output_attentions=True` to the SDPA equivalence test, as you suggested ;)...
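An equivalence check of this shape is sketched below: run the same inputs through an SDPA and an eager copy of the model with `output_attentions=True` and compare the outputs. The model choice and tolerances are assumptions, not the exact test added in the PR.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # any SDPA-capable checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Hello world", return_tensors="pt")

model_sdpa = AutoModelForCausalLM.from_pretrained(
    model_name, attn_implementation="sdpa", torch_dtype=torch.bfloat16
)
model_eager = AutoModelForCausalLM.from_pretrained(
    model_name, attn_implementation="eager", torch_dtype=torch.bfloat16
)

with torch.no_grad():
    out_sdpa = model_sdpa(**inputs, output_attentions=True)
    out_eager = model_eager(**inputs, output_attentions=True)

# With output_attentions=True the SDPA model falls back to eager attention,
# so both logits and attention weights should match the eager model closely.
torch.testing.assert_close(out_sdpa.logits, out_eager.logits, rtol=1e-3, atol=1e-3)
for a_sdpa, a_eager in zip(out_sdpa.attentions, out_eager.attentions):
    torch.testing.assert_close(a_sdpa, a_eager, rtol=1e-3, atol=1e-3)
```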

@ArthurZucker thanks for your suggestions! I also propagated the same changes to the new jetmoe model. All default checks are now passing ^^

@jdvin thanks for your patch! +1, this would be a great feature to merge! Perhaps @liangfu could help review these?