Arup De
Results
We investigated a critical compatibility issue where `flash_attention_2` does not support the gpt-oss attention sinks, causing gradient-norm spikes during training. The existing VERL codebase lacked the ability to override the `attn_implementation`...
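A minimal sketch of what such an override might look like when loading the model through Hugging Face `transformers`; the model id and the choice of fallback backend here are illustrative assumptions, not VERL's actual configuration surface:

```python
# Sketch: force a specific attention backend at model-load time so that
# flash_attention_2 (which lacks attention-sink support) is not selected.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",         # assumed model id, for illustration only
    torch_dtype="auto",
    attn_implementation="eager",  # fall back to the eager attention path,
                                  # which handles attention sinks correctly
)
```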
@yinzhangyue `flash_attention_2` didn't implement the backward pass with attention-sink support.