Results: 2 comments by Arup De

We investigated a critical compatibility issue: `flash_attention_2` doesn't support the gpt-oss attention sink, which caused gradient-norm spikes during training. The existing VERL codebase lacked the ability to override the `attn_implementation`...
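As a rough illustration of the kind of override involved (this is a minimal sketch, not the actual VERL change), the Hugging Face `transformers` API lets you select the attention backend via the `attn_implementation` argument when loading a model; the checkpoint name below is assumed for the example:

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch: force a non-flash attention backend so the attention-sink logits
# take the eager code path instead of flash_attention_2, which lacks
# backward (training) support for attention sinks.
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",          # illustrative checkpoint name
    attn_implementation="eager",   # avoid flash_attention_2
    torch_dtype=torch.bfloat16,
)
```

In a training framework that hard-codes the attention backend, the fix would be to plumb this argument through its model-loading config rather than relying on the library default.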

@yinzhangyue it doesn't implement the backward pass with attention-sink support.