Megatron-DeepSpeed
DeBERTa-like attention mechanism
In this issue, we discuss how viable/interesting it might be to implement a DeBERTa-like attention mechanism:
https://arxiv.org/abs/2006.03654
Things to take into account:
- performance enhancements: Does DeBERTa's attention actually improve model quality for us? Checking against an HF pretrained model first would be a cheap way to find out.
- implementation cost: How much effort would someone need to spend on implementing this feature?
- implementation feasibility: It might not work well with the Megatron-DeepSpeed setup; we need to check that.
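To make the feasibility discussion concrete, here is a minimal numpy sketch of the disentangled attention scores from the paper (content-to-content, content-to-position, and position-to-content terms, scaled by sqrt(3d)). All names are illustrative, not from the Megatron-DeepSpeed codebase, and this ignores batching, heads, and masking:

```python
import numpy as np

def disentangled_attention_scores(H, P_rel, Wq, Wk, Wq_r, Wk_r):
    """Sketch of DeBERTa-style disentangled attention scores.

    H:     (seq, d)  content hidden states
    P_rel: (2k, d)   relative position embeddings for clipped offsets
    W*:    (d, d)    content / relative-position projection matrices
    """
    seq, d = H.shape
    k = P_rel.shape[0] // 2

    Qc, Kc = H @ Wq, H @ Wk              # content queries / keys
    Qr, Kr = P_rel @ Wq_r, P_rel @ Wk_r  # relative-position queries / keys

    # relative distance delta(i, j) = i - j, shifted and clipped into [0, 2k)
    idx = np.arange(seq)
    delta = np.clip(idx[:, None] - idx[None, :] + k, 0, 2 * k - 1)

    c2c = Qc @ Kc.T                                         # content-to-content
    c2p = np.take_along_axis(Qc @ Kr.T, delta, axis=1)      # content-to-position
    p2c = np.take_along_axis(Kc @ Qr.T, delta, axis=1).T    # position-to-content

    # paper scales by sqrt(3d) since three score terms are summed
    return (c2c + c2p + p2c) / np.sqrt(3 * d)
```

The feasibility question is mostly about the gather over `delta`: unlike vanilla attention, the score computation needs relative-position lookups inside the attention kernel, which would have to coexist with how Megatron-DeepSpeed shards attention heads.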