Megatron-DeepSpeed
DeBERTa-like attention mechanism
In this issue, we discuss how viable/interesting it might be to implement a DeBERTa-like attention mechanism:
https://arxiv.org/abs/2006.03654
Things to take into account:
- performance enhancements: Does DeBERTa's attention actually improve model quality for us? Checking against an HF pretrained model first would be a cheap way to find out.
- implementation cost: How much effort would someone need to spend on implementing this feature?
- implementation feasibility: It might not work well with the Megatron-DeepSpeed setup; we need to check that.
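To make the feasibility discussion concrete, here is a minimal numpy sketch of the disentangled attention scores from the paper (content-to-content, content-to-position, and position-to-content terms, scaled by sqrt(3d)). All names are illustrative, not from the Megatron-DeepSpeed codebase, and this ignores batching, heads, and masking:

```python
import numpy as np

def disentangled_attention_scores(H, P_rel, Wq, Wk, Wq_r, Wk_r):
    """Sketch of DeBERTa-style disentangled attention scores.

    H:     (seq, d)  content hidden states
    P_rel: (2k, d)   relative position embeddings for clipped offsets
    W*:    (d, d)    content / relative-position projection matrices
    """
    seq, d = H.shape
    k = P_rel.shape[0] // 2

    Qc, Kc = H @ Wq, H @ Wk              # content queries / keys
    Qr, Kr = P_rel @ Wq_r, P_rel @ Wk_r  # relative-position queries / keys

    # relative distance delta(i, j) = i - j, shifted and clipped into [0, 2k)
    idx = np.arange(seq)
    delta = np.clip(idx[:, None] - idx[None, :] + k, 0, 2 * k - 1)

    c2c = Qc @ Kc.T                                         # content-to-content
    c2p = np.take_along_axis(Qc @ Kr.T, delta, axis=1)      # content-to-position
    p2c = np.take_along_axis(Kc @ Qr.T, delta, axis=1).T    # position-to-content

    # paper scales by sqrt(3d) since three score terms are summed
    return (c2c + c2p + p2c) / np.sqrt(3 * d)
```

The feasibility question is mostly about the gather over `delta`: unlike vanilla attention, the score computation needs relative-position lookups inside the attention kernel, which would have to coexist with how Megatron-DeepSpeed shards attention heads.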