gpt-neox
Latest DeepSpeed Support
@StellaAthena @ShivanshuPurohit
Note: we will not merge this unless we decide to get rid of DeeperSpeed
This branch completely does away with DeeperSpeed, and instead is based on upstream DeepSpeed. It doesn't take many gpt-neox changes to do this, but we lose some of the DeeperSpeed features. Feel free to use this branch unless your gpt-neox code explicitly relies on DeeperSpeed features.
Tested with:
- [x] PP, MP > 1
- [x] ZeRO-[1,2,3]
- [ ] MoE [EleutherAI/gpt-neox/pull/677]
- [ ] Autotuning
- [ ] Curriculum learning
@Quentin-Anthony Can you list which DeeperSpeed features would be lost with this move?
Small stuff like the logging format, some more detailed timers, and the forward hooks functionality in DeeperSpeed. I've already pushed the major features into upstream DeepSpeed.
I think most gpt-neox users don't need or rely on these features and can switch to the latest DeepSpeed.
The only thing I disagree with here is the detailed timers, which I, and I think many others, find quite useful. Would there be an easy way to make them part of GPT-NeoX instead of DeeperSpeed?
No, there's no way to bring those out of DeeperSpeed. Should we update the DeeperSpeed main branch to just be the DeepSpeed main branch plus the timers (throwing everything else away)? We'd have to update it periodically, but merges would be pretty simple that way. I think bringing these timers into upstream DeepSpeed would be a hard sell.
Who would do the selling though?
Us to the DeepSpeed team. I'm saying it would be difficult to convince them that these timers are needed when they already have the FLOPs profiler and communication logger.
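For reference, here is a minimal sketch of how those two upstream tools are typically switched on through the DeepSpeed config. The section and key names (`flops_profiler`, `comms_logger`) follow the DeepSpeed config documentation as I understand it. The helper name `profiling_sections` and the values chosen are purely illustrative, and how gpt-neox forwards these keys from its own configs is an assumption worth checking against the branch.

```python
import json


def profiling_sections(profile_step=10, verbose=False):
    """Build the DeepSpeed config sections for the FLOPs profiler and comms logger.

    Hypothetical helper for illustration; it is not part of gpt-neox or DeepSpeed.
    """
    return {
        # FLOPs profiler: profiles a single training step and reports per-module stats.
        "flops_profiler": {
            "enabled": True,
            "profile_step": profile_step,  # which step to profile
            "module_depth": -1,            # -1 reports down to the innermost modules
            "top_modules": 1,
            "detailed": True,
        },
        # Communication logger: records collective operations issued via deepspeed.comm.
        "comms_logger": {
            "enabled": True,
            "verbose": verbose,  # log each op as it happens when True
            "prof_all": True,    # profile all communication ops
            "debug": False,
        },
    }


if __name__ == "__main__":
    # Print the fragment so it can be merged into an existing DeepSpeed config.
    print(json.dumps(profiling_sections(), indent=2))
```

These sections cover FLOPs and communication cost, which is the overlap I'd expect the DeepSpeed team to point to. They do not reproduce the finer-grained per-phase timers discussed above.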