
DeepSpeed + TPU support via transformers' Trainer

minimaxir opened this issue 4 years ago • 4 comments

Currently, training via pytorch-lightning's implementation of DeepSpeed/TPUs is not working, and it's impossible to debug where the issues lie (i.e., within aitextgen, transformers, pytorch-lightning, or pytorch-xla), since the entire ecosystem is very fragile and the error messages are unhelpful.

A short-term workaround is to use transformers' native Trainer for DeepSpeed + TPUs (and only those specific use cases for now), as it limits potential breakage and also serves as a baseline for pytorch-lightning's approach once that stabilizes. A rough sketch of the workaround is included below.

The downside is that Trainer's UX is not as good as pytorch-lightning's, but given that DeepSpeed + TPUs are a more niche use case for power users, that's acceptable.
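As a rough illustration, here is a hedged sketch of what that workaround could look like, assuming the `deepspeed` argument of `TrainingArguments` and a tiny inline dataset standing in for aitextgen's own tokenized data:

```python
# A sketch of driving DeepSpeed through transformers' native Trainer instead
# of pytorch-lightning. The inline two-sentence dataset is illustrative only.
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Placeholder dataset: aitextgen would supply its own tokenized data here.
train_dataset = [tokenizer(t) for t in ["Hello world.", "aitextgen trains GPT-2."]]

# `deepspeed` points Trainer at a DeepSpeed JSON config; for TPUs, the same
# script would instead be launched via transformers' xla_spawn.py utility.
args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    fp16=True,
    deepspeed="ds_config.json",
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```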

minimaxir avatar Mar 02 '21 03:03 minimaxir

ZeRO-3 Offload is now available, which in theory should be easier to implement (once it works with base Transformers).
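For reference, a minimal ZeRO-3 Offload config might look like the sketch below; the key names follow the DeepSpeed documentation, but exact keys and defaults may vary by release:

```python
# A hedged sketch of a minimal DeepSpeed config enabling ZeRO stage 3 with
# CPU offload, written out as the JSON file the Trainer sketch above expects.
import json

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                               # ZeRO-3: partition parameters too
        "offload_param": {"device": "cpu"},       # offload parameters to CPU
        "offload_optimizer": {"device": "cpu"},   # offload optimizer state to CPU
    },
    "train_micro_batch_size_per_gpu": 1,
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```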

minimaxir avatar Mar 13 '21 17:03 minimaxir

Thanks for highlighting this!

@SeanNaren can help get this solved on the PL side.

williamFalcon avatar Mar 14 '21 22:03 williamFalcon

hey @minimaxir, what issues are you running into? If you're able to point to specific issues, I can help escalate/resolve them for PL!

ZeRO-3 Offload has its own quirks that will require both HuggingFace Transformers and us to figure out, so it may take a bit longer to integrate; however, we're working together on this where we can. We do have experimental support in place (sketched below), and can give some pointers if you're keen to try :)
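For anyone keen to try, a sketch of what the experimental pytorch-lightning integration looked like at the time, assuming a PL version that ships `DeepSpeedPlugin` with experimental ZeRO stage 3 support; `TinyModule` is a placeholder standing in for a real LightningModule:

```python
# A sketch of PL's experimental DeepSpeed ZeRO-3 Offload support; plugin
# arguments here reflect the API at the time and may change across releases.
import torch
import pytorch_lightning as pl
from pytorch_lightning.plugins import DeepSpeedPlugin
from torch.utils.data import DataLoader, TensorDataset

class TinyModule(pl.LightningModule):
    """Placeholder module so the example is self-contained."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

data = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))

trainer = pl.Trainer(
    gpus=1,
    precision=16,  # DeepSpeed expects fp16 training
    plugins=DeepSpeedPlugin(stage=3, cpu_offload=True),
)
trainer.fit(TinyModule(), DataLoader(data, batch_size=8))
```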

SeanNaren avatar Mar 14 '21 22:03 SeanNaren

Dear @minimaxir,

Would you mind joining the PyTorch Lightning Slack? I sent you an invitation. Sean and I can coordinate efforts with you there to resolve your issues.

Best, T.C

tchaton avatar Mar 15 '21 08:03 tchaton