DeepSpeed Investigation: What I Learned | IAmANerd
An investigation into the awesome DeepSpeed library for training large models on a single GPU!
https://nathancooper.io/i-am-a-nerd/deepspeed/deep-learning/2021/05/03/DeepSpeed-Investigation.html
Hi Nathan,
Thank you for trying out DeepSpeed. I am a researcher in the DeepSpeed team. I wanted to share a few comments here that might be helpful:
Small Models: For small models that easily fit in GPU memory, DeepSpeed should be used without CPU offloading, and you will likely see some performance improvement with DeepSpeed over the baseline. ZeRO and CPU offload are simply not intended for tiny models that already fit comfortably in GPU memory.
Large Models: For larger models, enabling CPU offloading with ZeRO saves GPU memory, which lets you train with larger batch sizes; this is crucial to achieving good performance on a single GPU. Your test script seems to use a batch size of 1, which might be why you saw a significant performance drop. May we suggest you try the largest batch size that DeepSpeed with ZeRO offload allows you to fit, and use that to compare the training times for the 500 samples you are testing.
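To make the suggestion above concrete, here is a sketch of a DeepSpeed config enabling ZeRO stage 2 with optimizer CPU offload. The exact key names (e.g. `offload_optimizer`) have varied across DeepSpeed versions, and the batch-size values are placeholders, so treat this as illustrative rather than a drop-in config:

```python
import json

# Illustrative DeepSpeed config: ZeRO stage 2 with optimizer state
# offloaded to CPU, trading some speed for GPU memory that can then
# be spent on a larger per-GPU batch size.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,  # push as high as GPU memory allows
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # frees GPU memory
    },
}

print(json.dumps(ds_config, indent=2))
```

This dict would typically be written to a JSON file and passed to `deepspeed.initialize` or to the training framework's DeepSpeed integration.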
@samyam thanks for the comment and discussing the use case for a single GPU, it clears up a lot of my confusion. I will test that out and update with my results!
@samyam I did what you recommended and got much better results using larger batch sizes (doubling the batch size for the t5-large model compared to not using DeepSpeed). One interesting thing I found during my experiments was that for the t5-small model I could use a larger batch size without DeepSpeed than with it; do you have any idea why that would be? Anyway, I've updated my blog post with the new results and revised the text to discuss the importance of using larger batch sizes.
Thanks for your comments and for helping create the awesome DeepSpeed library!
@ncoop57, @samyam, for training with larger batch sizes on smaller GPUs, can't we just use gradient accumulation rather than CPU offloading? My intuition is that it would be faster than CPU offloading, though I guess it would be easy to run a test and check. The only reason I can think of for using CPU offloading on a single GPU is if the model itself is too large to fit on the GPU, so that I can't even train with a batch size of 1. In that case CPU offloading would perhaps make training possible (or would it?)
Is my understanding correct?
@thakkarparth007 that is a good point I hadn't considered. It does seem like gradient accumulation would be better in most cases except the one you describe, unless there are additional optimizations I'm missing?
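The gradient-accumulation idea raised above can be sketched on a toy example: summing per-micro-batch gradients and dividing by the total sample count gives exactly the full-batch gradient, while only one micro-batch needs to be resident at a time. The toy quadratic loss and function names here are illustrative, not part of DeepSpeed:

```python
# Toy loss: L(w) = 0.5 * (w*x - y)^2 per sample, averaged over the batch.
def grad(w, x, y):
    # dL/dw for a single sample
    return (w * x - y) * x

def full_batch_grad(w, data):
    # Gradient over the whole batch at once (needs the full batch in memory)
    return sum(grad(w, x, y) for x, y in data) / len(data)

def accumulated_grad(w, data, micro_batch_size):
    # Process micro-batches one at a time, accumulating gradients;
    # only micro_batch_size samples are "resident" per step.
    acc = 0.0
    for i in range(0, len(data), micro_batch_size):
        micro = data[i:i + micro_batch_size]
        acc += sum(grad(w, x, y) for x, y in micro)
    return acc / len(data)

data = [(1.0, 2.0), (2.0, 1.0), (3.0, 0.5), (4.0, 4.0)]
w = 0.3
print(full_batch_grad(w, data), accumulated_grad(w, data, micro_batch_size=2))
```

The two values agree to floating-point precision, which is why accumulation emulates a larger effective batch; it does not help, however, when the model parameters themselves don't fit on the GPU, which is where CPU offloading comes in.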