[Feature]: CPU+GPU using accelerate?
Hey Everyone!
Could accelerate make it possible to distribute processing between the CPU and GPU, and memory between RAM and VRAM?
I'm not very familiar with the library and have only tried the accelerate config command, so if this isn't already possible, please treat it as a feature request.
Do you mean training with some of the model on CPU and some of it on GPU? Or could you describe the ideal workflow you're imagining in a bit more detail?
Hi @muellerzr
My rig has an RTX 2070 GPU with 8GB of VRAM and an AMD Ryzen 3900X CPU with 64GB of RAM, but only half of the system is actually used when running Stable Diffusion. What I have in mind is using the CPU + system RAM and the GPU + VRAM simultaneously. I thought accelerate could distribute the work between these two processing units to train larger models (even at slower speeds), generate larger images in Stable Diffusion, and avoid CUDA out of memory errors.
I hope I'm not wrong!
This is something we're mildly looking into (the ability to train using the same methodology as big model inference), if this is accurate to what you are thinking.
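For reference, this is roughly what the inference side of big model inference already looks like today; training with the same mechanism is the open part. A minimal sketch, where the model name, checkpoint path, and memory limits are just placeholders for a setup like yours, not recommendations:

```python
# Rough sketch of the existing big model inference split across GPU VRAM and CPU RAM.
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("gpt2-large")  # placeholder model

# Build the model structure without allocating any real weights
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Decide which layers fit on GPU 0 and which get offloaded to CPU RAM
device_map = infer_auto_device_map(model, max_memory={0: "7GiB", "cpu": "48GiB"})

# Load the checkpoint and attach hooks so CPU-offloaded layers are streamed
# onto the GPU at forward time
model = load_checkpoint_and_dispatch(model, "path/to/checkpoint", device_map=device_map)
```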
Yes, exactly! It would be great to harness both the GPU and CPU simultaneously. That would not only speed up training but also let us run our trained models more efficiently. It's certainly worth exploring for future improvements.
Hey @muellerzr, if this has been approved and needs someone to work on it, I'm willing to look into it. I would need some advice/help along the way, though.
Hi @rishabbala, if you'd like to contribute or see what you can get to, that'd be great! I'll be looking into this soon-ish, but there are other pressing matters I need to handle first. Here's a Colab notebook showing how far I got; essentially, there's an issue with autograd we need to address so that gradients backpropagate properly and efficiently: https://colab.research.google.com/drive/1s6tq_zcaXBnP3Ldj42CJ0gg4VXTZDfJ7?usp=sharing
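(Not the notebook's approach, just a related PyTorch building block that covers the activation side of this: torch.autograd.graph.save_on_cpu keeps the tensors autograd saves for backward in CPU RAM and copies them back to the GPU when backward needs them. It doesn't handle the parameters themselves, which is the harder part here.)

```python
# Minimal illustration (not the notebook's code): offload saved activations to CPU RAM
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()
x = torch.randn(32, 4096, device="cuda")

# Tensors that autograd saves for the backward pass are kept in pinned CPU memory
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).sum()

# They are copied back to the GPU as backward walks the graph
loss.backward()
```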
Hi @muellerzr, I went over your notebook and got the overall idea of what we're trying to do. If I understand correctly, we want to move the weights and intermediate tensors to the CPU after their forward call and move them back to the GPU before we perform the backward pass. Is that correct? Could you let me know what the current limitation or issue is with what you've implemented and how I should proceed? I was also wondering whether using a flattened view of the tensors when moving them to CPU would be better, since that would reduce the number of references needed to access them. A rough sketch of what I mean is below.
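To make the flattened-view idea concrete, this is roughly what I'm picturing (purely illustrative; the layer and sizes are made up):

```python
# Illustrative only: pack a layer's parameters into one flat pinned CPU buffer
# instead of keeping many small CPU tensors around.
import torch
import torch.nn as nn

layer = nn.Linear(2048, 2048).cuda()
params = list(layer.parameters())

# One pinned buffer large enough for every parameter in the layer;
# pinned memory lets the copies overlap with GPU compute (non_blocking=True)
numel = sum(p.numel() for p in params)
cpu_buffer = torch.empty(numel, dtype=params[0].dtype, pin_memory=True)

# Copy each parameter into its slice of the flat buffer
offset = 0
for p in params:
    n = p.numel()
    cpu_buffer[offset:offset + n].copy_(p.detach().view(-1), non_blocking=True)
    offset += n

# Moving everything back to the GPU later is then a single host-to-device copy,
# and each parameter can be recovered as a reshaped view of that flat tensor.
```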
Any updates on this?