[REQUEST] ZeRO-3 doc - support for wrapping model sub-components separately for training
Is your feature request related to a problem? Please describe.
It is very difficult to train multimodal models (e.g., multi-image chat or video chat models) with ZeRO-3 because the effective vision batch is much larger than the text batch. So even if the LLM is much larger than the image encoder, and much slower to run a forward pass through, you still have to limit the batch size based on the vision encoder rather than the LLM.
Describe the solution you'd like
Would it be possible to wrap different parts of the model separately, so that if you are not training the vision encoder you don't need to include it in the computational graph, and can therefore encode the images/frames in sub-batches to allow efficient forward passes?
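A minimal sketch of the kind of usage I have in mind, assuming a frozen vision encoder kept outside the ZeRO-3 engine and only the LLM wrapped by DeepSpeed (the builders `build_vision_encoder`, `build_llm`, the `ds_config` dict, and the sub-batch size are placeholders, not existing APIs):

```python
import torch
import deepspeed

# Hypothetical builders; stand-ins for whatever constructs the actual modules.
vision_encoder = build_vision_encoder().cuda().eval()
for p in vision_encoder.parameters():
    p.requires_grad_(False)          # vision encoder is frozen, not trained

llm = build_llm()                    # the trainable language model
engine, optimizer, _, _ = deepspeed.initialize(
    model=llm,
    model_parameters=[p for p in llm.parameters() if p.requires_grad],
    config=ds_config,                # ZeRO-3 config (placeholder)
)

def encode_frames(frames, sub_batch_size=8):
    """Encode frames in small sub-batches so the vision encoder's memory use
    does not dictate the batch size of the ZeRO-3-wrapped LLM."""
    outputs = []
    with torch.no_grad():            # no graph is built for the frozen encoder
        for chunk in frames.split(sub_batch_size):
            outputs.append(vision_encoder(chunk))
    return torch.cat(outputs)

# Training step sketch: visual features are computed outside the engine,
# then passed to the LLM, which is the only part ZeRO-3 partitions.
# vis_feats = encode_frames(batch["frames"])
# loss = engine(input_ids=batch["input_ids"], vision_features=vis_feats).loss
# engine.backward(loss)
# engine.step()
```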
Describe alternatives you've considered
This works in ZeRO-2. It would be nice if there were a guide on how to do this in DeepSpeed (similar to FSDP, where you sometimes wrap transformer layers separately to reduce peak GPU memory), as sketched below.
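For comparison only, this is the FSDP pattern I'm referring to, where individual transformer layers are wrapped separately via an auto-wrap policy; `MyTransformerBlock` is a placeholder for the model's actual layer class:

```python
import functools
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class MyTransformerBlock(nn.Module):
    """Placeholder for the model's transformer layer class."""
    ...

# Each MyTransformerBlock instance becomes its own FSDP unit, so parameters
# are gathered layer by layer instead of all at once, lowering peak memory.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={MyTransformerBlock},
)
# model = FSDP(model, auto_wrap_policy=wrap_policy)
```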