
Dynamic/variable batch size support

ecly opened this issue 4 years ago • 17 comments

For the model I am training, I rely on a custom Sampler that returns variable batch sizes. My task is translation, where, following Attention Is All You Need (2017), I create batches based on the total token count in a batch. Given the variable-length inputs, this results in batches with varying numbers of examples (an example here being one source/target translation pair).

For regular DDP-based training this worked fine: I simply created a distributed version of the sampler that splits each variable-size batch into sub-batches based on GPU rank. For DeepSpeed, however, I am forced to provide either train_micro_batch_size_per_gpu or train_batch_size, both of which, as I currently understand them, are expressed in the number of examples per batch.

Since the number of examples varies from batch to batch, and I just want to configure gradient accumulation by batch count rather than batch size, I'm not sure how to achieve this with DeepSpeed's configuration.
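
For context, the batch-size keys I'm referring to look roughly like the sketch below (values are purely illustrative). As I understand it, DeepSpeed requires train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs, all counted in examples rather than tokens, which is exactly what doesn't fit a token-budgeted, variable-size batch:

# Illustrative sketch of the relevant DeepSpeed config keys (values made up).
# The constraint is counted in examples per batch, not tokens.
ds_config = {
    "train_batch_size": 64,                 # 8 * 2 * 4 GPUs
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 2,
}
# In recent DeepSpeed versions this dict can be passed directly, e.g.
# model_engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)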

Am I misunderstanding the impact of the configuration variables, missing some other configuration, or is this not possible to achieve at the moment?

ecly avatar May 06 '21 12:05 ecly

I'm also super interested in knowing more about this. Happy to lend a hand so it becomes available faster!

aniruddhakal avatar Apr 15 '22 02:04 aniruddhakal

@ecly, apologies that this request somehow slipped through. I wonder what solution you ended up with and whether you are still interested in DeepSpeed support?

@aniruddhakal, thanks for bumping this back to our attention. Is it okay to wait for @ecly to respond before deciding on next steps?

tjruwase avatar Apr 18 '22 13:04 tjruwase

@tjruwase we ended up, for the most part, just using pure DDP in PyTorch. We did have moderate success with FairScale, which supported variable batch sizes out of the box, but we didn't see any benefit from it at our model size of ~200M parameters. The problem is still very relevant to us, as we'd like to adopt DeepSpeed further, but this is a blocker that makes adoption non-trivial for us.

ecly avatar Apr 18 '22 16:04 ecly

Hello, how do you use dynamic batching in DDP? Can you give an example? Due to the dynamic batch size, the number of batches allocated to different ranks is inconsistent, which is what makes the ranks unable to communicate in DDP. Are there other solutions?

wangleiofficial avatar Jun 09 '22 08:06 wangleiofficial

@ecly, I am also interested in your dynamic batch in ddp. If you can share some client code with us, that would help with DeepSpeed support. Thanks!

tjruwase avatar Jun 09 '22 13:06 tjruwase

@tjruwase infinibatch may be a good choice for dynamic batching in DDP. Note that a regular Dataset with the DistributedSampler may be better than infinibatch for the validation set.
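
Roughly, infinibatch's dynamic batching looks like the sketch below, based on my reading of its README: it pre-reads a window of examples, sorts them by length, and chooses each batch's size from a token budget. The class and parameter names here should be double-checked against the installed version of the library.

from infinibatch import iterators

# Sketch only: names follow the infinibatch README and may need adjusting.
tokenized_examples = [[0] * n for n in (5, 12, 7, 30, 3, 18)]   # toy "sentences"
source = iterators.NativeCheckpointableIterator(tokenized_examples)

max_tokens = 64
batches = iterators.BucketedReadaheadBatchIterator(
    source,
    read_ahead=100,                      # pre-read this many items and sort them by length
    key=lambda ex: len(ex),              # bucketing key: example length
    batch_size=lambda longest: max(1, max_tokens // len(longest)),  # dynamic batch size
    seed=1,
)

for batch in batches:
    print(len(batch), "examples, longest =", max(len(ex) for ex in batch), "tokens")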

wangleiofficial avatar Jun 13 '22 03:06 wangleiofficial

Hey @tjruwase and @wangleiofficial

As my original question is getting a bit old, I should probably retest on a newer version of DeepSpeed to confirm that this is still the case. Nonetheless, I'll share a few more details below.

The idea is that, when training Transformers on text inputs of varying length (in our case for Machine Translation), we cap batches by the total number of tokens rather than by the number of examples, so that each minibatch provides a similar amount of learning signal. The code for our MaxTokensBatchSampler, which we use as a batch_sampler with the PyTorch DataLoader, is similar in nature to the one used in fairseq: https://github.com/facebookresearch/fairseq/blob/b5a039c292facba9c73f59ff34621ec131d82341/fairseq/data/data_utils.py#L282
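
A stripped-down version of the idea (not our exact implementation; for brevity it works directly on a list of per-example token counts rather than on the dataset) looks something like this:

class MaxTokensBatchSampler:
    # Simplified sketch, not our production code: greedily pack length-sorted
    # indices into batches whose total token count stays under the budget.
    def __init__(self, lengths, batch_max_tokens):
        self.batches = []
        batch, batch_tokens = [], 0
        for idx in sorted(range(len(lengths)), key=lambda i: lengths[i]):
            if batch and batch_tokens + lengths[idx] > batch_max_tokens:
                self.batches.append(batch)
                batch, batch_tokens = [], 0
            batch.append(idx)
            batch_tokens += lengths[idx]
        if batch:
            self.batches.append(batch)

    def __iter__(self):
        return iter(self.batches)

    def __len__(self):
        return len(self.batches)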

We adapt it for DDP with only a tiny bit of code:

import itertools
from typing import Iterator, List

from torch.utils.data import DistributedSampler


class DistributedMaxTokensBatchSampler(DistributedSampler, MaxTokensBatchSampler):

    def __init__(self, dataset: TranslationDataset, batch_max_tokens: int, **kwargs):
        DistributedSampler.__init__(self, dataset)
        MaxTokensBatchSampler.__init__(self, dataset, batch_max_tokens, **kwargs)

    def __iter__(self) -> Iterator[List[int]]:
        # Every rank builds the same list of batches and then takes every
        # num_replicas-th batch, offset by its own rank.
        iterator = MaxTokensBatchSampler.__iter__(self)
        return itertools.islice(iterator, self.rank, None, self.num_replicas)

    def __len__(self):
        return len(self.batches) // self.num_replicas

This approach effectively yields batches with different numbers of examples but similar total token counts. In our experience, it works out of the box with DDP.
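
Usage is then just the standard batch_sampler pattern, roughly as follows; dataset, the token budget, and pad_collate below are placeholders for our own TranslationDataset and padding collate function:

from torch.utils.data import DataLoader

# Sketch: the sampler yields lists of example indices per batch, so it plugs into
# DataLoader via batch_sampler. `dataset` and `pad_collate` stand in for our own code.
sampler = DistributedMaxTokensBatchSampler(dataset, batch_max_tokens=4096)
loader = DataLoader(dataset, batch_sampler=sampler, collate_fn=pad_collate)

for batch in loader:
    ...  # forward/backward under DDP; each rank iterates its own disjoint slice of batches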

ecly avatar Jun 13 '22 09:06 ecly

Keep tracking this issue

HsunGong avatar Mar 01 '23 09:03 HsunGong