DeepSpeedExamples
Dynamic batch support
Machine translation training usually takes dynamically sized batches composed of X tokens rather than X sentences as input. I'm wondering why DeepSpeed requires specifying train_batch_size and train_micro_batch_size_per_gpu, both of which refer to a number of samples. Is this due to implementation details, or is it possible to support dynamic batch sizes, as in machine translation, without extra cost in efficiency or memory usage? For reference, a minimal config sketch (values illustrative) showing the two parameters in question:
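```python
# Minimal DeepSpeed config sketch; the numeric values are illustrative.
ds_config = {
    "train_batch_size": 256,              # global effective batch size, in samples
    "train_micro_batch_size_per_gpu": 8,  # samples per GPU per forward/backward pass
    # gradient_accumulation_steps can be derived from the two values above
}
```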
The primary reason is to figure out the number of required gradient accumulation steps: DeepSpeed enforces the relation train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size, which is only well defined when both batch sizes are expressed in samples.
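A minimal sketch of that derivation, assuming fixed sample-count batch sizes (the function name is illustrative, not DeepSpeed API):

```python
def gradient_accumulation_steps(train_batch_size: int,
                                micro_batch_size_per_gpu: int,
                                world_size: int) -> int:
    """Derive accumulation steps from the two sample-count settings.

    world_size is the number of data-parallel ranks; each optimizer step
    consumes micro_batch_size_per_gpu * world_size samples per micro step.
    """
    samples_per_micro_step = micro_batch_size_per_gpu * world_size
    assert train_batch_size % samples_per_micro_step == 0, \
        "train_batch_size must be divisible by micro_batch * world_size"
    return train_batch_size // samples_per_micro_step

# Example: 256 samples globally = 8 samples/GPU * 4 GPUs * 8 accumulation steps
print(gradient_accumulation_steps(256, 8, 4))  # -> 8
```

With token-based dynamic batches, the number of samples per micro batch varies, so this relation no longer pins down a fixed accumulation count, which is why the sample-based settings are required.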