accelerate
Feature request: ability to truncate datasets on split rather than pad
As a developer working on latent diffusion model training via SimpleTuner, I have found that the built-in mechanism for splitting datasets across processes is not smart enough for cases where a robust sample-tracking mechanism is in use.
SimpleTuner maintains a 'seen' list to track samples per epoch so that we do not inadvertently oversample. A side effect is that padding does not actually work: the repeated (padded) samples in the list are simply discarded.
What happens next is that one of the GPUs runs out of data just before the others would have, causing a deadlock: the main process waits on a backward pass that will never come.
My solution was to truncate the sets I'm splitting, taking batch_size * gradient_steps * num_processes into account, and then split the result. But it occurred to me that having this built in would be nice.
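For illustration, a minimal sketch of the truncation workaround described above. The function name and parameters here are hypothetical (not part of the accelerate API): the idea is simply to drop trailing samples so the dataset length is an exact multiple of batch_size * gradient_steps * num_processes, guaranteeing every process exhausts its shard at the same step.

```python
def truncate_for_even_split(samples, batch_size, gradient_steps, num_processes):
    """Hypothetical helper: drop trailing samples so every process
    receives the same number of full effective steps, avoiding the
    deadlock where one rank runs out of data before the others."""
    # One "effective step" consumes this many samples across all ranks.
    step = batch_size * gradient_steps * num_processes
    usable = (len(samples) // step) * step
    return samples[:usable]

# Example: 103 samples, batch_size=4, gradient_steps=2, num_processes=2
# => step = 16, so 96 samples are kept and 7 are dropped.
samples = list(range(103))
truncated = truncate_for_even_split(samples, batch_size=4,
                                    gradient_steps=2, num_processes=2)
```

The truncated list can then be split across processes as usual, with no padding required.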