Feature request: ability to truncate datasets on split rather than pad

Open bghira opened this issue 1 year ago • 2 comments

As a developer working on latent diffusion model training via SimpleTuner, I've found that the built-in mechanism for splitting datasets across processes does not hold up when a robust sample-tracking mechanism is in use.

SimpleTuner uses a 'seen' list to keep track of samples per epoch so that we do not inadvertently oversample. A side effect is that padding does not actually work: the repeated samples added as padding are already in the list, so they are simply discarded.

What happens next is that one of the GPUs runs out of data just before the other would have, which causes a deadlock: the main process waits for a backward pass that will never come.
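To make the failure mode concrete, here is a toy sketch in plain Python. It is not Accelerate's or SimpleTuner's actual code: `pad_and_shard` and `drop_seen` are hypothetical helpers, and the assumptions are that padding repeats samples from the start of the dataset and that the 'seen' list is shared across ranks.

```python
def pad_and_shard(samples, num_processes):
    """Pad by repeating leading samples until the total divides evenly,
    then hand out samples round-robin, one shard per process."""
    padded = list(samples)
    remainder = len(samples) % num_processes
    if remainder:
        padded += samples[: num_processes - remainder]
    return [padded[rank::num_processes] for rank in range(num_processes)]


def drop_seen(shards):
    """Mimic a 'seen' list (assumed shared across ranks): any sample that
    was already recorded this epoch is discarded."""
    seen = set()
    kept = []
    for shard in shards:
        unique = []
        for sample in shard:
            if sample not in seen:
                seen.add(sample)
                unique.append(sample)
        kept.append(unique)
    return kept


samples = [f"img_{i}" for i in range(5)]
shards = pad_and_shard(samples, num_processes=2)
lengths_before = [len(s) for s in shards]            # even after padding
lengths_after = [len(s) for s in drop_seen(shards)]  # uneven once duplicates are dropped
print(lengths_before, lengths_after)
```

With five samples and two processes, both shards start with three samples, but the padded duplicate is discarded and one rank ends up a sample short. That is the situation in which one GPU finishes early and the other waits forever.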

My solution was to truncate the sets I'm splitting to a size that accounts for batch_size * gradient_steps * num_processes, and then split them. But it occurred to me that having this built in would be nice.
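A minimal sketch of that workaround, under the assumption that "taking into account" means truncating to the largest multiple of batch_size * gradient_steps * num_processes. The function name `truncate_then_split` is hypothetical and is not part of Accelerate or SimpleTuner.

```python
def truncate_then_split(samples, batch_size, gradient_steps, num_processes):
    """Drop trailing samples so the total is a multiple of
    batch_size * gradient_steps * num_processes, then shard round-robin."""
    chunk = batch_size * gradient_steps * num_processes
    usable = (len(samples) // chunk) * chunk  # largest multiple of chunk that fits
    truncated = samples[:usable]
    return [truncated[rank::num_processes] for rank in range(num_processes)]


# 23 samples, chunk = 2 * 2 * 2 = 8, so 16 samples are kept and 7 are dropped.
dataset = list(range(23))
split = truncate_then_split(dataset, batch_size=2, gradient_steps=2, num_processes=2)
for rank, shard in enumerate(split):
    print(rank, len(shard))
```

Because no sample is repeated, a 'seen' list has nothing to discard, and every process receives the same number of full gradient-accumulation cycles. The cost is that up to chunk - 1 samples are skipped each epoch.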

bghira, Oct 01 '23 21:10