SSR with fractional epochs
If `max_duration` is specified in epochs and SSR is used, then the resulting time could be truncated to zero. For example, with `max_duration='1ep'` and `ssr=0.5`, the trainer will scale `max_duration` to `0ep`, which prevents training from running (since we have a check to ensure that `max_duration > 0`).
Instead, when using `ssr` and the ratio would result in a fractional epoch, would it make sense to convert the `max_duration` to batches (assuming the dataloader is sized), samples (if `num_samples` is provided in the dataspec, or the dataset is sized), or tokens (if provided via the dataspec)? Then, we can scale training appropriately. For example, if `max_duration=1ep` but `len(dataloader)=100` and `ssr=0.5`, we would convert to `max_duration=50ba` and then train for 50 batches.
We should probably emit a `log.info` (or perhaps a `log.warn`) in this case, so the user knows that the last epoch will no longer be fully trained.
In addition, a test should be added to ensure that `1ep` is also scaled correctly by SSR.
Originally posted by @ajaysaini725 in https://github.com/mosaicml/composer/pull/594#discussion_r814327729
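To make the proposal concrete, here is a minimal sketch of the conversion and the warning, assuming the dataloader is sized; the `scale_epochs_to_batches` helper, its signature, and the warning text are hypothetical, not Composer's actual implementation:

```python
import logging
import math

log = logging.getLogger(__name__)

def scale_epochs_to_batches(max_duration_ep: int, ssr: float, batches_per_epoch: int) -> int:
    """Hypothetical helper: convert an epoch-based max_duration to batches, then apply SSR."""
    total_batches = max_duration_ep * batches_per_epoch
    scaled_batches = math.floor(total_batches * ssr)  # round down to a whole batch
    if scaled_batches % batches_per_epoch != 0:
        # The last epoch will only be partially trained, so tell the user.
        log.warning('SSR=%s yields a fractional epoch; training stops after %d batches.',
                    ssr, scaled_batches)
    return scaled_batches

# The example above: max_duration=1ep, len(dataloader)=100, ssr=0.5 -> 50 batches.
assert scale_epochs_to_batches(1, 0.5, batches_per_epoch=100) == 50
```

The suggested test could then parametrize a few SSR values against that same hypothetical helper (again a sketch, not an existing Composer test):

```python
import pytest

@pytest.mark.parametrize('ssr,expected', [(1.0, 100), (0.5, 50), (0.25, 25), (0.33, 33)])
def test_one_epoch_is_scaled_by_ssr(ssr, expected):
    # 1ep with a 100-batch dataloader should become floor(100 * ssr) batches.
    assert scale_epochs_to_batches(1, ssr, batches_per_epoch=100) == expected
```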
A few things we would have to assert for this to happen:

- Is the learning rate scheduler step-wise? If it's epoch-wise, this would lead to a silent failure that would be frustrating for the user. I think this proposal would be acceptable if the LR scheduler is step-wise.
- What is the checkpointing interval? Do we also cut the checkpointing interval by 0.5 if `max_duration=1ep` and `ssr=0.5`?
- What do we do if the SSR is fractional and leads to an incomplete sequence? This might be a broader question around our Time abstraction, but if `SSR=0.33` and `max_duration=102400tok`, then `max_duration=34133.33tok`. Assuming a sequence length of 1024, this leads to `33.33ba`. Do we want to a) forget the last 1/3rd of a batch, b) create a new sequence of 337 tokens and pad the rest, or c) round up and make it `34ba`?
Highly agreed that we should emit a `log.warn`!
> A few things we would have to assert for this to happen:
>
> - Is the learning rate scheduler step-wise? If it's epoch-wise, this would lead to a silent failure that would be frustrating for the user. I think this proposal would be acceptable if the LR scheduler is step-wise.
Epoch-wise schedulers are already converted to batches. The default behavior is step-wise, which should already be covered.
> - What is the checkpointing interval? Do we also cut the checkpointing interval by 0.5 if `max_duration=1ep` and `ssr=0.5`?
We do not apply SSR to the checkpoint interval. If the interval is, say, every epoch, then the checkpoint would still be saved at the end of an epoch (not every half epoch). Even if SSR results in only 0.5 of the dataset being trained, we still increment the epoch counter to 1, which results in the checkpoint being saved.
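As a toy illustration of that behavior (purely illustrative, not Composer internals): SSR shortens `max_duration`, but the save interval and the epoch counter are untouched, so an every-epoch checkpoint still fires after the shortened epoch.

```python
def toy_fit(batches_per_epoch: int = 100, ssr: float = 0.5, save_interval_ep: int = 1) -> None:
    # SSR is applied to max_duration only, not to the checkpoint interval.
    max_batches = int(batches_per_epoch * ssr)
    for _ in range(max_batches):
        pass  # train on one batch
    epoch = 1  # the epoch counter still reaches 1 after the shortened epoch
    if epoch % save_interval_ep == 0:
        print(f'checkpoint saved at epoch {epoch} after {max_batches} batches')

toy_fit()  # -> checkpoint saved at epoch 1 after 50 batches
```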
> - What do we do if the SSR is fractional and leads to an incomplete sequence? This might be a broader question around our Time abstraction, but if `SSR=0.33` and `max_duration=102400tok`, then `max_duration=34133.33tok`. Assuming a sequence length of 1024, this leads to `33.33ba`. Do we want to a) forget the last 1/3rd of a batch, b) create a new sequence of 337 tokens and pad the rest, or c) round up and make it `34ba`?
Batches, samples, and tokens are always integers, so the scaled value would have to be rounded up or down. I don't have a strong preference which way to round, though. Currently we round down.
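For concreteness, the arithmetic behind that example, treating the quoted `SSR=0.33` as one third so it reproduces the `34133.33tok` figure (illustrative only, not Composer code):

```python
import math

max_duration_tok = 102_400
ssr = 1 / 3                # the "SSR=0.33" from the example above
seq_len = 1024

scaled_tok = max_duration_tok * ssr   # ~34133.33 tokens
scaled_ba = scaled_tok / seq_len      # ~33.33 batches

print(math.floor(scaled_ba))  # 33 -> current behavior: round down and drop the partial batch
print(math.ceil(scaled_ba))   # 34 -> option (c): round up to a full extra batch
```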
> Highly agreed that we should emit a `log.warn`!
:+1:
Closing for now as we're tracking elsewhere, but it's low priority. We're open to community PRs!