
SSR with fractional epochs


If max_duration is specified in epochs and SSR is used, the resulting time can be truncated to zero. For example, with max_duration='1ep' and ssr=0.5, the trainer will scale max_duration to 0ep, which prevents training from running (since we have a check to ensure that max_duration > 0).

Instead, when the SSR would result in a fractional epoch, would it make sense to convert max_duration to batches (if the dataloader is sized), samples (if num_samples is provided in the dataspec, or the dataset is sized), or tokens (if provided via the dataspec)? Then we could scale training appropriately. For example, if max_duration=1ep, len(dataloader)=100, and ssr=0.5, we would convert max_duration to 50ba and train for 50 batches.

We should probably emit a log.info (or perhaps a log.warn) in this case, so the user knows that the last epoch will no longer be fully trained.
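
For concreteness, a minimal sketch of this fallback, assuming a sized dataloader; the helper name here is hypothetical, not Composer's actual internals:

```python
import warnings


def scale_epochs_to_batches(max_epochs: int, ssr: float, batches_per_epoch: int) -> str:
    """Hypothetical helper: convert an epoch-based max_duration to batches when SSR
    would otherwise yield a fractional (or zero) number of epochs."""
    scaled_epochs = max_epochs * ssr
    if scaled_epochs > 0 and scaled_epochs == int(scaled_epochs):
        return f'{int(scaled_epochs)}ep'  # whole number of epochs: keep the epoch unit
    scaled_batches = int(max_epochs * batches_per_epoch * ssr)  # round down to whole batches
    warnings.warn(
        f'ssr={ssr} yields a fractional epoch; scaling max_duration to {scaled_batches}ba, '
        'so the last epoch will not be fully trained.')
    return f'{scaled_batches}ba'


# max_duration='1ep', len(dataloader)=100, ssr=0.5  ->  train for 50 batches
assert scale_epochs_to_batches(1, 0.5, 100) == '50ba'
```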

In addition, a test should be added to ensure that 1ep is also scaled correctly by SSR.

Originally posted by @ajaysaini725 in https://github.com/mosaicml/composer/pull/594#discussion_r814327729

ravi-mosaicml avatar Feb 28 '22 22:02 ravi-mosaicml

A few things we would have to assert for this to happen:

  1. Is the learning rate scheduler step-wise? If it's epoch-wise, this would lead to a silent failure that would be frustrating for the user. I think this proposal would be acceptable if the LR scheduler is stepwise.
  2. What is the checkpointing interval? Do we also cut the checkpointing interval by 0.5 if max_duration=1ep and ssr=0.5?
  3. What do we do if a fractional SSR leads to an incomplete sequence? This might be a broader question around our Time abstraction, but if SSR=1/3 and max_duration=102400tok, then the scaled max_duration is 34133.33tok. Assuming a sequence length of 1024, this leads to 33.33ba. Do we want to a) forget the last 1/3 of a batch, b) create a new sequence of ~341 tokens and pad the rest, or c) round up and make it 34ba? (See the arithmetic sketch after this list.)
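
To make the arithmetic in point 3 concrete, a small sketch of the three rounding options (plain Python; uses 1/3 for the "0.33" ratio):

```python
import math

SSR = 1 / 3          # the "0.33" ratio in the example
MAX_TOKENS = 102_400
SEQ_LEN = 1024

scaled_tokens = MAX_TOKENS * SSR          # ~34133.3 tokens
scaled_batches = scaled_tokens / SEQ_LEN  # ~33.3 batches

option_a = math.floor(scaled_batches)                # (a) 33ba: drop the partial batch
partial = round(scaled_tokens) - option_a * SEQ_LEN  # (b) ~341-token short sequence, padded to SEQ_LEN
option_c = math.ceil(scaled_batches)                 # (c) 34ba: round up

print(option_a, partial, option_c)  # 33 341 34
```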

Highly agreed that we should emit a log.warn!

moinnadeem avatar Feb 28 '22 22:02 moinnadeem

> A few things we would have to assert for this to happen:
>
>   1. Is the learning rate scheduler step-wise? If it's epoch-wise, this would lead to a silent failure that would be frustrating for the user. I think this proposal would be acceptable if the LR scheduler is stepwise.

Epoch-wise schedulers are already converted to batches. The default behavior is step-wise, which should already be covered.

>   2. What is the checkpointing interval? Do we also cut the checkpointing interval by 0.5 if max_duration=1ep and ssr=0.5?

We do not apply SSR to the checkpoint interval. If the interval is, say, every epoch, then the checkpoint would still be saved at the end of an epoch (not every half epoch). Even if SSR results in only half of the dataset being trained, we still increment the epoch counter to 1, which triggers the checkpoint save.
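
A toy loop (plain Python, not Composer's actual trainer) sketching that behavior: SSR shortens how many batches run, but the epoch counter still increments, so an every-epoch checkpoint still fires:

```python
def train(num_epochs: int, batches_per_epoch: int, ssr: float):
    """Toy sketch: SSR truncates the number of batches trained, not the epoch counter
    or the every-epoch checkpoint."""
    checkpoints = []
    batches_to_run = int(num_epochs * batches_per_epoch * ssr)  # SSR applied to duration only
    batches_run = 0
    epoch = 0
    while batches_run < batches_to_run:
        for _ in range(batches_per_epoch):
            if batches_run >= batches_to_run:
                break
            batches_run += 1              # train on one batch
        epoch += 1                        # the epoch counter still increments...
        checkpoints.append(f'ep{epoch}')  # ...so the every-epoch checkpoint is still saved
    return batches_run, checkpoints


# max_duration=1ep, 100 batches/epoch, ssr=0.5 -> 50 batches trained, checkpoint at epoch 1
print(train(1, 100, 0.5))  # (50, ['ep1'])
```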

>   3. What do we do if a fractional SSR leads to an incomplete sequence? This might be a broader question around our Time abstraction, but if SSR=1/3 and max_duration=102400tok, then the scaled max_duration is 34133.33tok. Assuming a sequence length of 1024, this leads to 33.33ba. Do we want to a) forget the last 1/3 of a batch, b) create a new sequence of ~341 tokens and pad the rest, or c) round up and make it 34ba?

Batches, samples, and tokens are always integers, so the scaled value would have to be rounded either down or up. I don't have a strong preference which way to round, though; currently we round down.

> Highly agreed that we should emit a log.warn!

:+1:

ravi-mosaicml avatar Feb 28 '22 22:02 ravi-mosaicml

Closing for now since we're tracking this elsewhere and it's low priority. We're open to community PRs!

mvpatel2000 avatar Jun 22 '23 21:06 mvpatel2000