
Help users reduce `ProgressBarLogger` print frequency

Open abhi-mosaic opened this issue 2 years ago • 8 comments

Making this a draft PR in case I'm biting off more than I can chew...

This PR does the following:

  • Simplify the typing of console_log_level to Union[str, int, LogLevel], since it seems pretty unlikely someone will build a whole Callable requiring imports of LogLevel and State
    • this also better matches the name of the variable
    • and matches the signature of other LoggerDestinations like InMemoryLogger.log_level
  • Set the default to LogLevel.BATCH, which I'm pretty sure makes more sense
  • Add console_log_every_n_batches: int = 1 to let users easily scale down the frequency (see the usage sketch after this list)
  • Update some YAMLs to remove references to progress_bar, which is now added by default in the Trainer
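
A quick sketch of the intended usage (a hypothetical example: the console_* argument names are from this PR and may change before merge, and model / train_dataloader are assumed to be constructed elsewhere):

from composer import Trainer

# `model` and `train_dataloader` are placeholders built elsewhere; the two
# console_* arguments below are the ones added/renamed in this PR.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='1ep',
    log_to_console=True,               # print metrics to the console
    console_log_level='batch',         # Union[str, int, LogLevel]
    console_log_every_n_batches=50,    # only print every 50th batch
)
trainer.fit()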

Questions from Abhi:

  • I see that the ProgressBarLogger is sort of added automatically to the Trainer, but then I also see all the ProgressBarLogger args passed in as well :( I wonder if that's a good practice, or whether it would make more sense to just have trainer.progress_bar_logger: bool = True, which instantiates a good default ProgressBarLogger, but then if you want to do anything custom, you ought to just construct the object yourself... It makes me sad to see 4-5 ProgressBarLogger-specific args littering the trainer signature.

relevant code snippet from trainer.py

loggers = list(ensure_tuple(loggers))
if any(isinstance(x, ProgressBarLogger) for x in loggers):
    warnings.warn(
        DeprecationWarning(
            (f'Specifying the {ProgressBarLogger.__name__} via `loggers` is deprecated. Instead, '
             'please specify `progress_bar`, `log_to_console`, `log_level`, and `stream` arguments when '
             'constructing the trainer. If specified, these arguments will be ignored, as the '
             f'{ProgressBarLogger.__name__} was already created.')))
  • Is log_every_n_batches useful enough that it should be made global, or something that each LoggerDestination implements? I know we've had some debate before, and I have a light preference for per-logger args, because in the cloud setting, I probably want super-detailed logs going to WandB, but I want my console to be relatively sparse, maybe logging every 1000 batches.

abhi-mosaic avatar Aug 03 '22 20:08 abhi-mosaic

The new user experience:

Default

$ composer examples/run_composer_trainer.py -f composer/yamls/models/bert-base.yaml
...
train                           0%|                         | 229376/286720000 [00:45<10:33:07, 7541.75sp/s, loss/train=9.4105]

Default --log_to_console

$ composer examples/run_composer_trainer.py -f composer/yamls/models/bert-base.yaml --log_to_console
...
[BATCH][sample=94208/286720000]: { "trainer/global_step": 23, "trainer/batch_idx": 23, }
[BATCH][sample=94208/286720000]: { "trainer/grad_accum": 1, }
[BATCH][sample=94208/286720000]: { "loss/train": 9.9996, }
[BATCH][sample=98304/286720000]: { "trainer/global_step": 24, "trainer/batch_idx": 24, }
[BATCH][sample=98304/286720000]: { "trainer/grad_accum": 1, }
[BATCH][sample=98304/286720000]: { "loss/train": 9.9705, }
[BATCH][sample=102400/286720000]: { "trainer/global_step": 25, "trainer/batch_idx": 25, }
[BATCH][sample=102400/286720000]: { "trainer/grad_accum": 1, }
[BATCH][sample=102400/286720000]: { "loss/train": 9.9430, }
[BATCH][sample=106496/286720000]: { "trainer/global_step": 26, "trainer/batch_idx": 26, }
train                           0%|                         | 106496/286720000 [00:30<10:41:37, 7445.01sp/s, loss/train=9.9430]

Using --log_to_console and --console_log_every_n_batches 5

$ composer examples/run_composer_trainer.py -f composer/yamls/models/bert-base.yaml --log_to_console --console_log_every_n_batches 5
...
[BATCH][sample=163840/286720000]: { "trainer/global_step": 40, "trainer/batch_idx": 40, }
[BATCH][sample=163840/286720000]: { "trainer/grad_accum": 1, }
[BATCH][sample=163840/286720000]: { "loss/train": 9.6202, }
[BATCH][sample=184320/286720000]: { "trainer/global_step": 45, "trainer/batch_idx": 45, }
[BATCH][sample=184320/286720000]: { "trainer/grad_accum": 1, }
[BATCH][sample=184320/286720000]: { "loss/train": 9.5386, }
[BATCH][sample=204800/286720000]: { "trainer/global_step": 50, "trainer/batch_idx": 50, }
[BATCH][sample=204800/286720000]: { "trainer/grad_accum": 1, }
[BATCH][sample=204800/286720000]: { "loss/train": 9.4781, }

train                           0%|                         | 217088/286720000 [00:45<11:25:49, 6962.53sp/s, loss/train=9.4541]

abhi-mosaic avatar Aug 03 '22 21:08 abhi-mosaic

I see that the ProgressBarLogger is sort of added automatically to the Trainer, but then I also see all the ProgressBarLogger args passed in as well :( I wonder if that's a good practice, or whether it would make more sense to just have trainer.progress_bar_logger: bool = True, which instantiates a good default ProgressBarLogger, but then if you want to do anything custom, you ought to just construct the object yourself... It makes me sad to see 4-5 ProgressBarLogger-specific args littering the trainer signature.

From what I recall, we are deprecating allowing folks to pass in the ProgressBarLogger as a callback, for a technical reason that I do not quite recall... I think this has to do with how the console is handled. cc: @ravi-mosaicml.

hanlint avatar Aug 04 '22 18:08 hanlint

I see that the ProgressBarLogger is sort of added automatically to the Trainer, but then I also see all the ProgressBarLogger args passed in as well :( I wonder if that's a good practice, or whether it would make more sense to just have trainer.progress_bar_logger: bool = True, which instantiates a good default ProgressBarLogger, but then if you want to do anything custom, you ought to just construct the object yourself... It makes me sad to see 4-5 ProgressBarLogger-specific args littering the trainer signature.

From what I recall, we are deprecating allowing folks to pass in the ProgressBarLogger as a callback, for a technical reason that I do not quite recall... I think this has to do with how the console is handled. cc: @ravi-mosaicml.

Yes, all prints should go through tqdm so they do not conflict with the progress bar. This allows tqdm to put the prints "above" the current pbar, which limits progress bar jumping and terminal clogging.
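
For context, this is the standard tqdm pattern being described (a generic illustration, not Composer's internal code):

import time
from tqdm import tqdm

# tqdm.write() clears the active bar, prints the message, then redraws the
# bar, so log lines land "above" it instead of corrupting it.
for step in tqdm(range(100), desc='train'):
    time.sleep(0.01)
    if step % 25 == 0:
        # A bare print() here would interleave with the bar's redraws.
        tqdm.write(f'[BATCH][batch_idx={step}]: loss/train=9.99')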

ravi-mosaicml avatar Aug 04 '22 22:08 ravi-mosaicml

Is log_every_n_batches useful enough that it should be made global, or something that each LoggerDestination implements? I know we've had some debate before, and I have a light preference for per-logger args, because in the cloud setting, I probably want super-detailed logs going to WandB, but I want my console to be relatively sparse, maybe logging every 1000 batches.

From our discussion with Evan, I think we decided that we would offload the frequency to the caller, rather than add it to each logger?

ravi-mosaicml avatar Aug 04 '22 22:08 ravi-mosaicml

From our discussion with Evan, I think we decided that we would offload the frequency to the caller, rather than add it to each logger?

I think it makes sense to allow the caller to do things on some custom frequency if it wants to, but in addition, it seems pretty useful to have a logger- or global-scoped frequency limiter... For instance, right now I have like 4 different things producing logs: the Trainer itself, my LRMonitor, my SpeedMonitor, and the GradientClippingAlgorithm. I certainly don't want to pass log_every_n_batches to every one of these objects, right?
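
To make the per-logger option concrete, a hypothetical sketch (ThrottledConsoleLogger and log_metrics are illustrative names, not Composer APIs):

# Every producer (the Trainer, LRMonitor, SpeedMonitor, a gradient-clipping
# algorithm, ...) logs unconditionally; the destination decides what to emit.
class ThrottledConsoleLogger:

    def __init__(self, log_every_n_batches: int = 1):
        self.log_every_n_batches = log_every_n_batches

    def log_metrics(self, batch_idx: int, metrics: dict) -> None:
        # Drop everything except every Nth batch, so callers never need to
        # know the throttle exists.
        if batch_idx % self.log_every_n_batches == 0:
            print(f'[BATCH][batch_idx={batch_idx}]:', metrics)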

abhi-mosaic avatar Aug 04 '22 22:08 abhi-mosaic

Questions from Abhi:

  • I see that the ProgressBarLogger is sort of added automatically to the Trainer, but then I also see all the ProgressBarLogger args passed in as well :( I wonder if that's a good practice, or whether it would make more sense to just have trainer.progress_bar_logger: bool = True, which instantiates a good default ProgressBarLogger, but then if you want to do anything custom, you ought to just construct the object yourself... It makes me sad to see 4-5 ProgressBarLogger-specific args littering the trainer signature.

I very much agree with this! I didn't do it in my PR, but want to do it eventually. I'd certainly be down if you did it in yours and I could deal with the merge.

relevant code snippet from trainer.py

loggers = list(ensure_tuple(loggers))
if any(isinstance(x, ProgressBarLogger) for x in loggers):
    warnings.warn(
        DeprecationWarning(
            (f'Specifying the {ProgressBarLogger.__name__} via `loggers` is deprecated. Instead, '
             'please specify `progress_bar`, `log_to_console`, `log_level`, and `stream` arguments when '
             'constructing the trainer. If specified, these arguments will be ignored, as the '
             f'{ProgressBarLogger.__name__} was already created.')))

Let's remove this.

I am a fan of the pattern of one arg in the trainer to enable it, plus good defaults that are well documented. Then, if you want customizable behavior, you can pass in your own. I really think we should move towards the pattern that IF we create an object under the hood that we want to abstract away from the user, then we ONLY expose one argument for configuring it in the trainer.
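
As a sketch of that pattern (the progress_bar_logger argument and its handling here are hypothetical, not the current Trainer API):

from typing import Union

class ProgressBarLogger:
    pass  # stand-in for composer.loggers.ProgressBarLogger

class Trainer:

    def __init__(self, progress_bar_logger: Union[bool, ProgressBarLogger] = True):
        if isinstance(progress_bar_logger, bool):
            # One boolean flag: a well-documented default instance, or nothing.
            self.pbar = ProgressBarLogger() if progress_bar_logger else None
        else:
            # Anything custom: the user constructed the object themselves.
            self.pbar = progress_bar_logger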

  • Is log_every_n_batches useful enough that it should be made global, or something that each LoggerDestination implements? I know we've had some debate before, and I have a light preference for per-logger args, because in the cloud setting, I probably want super-detailed logs going to WandB, but I want my console to be relatively sparse, maybe logging every 1000 batches.

I like that actually. I think removing log_levels, defaulting per-batch, and then exposing log_every_n_batches is fine.

Two follow-up questions:

  1. What do we do about per-microbatch logging if someone wants that? (Another arg, I guess?)
  2. I think it's becoming clear that ProgressBarLogger and FileLogger should potentially be separate from the other loggers? It's hard to do a one-size-fits-all solution when ProgressBarLogger is just very different conceptually from WandBLogger and the others.

eracah avatar Aug 05 '22 20:08 eracah

From our discussion with Evan, I think we decided that we would offload the frequency to the caller, rather than add it to each logger?

I think it makes sense to allow the caller to do things on some custom frequency if it wants to, but in addition, it seems pretty useful to have a logger- or global-scoped frequency limiter... For instance, right now I have like 4 different things producing logs: the Trainer itself, my LRMonitor, my SpeedMonitor, and the GradientClippingAlgorithm. I certainly don't want to pass log_every_n_batches to every one of these objects, right?

As mentioned above, I will remove log_levels, but yeah, we might have to add a log_every_n_batches to every LoggerDestination (not every Callback that calls a Logger.log function!). But IF ProgressBarLogger were separate from the others, we could set different defaults for ProgressBarLogger vs. WandBLogger...

eracah avatar Aug 05 '22 20:08 eracah

Evaluators have an optional eval_interval argument; if it's not passed, the evaluator defaults to using the trainer's eval_interval. How do people feel about implementing log_every_n_batches similarly to how we implement eval_interval, such that you can specify log_every_n_batches for each logger, and if it's not specified, it falls back to the trainer value (which has a sensible default)? And perhaps renaming log_every_n_batches to log_interval for brevity?
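
A minimal sketch of that fallback (log_interval on both the logger and the trainer is a hypothetical name from this thread, not an existing Composer argument):

from typing import Optional

class LoggerDestination:

    def __init__(self, log_interval: Optional[int] = None):
        self.log_interval = log_interval  # None => defer to the trainer

class Trainer:

    def __init__(self, loggers: list, log_interval: int = 100):
        for logger in loggers:
            if logger.log_interval is None:
                # Mirror Evaluator.eval_interval: loggers that don't set an
                # interval inherit the trainer-level default.
                logger.log_interval = log_interval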

growlix avatar Aug 06 '22 02:08 growlix