logger: Add notifier to `next_step`?

Open daavoo opened this issue 4 years ago • 13 comments

Depending on the type of model being trained, the time between calls to next_step may vary significantly. In common deep learning scenarios, e.g. the keras callback, next_step is called at the end of each epoch, which could mean long gaps (maybe hours) between calls.

It could be useful to have built-in support for optionally sending a notification each time next_step is called.

Without changing dvclive, the user could just call a custom library (e.g. https://github.com/liiight/notifiers) after next_step:

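import dvclive
from keras.callbacks import Callback
from notifiers import notify

class MetricsCallback(Callback):
    def on_epoch_end(self, epoch: int, logs: dict = None):
        logs = logs or {}
        for metric, value in logs.items():
            dvclive.log(metric, value)
        dvclive.next_step()
        # Notify once the step has been recorded (placeholder credentials).
        notify('pushover', user='foo', token='bar', message=f'epoch: {epoch}')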

But building the notification step into MetricLogger would have some benefits, like access to internals (e.g. _metrics) and configuration options, in addition to hiding complexity from the end user.

However, I'm not sure whether it is worth implementing this feature inside dvclive or whether it would be better to keep dvclive as lightweight as possible.
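
To make the trade-off concrete, here is a minimal sketch of what the built-in variant could look like. `NotifyingLogger`, its `send` argument, and the `_metrics` dict are hypothetical stand-ins and not dvclive's actual API; the only point is that a logger that owns the metrics can format and send them on every step.

```python
from typing import Callable, Dict, Optional


class NotifyingLogger:
    """Toy metric logger that optionally notifies on every step.

    Hypothetical sketch: this is NOT dvclive's API.
    """

    def __init__(self, send: Optional[Callable[[str], None]] = None):
        self._step = 0
        self._metrics: Dict[str, float] = {}
        # `send` is any callable taking a message, e.g. a thin wrapper
        # around notifiers.notify with the provider arguments bound.
        self._send = send

    def log(self, name: str, value: float) -> None:
        self._metrics[name] = value

    def next_step(self) -> None:
        # The logger has direct access to the metrics of the finished step.
        if self._send is not None:
            summary = ", ".join(
                f"{k}={v:.4f}" for k, v in sorted(self._metrics.items())
            )
            self._send(f"step {self._step}: {summary}")
        self._metrics.clear()
        self._step += 1
```

With `send=None` the logger behaves like a plain one, so the notification stays opt-in.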

Jun 15 '21 10:06 daavoo

@daavoo I'm trying to understand the motivation behind this. 😄
Could you please elaborate? Do you want to update the files more often? What does notify() mean?

EDIT: do you have any references in other ml logger frameworks to this functionality?

Jun 15 '21 14:06 dmpetrov

> @daavoo I'm trying to understand the motivation behind this? Could you please elaborate on this? Do you want to update the files more often? What does notify() mean?

Sorry about the lack of clarity.

The motivation comes from working with deep learning models that take a long time to train (e.g. hours or days). When working under those circumstances, we always ended up writing some sort of "notification" code to complement or integrate into the ml logger. The main reason was to be able to monitor the training loop remotely (i.e. with no need to look at stdout in a terminal).

This notification code takes care of sending a message to some platform (e.g. e-mail, or a slack / discord / telegram channel) containing information like the number of finished epochs (a.k.a. steps in dvclive) and the associated metrics. We also used it to report exceptions that occurred during the training loop.

notify() is usually a function that sends information as a message to an app.
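
A minimal sketch of the two pieces described above, under stated assumptions: `format_step` and `notify_on_error` are hypothetical helper names, and `send` stands for whatever callable actually delivers the message (e-mail, slack, etc.).

```python
import contextlib
from typing import Callable, Dict, Iterator


def format_step(step: int, metrics: Dict[str, float]) -> str:
    """Build the message body for a finished step."""
    parts = [f"{name}: {value:.4f}" for name, value in metrics.items()]
    return f"step {step} finished | " + ", ".join(parts)


@contextlib.contextmanager
def notify_on_error(send: Callable[[str], None]) -> Iterator[None]:
    """Report any exception raised inside the block, then re-raise it."""
    try:
        yield
    except Exception as exc:
        send(f"training failed: {type(exc).__name__}: {exc}")
        raise
```

The training loop would then run inside `with notify_on_error(send):` and call `send(format_step(step, metrics))` after each step.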

> EDIT: do you have any references in other ml logger frameworks to this functionality?

I think that other ml loggers usually have an associated UI with a view that is automatically updated as the plots/information are logged (related to this Studio issue: https://github.com/iterative/studio-support/issues/13).

In addition to that, some ml loggers also provide "notification" utilities:

  • https://docs.wandb.ai/guides/track/advanced/alert
  • https://www.comet.ml/docs/python-sdk/Experiment/#experimentsend_notification

Beyond existing functionality in other ml loggers, I have found different teams and open source communities solving this problem, including some I work/have worked with:

  • https://github.com/huggingface/knockknock
  • https://forums.fast.ai/t/training-metrics-as-notifications-on-mobile-using-callbacks/17330

Jun 15 '21 15:06 daavoo

I've just discovered another open-source tool focused on this kind of functionality:

https://github.com/labmlai/labml

Jul 14 '21 08:07 daavoo

Related to #91

Jul 14 '21 14:07 pared

Another open source tool:

https://github.com/aporia-ai/mlnotify

Sep 09 '21 20:09 daavoo

Interesting integration between DagsHub and New Relic highlighting alerts as one of the main features:

https://dagshub.com/blog/real-time-machine-learning-monitroing-new-relic-dagshub/

Oct 21 '21 13:10 daavoo

Related to https://github.com/iterative/dvclive/issues/91#issuecomment-1035015144, I think the most useful integration here would be making it dead simple to send full reports (similar to the html today) through supported channels.

For example, the slack api could probably be used to generate a message with the metrics and plot images, and similarly for email (personally, I would prioritize slack because it's more collaborative and probably easier for users to set up).

The local html generated now could just be one report/alert format in that case (and the cml markdown report another).

Feb 10 '22 16:02 dberenbaum

> Related to #91 (comment), I think the most useful integration here would be making it dead simple to send full reports (similar to the html today) through supported channels.
>
> For example, the slack api could probably be used to generate a message with the metrics and plot images, and similarly for email (personally, I would prioritize slack because it's more collaborative and probably easier for users to set up).
>
> The local html generated now could just be one report/alert format in that case (and the cml markdown report another).

That would be the way to go, and it was the original idea behind using https://github.com/liiight/notifiers.

For metrics, this is very feasible. However, the images / rendered plots would be kind of tricky because most channels don't support sending images directly. We could rely on cml publish to host the images and send the link (like in the cml markdown report), but this would make CML a dependency for every channel.

Feb 10 '22 18:02 daavoo

> For metrics, this is very feasible. However, the images / rendered plots would be kind of tricky because most channels don't support sending images directly. We could rely on cml publish to host the images and send the link (like in the cml markdown report), but this would make CML a dependency for every channel.

Rather than wrapping a general-purpose text-based notifier with support for many providers, it might be more useful to focus on providers in which we can send the entire report, including images/rendered plots. AFAIK this should be feasible without hosting in Slack (https://api.slack.com/methods/files.upload) and email (https://docs.python.org/3/library/email.examples.html).
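
For the email side, a minimal sketch using only the standard library could look like this. `build_report_email` and its parameters are hypothetical names chosen for illustration, and the report is reduced to a plain-text metrics summary plus one PNG attachment.

```python
from email.message import EmailMessage
from typing import Dict


def build_report_email(
    metrics: Dict[str, float], plot_png: bytes, sender: str, recipient: str
) -> EmailMessage:
    """Build (but do not send) a report email with one attached plot."""
    msg = EmailMessage()
    msg["Subject"] = "Training report"
    msg["From"] = sender
    msg["To"] = recipient
    # Plain-text body: one "name: value" line per metric.
    msg.set_content("\n".join(f"{k}: {v}" for k, v in metrics.items()))
    # Attach the rendered plot directly, so no external hosting is needed.
    msg.add_attachment(
        plot_png, maintype="image", subtype="png", filename="plot.png"
    )
    return msg
```

Sending is then a matter of passing the message to `smtplib.SMTP(...).send_message(msg)` with whatever server and credentials the user has configured.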

I'm not sure text-based alerts add enough value (we could instead have a doc or blog post showing how to use dvclive + https://github.com/liiight/notifiers). Full reports with plots seem like a more unique feature, and they extend dvclive's initial value prop of lightweight live monitoring for model training, providing serverless alerting and reporting anywhere without needing to access the training machine. Since a lot of training happens in headless environments anyway, this seems pretty useful to me. What do you think?

Feb 10 '22 18:02 dberenbaum

> I'm not sure text-based alerts add enough value (we could instead have a doc or blog post showing how to use dvclive + https://github.com/liiight/notifiers). Full reports with plots seem like a more unique feature, and they extend dvclive's initial value prop of lightweight live monitoring for model training, providing serverless alerting and reporting anywhere without needing to access the training machine. Since a lot of training happens in headless environments anyway, this seems pretty useful to me. What do you think?

I think it's useful and would be directly adding value for DVCLive.

I'm a little "worried" about how easy it would be to maintain, because Report Providers sounds like a set of integrations that could grow perpendicular to the ML Frameworks ones.

So far, looking at slack and email APIs, it doesn't look that bad.

Feb 11 '22 18:02 daavoo

@shcheklein mentioned that it might be worthwhile to look into RSS feed aggregators. There are some parallels in how RSS expects a particular schema of elements (https://validator.w3.org/feed/docs/rss2.html) and can publish them in a consistent format, so maybe it can give some ideas for how to implement.
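
To illustrate the parallel, a step could be mapped onto the RSS 2.0 schema roughly as follows. This is only a sketch: `step_to_rss_item` and `build_feed` are hypothetical helpers, and the mapping of a step to `title`/`description`/`guid` is one assumption among many possible ones.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable


def step_to_rss_item(step: int, metrics: Dict[str, float]) -> ET.Element:
    """Represent one logged step as an RSS <item>."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = f"Step {step}"
    ET.SubElement(item, "description").text = ", ".join(
        f"{k}={v}" for k, v in metrics.items()
    )
    # The guid is an opaque identifier here, not a URL.
    guid = ET.SubElement(item, "guid", isPermaLink="false")
    guid.text = f"step-{step}"
    return item


def build_feed(title: str, link: str, items: Iterable[ET.Element]) -> str:
    """Wrap items in a channel with the required title/link/description."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Training progress"
    channel.extend(items)
    return ET.tostring(rss, encoding="unicode")
```

Any consumer that understands the schema could then render the steps consistently, which is the property a report format would want to copy.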

Feb 16 '22 13:02 dberenbaum

So it's about tidying up this sort of thing?

from tqdm.contrib.slack import trange  # or tqdm.contrib.telegram / tqdm.contrib.discord

with trange(live.get_step(), epochs, unit="epoch") as pbar:
    for epoch in pbar:
        ...
        live.log("loss", loss)
        pbar.set_postfix(loss=loss)
        live.next_step()

i.e. providing a callback interface?

live.set_callback(
    on_log=lambda name, metric: pbar.set_postfix({name: metric}),
    on_step=lambda new_step: print(f"starting epoch {new_step:>5d}", file=some_log))

Or is it more advanced? live.notify_slack(on_step=True, channel="#...", token="...")
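
For reference, the callback variant would need very little machinery on the logger side. The `Live` class below is a hypothetical stand-in written for this sketch, not dvclive's real class; `set_callback`, `on_log`, and `on_step` are the names proposed above.

```python
from typing import Any, Callable, Optional


class Live:
    """Hypothetical logger exposing on_log / on_step hooks."""

    def __init__(self) -> None:
        self._step = 0
        self._on_log: Optional[Callable[[str, Any], None]] = None
        self._on_step: Optional[Callable[[int], None]] = None

    def set_callback(
        self,
        on_log: Optional[Callable[[str, Any], None]] = None,
        on_step: Optional[Callable[[int], None]] = None,
    ) -> None:
        self._on_log = on_log
        self._on_step = on_step

    def get_step(self) -> int:
        return self._step

    def log(self, name: str, value: Any) -> None:
        # A real logger would also persist the value; omitted here.
        if self._on_log is not None:
            self._on_log(name, value)

    def next_step(self) -> None:
        self._step += 1
        # The hook receives the number of the step that is about to start.
        if self._on_step is not None:
            self._on_step(self._step)
```

Anything that can be expressed as a callable (a progress bar update, a print, a chat message) could then be attached without the logger knowing about it.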

Apr 04 '22 19:04 casperdcl

Sorry @casperdcl, I missed this comment. It's closer to the latter advanced usage. Probably channel, token, etc. can be set in environment variables, and the method can be something like live.make_report(type="slack").
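
A rough sketch of the environment-variable part, under loud assumptions: the `REPORT_<TYPE>_<KEY>` variable names and the `load_report_config` helper are invented for illustration, and nothing here is defined by dvclive.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical settings required per report type.
REQUIRED_KEYS = {"slack": ("TOKEN", "CHANNEL")}


def load_report_config(
    kind: str, env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Collect the settings for one report type from the environment."""
    env = os.environ if env is None else env
    if kind not in REQUIRED_KEYS:
        raise ValueError(f"unsupported report type: {kind!r}")
    config = {}
    for key in REQUIRED_KEYS[kind]:
        var = f"REPORT_{kind.upper()}_{key}"  # e.g. REPORT_SLACK_TOKEN
        if var not in env:
            raise KeyError(f"missing environment variable {var}")
        config[key.lower()] = env[var]
    return config
```

A method like make_report(type="slack") could call such a helper internally, so the training script itself never contains credentials.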

May 09 '22 13:05 dberenbaum

I don't think we are likely to do this now that we have live metrics in Studio and other solutions exist for alerting.

Mar 06 '23 22:03 dberenbaum