dvclive
logger: Add notifier to `next_step`?
Depending on the type of model being trained, the time between calls to next_step may vary significantly. In common deep learning scenarios, e.g. the Keras callback, next_step is called at the end of each epoch, which can mean long gaps (possibly hours) between calls.
It could be useful to have built-in support for optionally sending a notification each time next_step is called.
Without changing dvclive, the user could just call a custom library (e.g. https://github.com/liiight/notifiers) after next_step:
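import dvclive
from keras.callbacks import Callback
from notifiers import notify

class MetricsCallback(Callback):
    def on_epoch_end(self, epoch: int, logs: dict = None):
        logs = logs or {}
        for metric, value in logs.items():
            dvclive.log(metric, value)
        dvclive.next_step()
        # Send a push notification once the step has been recorded.
        notify('pushover', user='foo', token='bar', message=f'epoch: {epoch}')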
But building the notification step into MetricLogger would have some benefits, such as access to internals (e.g. _metrics) and configuration options, in addition to hiding complexity from the end user.
However, I'm not sure whether it is worth implementing this feature inside dvclive or whether it would be better to keep dvclive as lightweight as possible.
@daavoo I'm trying to understand the motivation behind this? 😄
Could you please elaborate on this? Do you want to update the files more often? What does notify() mean?
EDIT: do you have any references in other ml logger frameworks to this functionality?
Sorry about the lack of clarity.
The motivation comes from working with deep learning models that take a long time to train (e.g. hours or days). Under those circumstances we always ended up writing some sort of "notification" code to complement or integrate into the ml logger. The main reason was to be able to monitor the training loop remotely (i.e. without needing to look at stdout in a terminal).
This notification code takes care of sending a message to some platform (e.g. e-mail, a Slack / Discord / Telegram channel, etc.) containing information like the number of finished epochs (a.k.a. steps in dvclive) and the associated metrics. We also used it to report exceptions that occurred during the training loop.
notify() is usually a function that sends information as a message to an app.
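As a rough illustration of the kind of helper described above, here is a minimal sketch. The names `format_step_message` and `notify_step` are hypothetical and not part of dvclive; `send` stands in for whatever actually delivers the message (a Slack client, an e-mail sender, or a wrapper around https://github.com/liiight/notifiers).

```python
from typing import Callable, Mapping


def format_step_message(step: int, metrics: Mapping[str, float]) -> str:
    """Build a one-line summary of a finished step and its metrics."""
    parts = [f"step {step}"]
    # Sort the metric names so the message layout is stable across steps.
    parts += [f"{name}={value:.4f}" for name, value in sorted(metrics.items())]
    return " | ".join(parts)


def notify_step(
    step: int,
    metrics: Mapping[str, float],
    send: Callable[[str], None] = print,
) -> str:
    """Format the message and hand it to `send` (hypothetical delivery hook)."""
    message = format_step_message(step, metrics)
    send(message)
    return message
```

In a training loop this would be called right after the step is recorded, with `send` swapped for the real delivery function.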
EDIT: do you have any references in other ml logger frameworks to this functionality?
I think that other ml loggers usually have an associated UI with a view that is updated automatically as the plots/information are logged (related to this Studio issue: https://github.com/iterative/studio-support/issues/13).
In addition to that, some ml loggers also provide "notification" utilities:
- https://docs.wandb.ai/guides/track/advanced/alert
- https://www.comet.ml/docs/python-sdk/Experiment/#experimentsend_notification
Beyond existing functionality in other ml loggers, I have found different teams and open source communities solving this problem, including some I work/have worked with:
- https://github.com/huggingface/knockknock
- https://forums.fast.ai/t/training-metrics-as-notifications-on-mobile-using-callbacks/17330
I've just discovered another open-source tool focused on this kind of functionality:
https://github.com/labmlai/labml
Related to #91
Another open source tool:
https://github.com/aporia-ai/mlnotify
Interesting integration between DagsHub and New Relic highlighting alerts as one of the main features:
https://dagshub.com/blog/real-time-machine-learning-monitroing-new-relic-dagshub/
Related to https://github.com/iterative/dvclive/issues/91#issuecomment-1035015144, I think the most useful integration here would be making it dead simple to send full reports (similar to the html today) through supported channels.
For example, the Slack API could probably be used to generate a message with the metrics and plot images, and similarly for email (personally, I would prioritize Slack because it's more collaborative and probably easier for users to set up).
The local html generated now could just be one report/alert format in that case (and the cml markdown report another).
That would be the way to go, and it was the original idea behind using https://github.com/liiight/notifiers.
For metrics, this is very feasible. However, the images / rendered plots would be tricky because most channels don't support sending images directly. We could rely on cml publish to host the images and send the link (as in the CML markdown report), but this would make CML a dependency for every channel.
Rather than wrapping a general-purpose text-based notifier with support for many providers, it might be more useful to focus on providers in which we can send the entire report, including images/rendered plots. AFAIK this should be feasible without hosting in Slack (https://api.slack.com/methods/files.upload) and email (https://docs.python.org/3/library/email.examples.html).
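To make the email option concrete, here is a minimal sketch using only the standard library. The function name `build_report_email`, the addresses, and the attachment file name are assumptions for illustration; nothing here is an existing dvclive API, and the message is only built, not sent.

```python
from email.message import EmailMessage
from typing import Mapping


def build_report_email(
    sender: str,
    recipient: str,
    step: int,
    metrics: Mapping[str, float],
    plot_png: bytes,
) -> EmailMessage:
    """Build an email with the metrics as text and a rendered plot attached."""
    msg = EmailMessage()
    msg["Subject"] = f"Training report: step {step}"
    msg["From"] = sender
    msg["To"] = recipient
    # Plain-text body: one "name: value" line per metric.
    lines = [f"{name}: {value}" for name, value in sorted(metrics.items())]
    msg.set_content("\n".join(lines))
    # Attach the image bytes directly, so no external hosting is needed.
    msg.add_attachment(
        plot_png, maintype="image", subtype="png", filename="plots.png"
    )
    return msg
```

Sending would then be a matter of passing the result to `smtplib.SMTP(...).send_message(msg)`, with the SMTP host and credentials supplied by the user.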
I'm not sure text-based alerts add enough value (we could instead have a doc or blog post showing how to use dvclive + https://github.com/liiight/notifiers). Full reports with plots seem like a more unique feature, and they extend dvclive's initial value prop of lightweight live monitoring for model training, providing serverless alerting and reporting anywhere without needing to access the training machine. Since a lot of training happens in headless environments anyway, this seems pretty useful to me. What do you think?
I think it's useful and would be directly adding value for DVCLive.
I'm a little "worried" about how easy it would be to maintain, because Report Providers sound like integrations that could grow orthogonally to ML Frameworks.
So far, looking at slack and email APIs, it doesn't look that bad.
@shcheklein mentioned that it might be worthwhile to look into RSS feed aggregators. There are some parallels in how RSS expects a particular schema of elements (https://validator.w3.org/feed/docs/rss2.html) and can publish them in a consistent format, so maybe it can give some ideas for how to implement.
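To illustrate the RSS parallel, here is a small sketch that turns logged steps into an RSS 2.0 feed with the standard library. The function name `build_feed` and the mapping of one step to one `item` are assumptions made for this example, not an agreed design.

```python
import xml.etree.ElementTree as ET
from email.utils import formatdate


def build_feed(title, link, description, steps):
    """Render (step, metrics) pairs as an RSS 2.0 document string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link and description on the channel.
    for tag, text in (("title", title), ("link", link), ("description", description)):
        ET.SubElement(channel, tag).text = text
    for step, metrics in steps:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = f"step {step}"
        ET.SubElement(item, "description").text = ", ".join(
            f"{name}={value}" for name, value in sorted(metrics.items())
        )
        # pubDate uses the RFC 822 date format.
        ET.SubElement(item, "pubDate").text = formatdate(usegmt=True)
    return ET.tostring(rss, encoding="unicode")
```

Any feed reader could then render the steps consistently, which is the schema-driven behaviour the comparison points at.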
So it's about tidying up this sort of thing? 
from tqdm.contrib.slack import trange  # or tqdm.contrib.telegram / tqdm.contrib.discord

with trange(live.get_step(), epochs, unit="epoch") as pbar:
    for epoch in pbar:
        ...
        live.log("loss", loss)
        pbar.set_postfix(loss=loss)
        live.next_step()
i.e. providing a callback interface?
live.set_callback(
    on_log=lambda name, metric: pbar.set_postfix({name: metric}),
    on_step=lambda new_step: print(f"starting epoch {new_step:>5d}", file=some_log))
Or is it more advanced? live.notify_slack(on_step=True, channel="#...", token="...")
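For reference, the callback interface floated above could look roughly like the following. `CallbackLogger` is a hypothetical stand-in written for this sketch; dvclive does not provide `set_callback`, and the real logger also writes files, which is omitted here.

```python
class CallbackLogger:
    """Toy logger that invokes user callbacks on log() and next_step()."""

    def __init__(self):
        self.step = 0
        self.metrics = {}
        self._on_log = None
        self._on_step = None

    def set_callback(self, on_log=None, on_step=None):
        # Either hook may be left as None to disable it.
        self._on_log = on_log
        self._on_step = on_step

    def log(self, name, value):
        self.metrics[name] = value
        if self._on_log is not None:
            self._on_log(name, value)

    def next_step(self):
        self.step += 1
        if self._on_step is not None:
            self._on_step(self.step)
```

With hooks like these, the Slack or Telegram progress bar stays in user code and the logger itself needs no provider-specific dependencies.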
Sorry @casperdcl, I missed this comment. It's closer to the latter advanced usage. Probably channel, token, etc. can be set in environment variables, and the method can be something like live.make_report(type="slack").
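A sketch of how the environment-variable configuration might be resolved per report type. The variable names (`DVCLIVE_SLACK_TOKEN` and so on) and the helper `load_report_settings` are invented for illustration; no such settings exist in dvclive.

```python
import os

# Hypothetical mapping from report type to the variables it needs.
REPORT_ENV_VARS = {
    "slack": ("DVCLIVE_SLACK_TOKEN", "DVCLIVE_SLACK_CHANNEL"),
    "email": ("DVCLIVE_EMAIL_TO",),
}


def load_report_settings(report_type, environ=None):
    """Return the settings for `report_type`, read from the environment."""
    environ = os.environ if environ is None else environ
    try:
        names = REPORT_ENV_VARS[report_type]
    except KeyError:
        raise ValueError(f"unsupported report type: {report_type!r}") from None
    # Fail early with a clear message instead of at send time.
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

A method like `live.make_report(type="slack")` could call such a helper first and only then build and send the report.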
I don't think we are likely to do this now that we have live metrics in Studio and other solutions exist for alerting.