
Bug: metrics not set on the pipeline if train is reused, which then breaks the pipeline

Open · kodonnell opened this issue 4 years ago · 1 comment

I'm hitting an issue where, about half the time, this line fails because `run.parent.get_metrics()` is empty. Sometimes it works, sometimes it doesn't, so it looks like a timing issue. If I print `run.parent.get_metrics()` here, it's always empty (even when the overall pipeline run succeeds), so it's definitely not just a propagation delay. Reviewing in the portal, the training step always gets a metric; it's only the overall pipeline run that (sometimes) doesn't.

The only other odd thing I can note is that the timestamps in the log files are wrong. I haven't dug in fully, but sometimes they're off by e.g. 45 minutes, and not consistently (e.g. the eval step logs are right but the train logs aren't). In one case the eval timestamps were earlier than the train ones.

NB: I'm not using the diabetes example, just a trivial sklearn `DummyRegressor`, and I'm on a local build agent (to avoid waiting 8 minutes per pipeline for container startup). Since it does work some of the time, though, I don't think either of these is the cause.

Example run ID (if the Azure team can use this for debugging): 80befea6-7e33-48d6-a8e5-a8b718dad88c

Side note: this code never gets evaluated, since `float()` never returns `None`.
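To illustrate the point: `float()` either returns a float or raises (`ValueError`/`TypeError`); the only way to end up with `None` is an explicit fallback of your own. A quick sketch (the `parse_metric` helper is hypothetical, just for demonstration):

```python
def parse_metric(raw):
    """Cast a raw metric value to float.

    float() either succeeds or raises; it never returns None,
    so a `result is None` check after a bare float() call is dead code.
    """
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None  # None can only come from an explicit fallback like this

print(parse_metric("0.25"))  # 0.25
print(parse_metric("oops"))  # None (from our fallback, not from float())
```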

kodonnell · Jun 10 '20 10:06

OK, figured it out; here's the culprit. If the training step is reused, `run.parent.log` never gets called, and hence the issue I'm having will always occur. The easy fix: set this to `False`. A better fix (?): move the `run.parent.log` calls out of the train script, so the parent run gets updated appropriately. Or: don't use `run.parent.get_metrics()` at all, and instead (with some error handling) use `[i for i in run.parent.get_children() if i.name == 'Training Run'][0].get_metrics()`.

> The only other weird thing I can note is that the timestamps are wrong in the log files

Right, this is because the train step was cached.
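For reference, the "easy fix" of disabling caching corresponds to the `allow_reuse` flag on the pipeline step. A hedged sketch, assuming `azureml.pipeline.steps.PythonScriptStep` (other required arguments, such as the compute target and run config, are elided here):

```python
from azureml.pipeline.steps import PythonScriptStep

# allow_reuse=False forces the train step to re-run on every pipeline
# execution, so run.parent.log() is always called and the parent run's
# metrics are never left empty by a cached step.
train_step = PythonScriptStep(
    name="Training Run",
    script_name="train.py",
    allow_reuse=False,  # the "easy fix": disable step output reuse
)
```

The trade-off is losing the caching speedup on unchanged inputs, which is why moving the `run.parent.log` calls out of the train script is arguably the better fix.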

kodonnell · Jun 10 '20 11:06