MLOpsPython
Bug: metrics not set on the pipeline if train is reused, which then breaks the pipeline
So I'm facing an issue where, about half the time, this line fails because run.parent.get_metrics() is empty. Sometimes it works, sometimes it doesn't, so it looks like a timing thing. If I print run.parent.get_metrics() here it's always empty (even when the overall pipeline run succeeds), so it's definitely not just a momentary delay. Reviewing the runs in the portal, the training step always gets a metric; it's only the overall pipeline run that (sometimes) doesn't.
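For context, here is a minimal sketch of the kind of parent-metric check that fails; the metric name "mse" is illustrative, not necessarily what the template uses:

```python
from azureml.core import Run

# Inside the evaluation step of the pipeline.
run = Run.get_context()

# This comes back empty roughly half the time in my runs,
# even though the training step itself shows the metric in the portal.
parent_metrics = run.parent.get_metrics()
print("parent metrics:", parent_metrics)

if "mse" not in parent_metrics:
    raise Exception("No 'mse' metric found on the parent pipeline run")
```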
The only other weird thing I can note is that the timestamps in the log files are wrong. I haven't dug into it fully, but sometimes they're off by e.g. 45 minutes, and not consistently (e.g. the eval step's logs are right but the train step's aren't). Once, I think, the eval timestamps were earlier than the train timestamps.
NB - I'm not using the diabetes example; the model is just a trivial sklearn DummyRegressor. I'm also using a local build agent (to avoid waiting ~8 minutes per pipeline run for container startup). However, it does work some of the time, so I don't think either of these is the issue.
Example run ID (if the Azure team can use this for debugging): 80befea6-7e33-48d6-a8e5-a8b718dad88c
Side note: this code never gets evaluated, as I don't believe float() ever returns None.
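For reference, a hypothetical reconstruction of the guard I mean (the variable and metric names are my own; the point is just that float() raises rather than returning None):

```python
from azureml.core import Run

run = Run.get_context()
metrics = run.parent.get_metrics()

# float() either returns a float or raises (TypeError/ValueError);
# it never returns None, so the None check below is dead code.
production_mse = float(metrics.get("mse"))
if production_mse is None:
    raise Exception("Parent run has no 'mse' metric")
```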
OK, figured it out - here's the culprit. If the training step is re-used, then run.parent.log will never get called, and hence the issue I'm having will always occur. The easy fix: set this to False. A better fix (?): move the run.parent.log calls outside of the train script, so that the parent metrics get updated on every pipeline run. Or: don't use run.parent.get_metrics at all, and instead use (with some guards) [i for i in run.parent.get_children() if i.name == 'Training Run'][0].get_metrics.
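A sketch of that last workaround; it assumes the training step's child run is literally named 'Training Run', as in my pipeline, so adjust the name and guards as needed:

```python
from azureml.core import Run

run = Run.get_context()

# Read the metric from the training step's own run instead of the parent,
# so it is present even when the parent-level log call was skipped by re-use.
train_runs = [c for c in run.parent.get_children() if c.name == "Training Run"]
if not train_runs:
    raise Exception("No child run named 'Training Run' under the parent pipeline run")

metrics = train_runs[0].get_metrics()
print("training metrics:", metrics)
```

As for the "easy fix", I assume the flag in question is the training step's allow_reuse parameter (e.g. on the PythonScriptStep), which defaults to True; passing False forces the step to actually re-run so run.parent.log gets called.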
The only other weird thing I can note is that the timestamps are wrong in the log files
Right, this is because the train step was cached - the re-used step's logs (and their timestamps) come from the earlier run.