code-intelligence [Label Bot Continuous Training] Needs Training Needs to take into account whether there is a model currently being trained

Open jlewi opened this issue 5 years ago • 4 comments

Our synchronous training pipeline is currently spawning multiple instances of training rather than the expected 1 model per hour.

The problem appears to be that the code that decides whether to train a model only checks whether a trained model exists. So I don't think we take into account whether a model is currently being trained. https://github.com/kubeflow/code-intelligence/blob/faeb65757214ac93259f417b81e9e2fedafaebda/Label_Microservice/go/cmd/automl/pkg/automl/automl.go#L101

My conjecture is that the following happens:

- We launch a Tekton job to train the model
- The notebook loads the data into AutoML, which is a blocking operation
- The notebook initiates an AutoML training job but doesn't block until training is complete
  - This is intentional since we want to upload the notebook output and not wait for the AutoML job to complete.

At this point:

- A new model doesn't exist yet (it is still being trained)
- needsTrain will continue to return true
- Since there is no Tekton job running, the controller will launch another job
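
To make the conjecture concrete, here is a minimal Go sketch that replays it as a toy simulation. It is not the code in automl.go: `pipelineState`, `shouldLaunch`, `countLaunches`, and the notion of discrete controller "ticks" are all hypothetical names and simplifications introduced for illustration.

```go
package main

// pipelineState is a hypothetical snapshot of what the controller observes.
// Note that it carries no information about an in-progress AutoML training.
type pipelineState struct {
	hasTrainedModel  bool // a trained model already exists
	tektonJobRunning bool // the Tekton job (the notebook) is still running
}

// shouldLaunch mirrors the conjectured behavior: launch whenever there is
// no trained model and no Tekton job running.
func shouldLaunch(s pipelineState) bool {
	return !s.hasTrainedModel && !s.tektonJobRunning
}

// countLaunches replays `ticks` controller checks. Each launched job keeps
// Tekton busy for notebookTicks, after which AutoML needs trainTicks more
// before the first model exists. It returns how many jobs were launched.
func countLaunches(ticks, notebookTicks, trainTicks int) int {
	launches := 0
	jobEndsAt := -1    // tick at which the current Tekton job finishes
	modelReadyAt := -1 // tick at which the first model becomes available
	for t := 0; t < ticks; t++ {
		s := pipelineState{
			hasTrainedModel:  modelReadyAt >= 0 && t >= modelReadyAt,
			tektonJobRunning: t < jobEndsAt,
		}
		if shouldLaunch(s) {
			launches++
			jobEndsAt = t + notebookTicks
			if modelReadyAt < 0 {
				// Only the first training run determines when a model appears.
				modelReadyAt = jobEndsAt + trainTicks
			}
		}
	}
	return launches
}
```

Under these assumptions, `countLaunches(10, 1, 5)` returns 6 instead of the expected 1: every check that falls between the notebook exiting and the first model finishing starts another training run.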

Jul 26 '20 17:07 jlewi

Issue-Label Bot is automatically applying the labels:

Label	Probability
kind/bug	0.63

Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.

Jul 26 '20 17:07 issue-label-bot[bot]

It looks like we need to also look at the datasets and see if there is a model training in progress.
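
A rough sketch of what that check could look like, in Go. This is only an illustration under assumptions: `trainingOp`, `datasetStatus`, `trainingInProgress`, and `needsTrainWithInProgress` are hypothetical names, and populating them from the real AutoML API (listing datasets and their pending training operations) is left out.

```go
package main

// trainingOp is a hypothetical, simplified view of one AutoML training
// operation associated with a dataset.
type trainingOp struct {
	Done bool // true once the operation has finished (successfully or not)
}

// datasetStatus is a hypothetical summary of one dataset.
type datasetStatus struct {
	HasTrainedModel bool         // a usable trained model already exists
	Operations      []trainingOp // training operations started for this dataset
}

// trainingInProgress reports whether any training operation is still running.
func trainingInProgress(d datasetStatus) bool {
	for _, op := range d.Operations {
		if !op.Done {
			return true
		}
	}
	return false
}

// needsTrainWithInProgress returns true only if no dataset has a trained
// model AND no dataset has a training operation still in progress.
func needsTrainWithInProgress(datasets []datasetStatus) bool {
	for _, d := range datasets {
		if d.HasTrainedModel || trainingInProgress(d) {
			return false
		}
	}
	return true
}
```

With a check like this, a dataset whose model is still training would make the controller skip the launch instead of starting a duplicate job.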

Jul 26 '20 17:07 jlewi

#182 is an auto-generated PR, created for a model that was trained by manually running the notebook.

Need to verify that a new model is trained automatically and then deployed.

Oct 05 '20 13:10 jlewi

PR kubeflow/code-intelligence#184 was opened to update to the same model. It doesn't look like a new model got trained.

Oct 06 '20 14:10 jlewi
