
[Label Bot Continuous Training] Needs Training Needs to take into account whether there is a model currently being trained

Open jlewi opened this issue 5 years ago • 4 comments

Our synchronous training pipeline is currently spawning multiple instances of training rather than the expected 1 model per hour.

The problem appears to be that the code that decides whether to train a model only checks whether a trained model exists, so I don't think we take into account whether a model is currently being trained. https://github.com/kubeflow/code-intelligence/blob/faeb65757214ac93259f417b81e9e2fedafaebda/Label_Microservice/go/cmd/automl/pkg/automl/automl.go#L101

My conjecture is that the following happens:

  • We launch a Tekton job to train the model
  • The notebook loads the data into AutoML, which is a blocking operation
  • The notebook initiates an AutoML training job but doesn't block until training is complete
    • This is intentional since we want to upload the notebook output and not wait for the AutoML job to complete.

At this point

  • A new model doesn't exist yet (it is still being trained)
  • needsTrain will continue to return true
  • Since there is no Tekton job running the controller will launch another job
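
The conjectured sequence can be reproduced with a small simulation. This is only a sketch of the reasoning above, not the actual controller: `cloudState`, `needsTrainNaive`, and `simulate` are hypothetical names, and it assumes each Tekton job finishes within one controller tick while the AutoML training outlasts the whole simulation.

```go
package main

import "fmt"

// cloudState is a hypothetical stand-in for what the controller can observe.
type cloudState struct {
	trainedModels   int  // models that have finished training
	tektonJobActive bool // whether a Tekton job is still running
}

// needsTrainNaive mirrors the behaviour described in the issue:
// it only asks whether a trained model exists.
func needsTrainNaive(s cloudState) bool {
	return s.trainedModels == 0
}

// simulate runs the controller loop for the given number of ticks and
// returns how many training jobs were launched. Training never completes
// during the simulation, so trainedModels stays at zero throughout.
func simulate(ticks int, needsTrain func(cloudState) bool) int {
	s := cloudState{}
	launched := 0
	for i := 0; i < ticks; i++ {
		if !s.tektonJobActive && needsTrain(s) {
			// No job is running and no model exists, so launch one.
			launched++
			s.tektonJobActive = true
		} else {
			// The notebook returns as soon as AutoML training has been
			// started, so the Tekton job ends before the model exists.
			s.tektonJobActive = false
		}
	}
	return launched
}

func main() {
	fmt.Println("jobs launched:", simulate(6, needsTrainNaive))
}
```

Under these assumptions a new job is launched every other tick, even though the first training run has not finished.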

jlewi avatar Jul 26 '20 17:07 jlewi

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.63

Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.

issue-label-bot[bot] avatar Jul 26 '20 17:07 issue-label-bot[bot]

It looks like we need to also look at the datasets and see if there is a model training in progress.
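
A minimal sketch of what that check might look like, assuming the controller can summarize the AutoML state into two flags. `trainingState` and `shouldTrain` are hypothetical names for illustration, not the code in `automl.go`, and how `trainingInProgress` would actually be derived from the datasets is left out.

```go
package main

import "fmt"

// trainingState is a hypothetical summary of what the controller would
// gather from AutoML before deciding whether to launch a training job.
type trainingState struct {
	hasTrainedModel    bool // a finished model already exists
	trainingInProgress bool // a model is currently being trained
}

// shouldTrain returns true only when no model exists and none is
// currently being trained.
func shouldTrain(s trainingState) bool {
	if s.trainingInProgress {
		// A model is on its way; launching another job would duplicate work.
		return false
	}
	return !s.hasTrainedModel
}

func main() {
	states := []trainingState{
		{hasTrainedModel: false, trainingInProgress: false},
		{hasTrainedModel: false, trainingInProgress: true},
		{hasTrainedModel: true, trainingInProgress: false},
	}
	for _, s := range states {
		fmt.Printf("%+v -> shouldTrain=%v\n", s, shouldTrain(s))
	}
}
```

Checking the in-progress flag first means the window between the Tekton job exiting and the model appearing no longer looks like "no model, nothing running".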

jlewi avatar Jul 26 '20 17:07 jlewi

#182 auto PR created for a model trained by manually running the notebook.

Need to verify that a new model is trained automatically and then deployed.

jlewi avatar Oct 05 '20 13:10 jlewi

kubeflow/code-intelligence#184 opened a PR to update to the same model. It doesn't look like a new model got trained.

jlewi avatar Oct 06 '20 14:10 jlewi