code-intelligence
[Label Bot Continuous Training] Needs Training Needs to take into account whether there is a model currently being trained
Our continuous training pipeline is currently spawning multiple training runs rather than the expected one model per hour.
The problem appears to be that the code that decides whether to train a model only checks whether a trained model exists. So I don't think we take into account whether a model is currently being trained. https://github.com/kubeflow/code-intelligence/blob/faeb65757214ac93259f417b81e9e2fedafaebda/Label_Microservice/go/cmd/automl/pkg/automl/automl.go#L101
My conjecture is that the following happens:
- We launch a Tekton job to train the model
- The notebook loads the data into AutoML, which is a blocking operation
- The notebook initiates an AutoML training job but doesn't block until training is complete
- This is intentional since we want to upload the notebook output and not wait for the AutoML job to complete.
At this point
- A new model doesn't exist yet (it is still being trained)
- needsTrain will continue to return true
- Since there is no Tekton job running, the controller will launch another job
Issue-Label Bot is automatically applying the labels:
| Label | Probability |
|---|---|
| kind/bug | 0.63 |
It looks like we need to also look at the datasets and see if there is a model training in progress.
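A rough sketch of what that check could look like is below. The types and names (`trainingOp`, `snapshot`, `trainingInProgress`, `needsTrain` with this signature) are assumptions made for illustration; how the in-progress training operations are actually listed from AutoML is not shown here and would need to be filled in.

```go
package main

import "fmt"

// trainingOp is a hypothetical record of a training operation for a dataset.
type trainingOp struct {
	Dataset string
	Done    bool // false while the model is still being trained
}

// snapshot is a hypothetical view of the state the controller inspects.
type snapshot struct {
	FreshModelExists bool
	Ops              []trainingOp
}

// trainingInProgress reports whether any unfinished training operation
// exists for the given dataset.
func trainingInProgress(ops []trainingOp, dataset string) bool {
	for _, op := range ops {
		if op.Dataset == dataset && !op.Done {
			return true
		}
	}
	return false
}

// needsTrain returns true only if there is no fresh model AND no model
// is currently being trained for the dataset.
func needsTrain(s snapshot, dataset string) bool {
	if s.FreshModelExists {
		return false
	}
	return !trainingInProgress(s.Ops, dataset)
}

func main() {
	s := snapshot{Ops: []trainingOp{{Dataset: "issues", Done: false}}}
	fmt.Println(needsTrain(s, "issues")) // training already in progress
}
```

With a check along these lines, the controller would skip launching a new Tekton job while a training operation for the same dataset is still unfinished.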
#182 auto PR created for a model trained by manually running the notebook.
Need to verify that a new model is trained automatically and then deployed.
kubeflow/code-intelligence#184 opened a PR to update to the same model. It doesn't look like a new model got trained.