code-intelligence
Periodically retrain the Kubeflow Label model
New data is constantly arriving as GitHub issues that have been labeled by humans or the label bot.
We would like to periodically retrain our model to benefit from this new data.
To do this we first need to:
- Create a pipeline to train a model on all the data in Kubeflow #110
- Use KFP to periodically (once a day) run this pipeline to train a new model
- Update the label bot so that it uses the latest trained model.
#110 has pointers to lots of the appropriate code locations.
Follow on to #70 and #110
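A minimal sketch of the three steps above (all function names, the data, and the "model" format here are hypothetical stand-ins; in the real pipeline each step would be a KFP component and the daily schedule would come from a KFP recurring run, not an in-process call):

```python
from collections import Counter

def fetch_labeled_issues():
    # Stand-in for step 1's data source: in the real pipeline this would
    # pull all human- and bot-labeled GitHub issues for the Kubeflow org.
    return [
        {"title": "add gpu support to notebooks", "label": "kind/feature"},
        {"title": "pipeline crashes on start", "label": "kind/bug"},
        {"title": "support tpu training", "label": "kind/feature"},
    ]

def train_label_model(issues):
    # Stand-in for the real training job: this toy "model" just predicts
    # the most common label seen in the training data.
    counts = Counter(issue["label"] for issue in issues)
    most_common = counts.most_common(1)[0][0]
    return {"version": 1, "predict": lambda title: most_common}

def publish_model(model):
    # Stand-in for step 3: write the artifact somewhere the label bot
    # reads from, so the bot picks up the latest trained model.
    return "models/label-model-v{}".format(model["version"])

def run_retraining_pipeline():
    issues = fetch_labeled_issues()
    model = train_label_model(issues)
    return publish_model(model), model

location, model = run_retraining_pipeline()
print(location)
print(model["predict"]("some new issue title"))
```

The point of the sketch is just the data flow between the three steps; the actual training code and storage locations are the ones referenced from #110.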
Issue-Label Bot is automatically applying the labels:
| Label | Probability |
| --- | --- |
| kind/feature | 0.92 |
Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.
Are the GitHub issues with human labels vs. label-bot labels currently scraped and stored, or does this need to be done as part of this issue as well?
@jlewi I would like to work on this issue as a part of GSoC 2020. I have experience with Python, TensorFlow, and Kubernetes.
@NikeNano currently we do not have any code which measures the actual accuracy of the label bot by looking at whether the labels applied by the label bot were later changed by a human. I think that would be interesting data to collect as part of measuring the efficacy of the bot.
I do not think that is strictly necessary for this issue, but if it's something you'd like to work on, feel free to submit a proposal.
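One way to measure that efficacy (a sketch; the per-issue event format below is hypothetical, not what GitHub's API actually returns) is to count a bot-applied label as correct when no human later removed it:

```python
def bot_label_accuracy(events):
    """Estimate label-bot accuracy from per-issue label histories.

    `events` maps issue number -> list of (actor, action, label) tuples,
    a made-up format for illustration; a bot-applied label counts as
    wrong if a human later removed that label from the same issue.
    """
    applied, kept = 0, 0
    for history in events.values():
        for i, (actor, action, label) in enumerate(history):
            if actor == "label-bot" and action == "added":
                applied += 1
                removed_later = any(
                    a == "human" and act == "removed" and lbl == label
                    for a, act, lbl in history[i + 1:]
                )
                if not removed_later:
                    kept += 1
    return kept / applied if applied else None

sample = {
    101: [("label-bot", "added", "kind/feature")],
    102: [("label-bot", "added", "kind/bug"),
          ("human", "removed", "kind/bug"),
          ("human", "added", "kind/feature")],
}
print(bot_label_accuracy(sample))  # 1 of 2 bot labels survived -> 0.5
```

Collecting the real histories would mean scraping issue timeline events, which is part of why the question above about what is already scraped and stored matters.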
@asif001 that's great! If you're interested, I would suggest starting to draft a proposal to submit to GSoC.
I think the key questions to answer would be
- What would the deliverables be?
- e.g. A pipeline, docker images, models, etc...
- How would you go about producing those deliverables?
- What resources will you need?
- e.g. a Kubeflow cluster to train models on etc...
/area gsoc
How much more data do you get in a day? Wouldn't it be prohibitively expensive to train a new model every day? Could a GMM be used instead of a neural network? I'm guessing these are all naive questions and I'll find my answers with a bit more searching (I'm also interested in this as a GSoC project).
@AnkilP you are probably right: retraining every day may not make the most sense with respect to the amount of new data. You could look at the number of issues KF gets each day. The primary reason to start off training with high frequency is to test that it is working reliably. Once we know it is working reliably, we can reduce the frequency to a sensible amount.
Iterating on the model is also a good idea. I think, though, that we would get more benefit in the short term from getting retraining in place, and then, once that's done, iterating on other possible models.