code-intelligence
Periodically retrain the Kubeflow Label model
New data is constantly arriving as GitHub issues that have been labeled by humans or the label bot.
We would like to periodically retrain our model to benefit from this new data.
To do this we first need to:
- Create a pipeline to train a model on all the data in Kubeflow #110
- Use KFP to periodically (once a day) run this pipeline to train a new model
- Update the label bot so that it uses the latest trained model.
#110 has pointers to lots of the appropriate code locations.
Follow on to #70 and #110
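A minimal sketch of the three steps above (all function names, the data, and the "model" format here are hypothetical stand-ins; in the real pipeline each step would be a KFP component and the daily schedule would come from a KFP recurring run, not an in-process call):

```python
from collections import Counter

def fetch_labeled_issues():
    # Stand-in for step 1's data source: in the real pipeline this would
    # pull all human- and bot-labeled GitHub issues for the Kubeflow org.
    return [
        {"title": "add gpu support to notebooks", "label": "kind/feature"},
        {"title": "pipeline crashes on start", "label": "kind/bug"},
        {"title": "support tpu training", "label": "kind/feature"},
    ]

def train_label_model(issues):
    # Stand-in for the real training job: this toy "model" just predicts
    # the most common label seen in the training data.
    counts = Counter(issue["label"] for issue in issues)
    most_common = counts.most_common(1)[0][0]
    return {"version": 1, "predict": lambda title: most_common}

def publish_model(model):
    # Stand-in for step 3: write the artifact somewhere the label bot
    # reads from, so the bot picks up the latest trained model.
    return "models/label-model-v{}".format(model["version"])

def run_retraining_pipeline():
    issues = fetch_labeled_issues()
    model = train_label_model(issues)
    return publish_model(model), model

location, model = run_retraining_pipeline()
print(location)
print(model["predict"]("some new issue title"))
```

The point of the sketch is just the data flow between the three steps; the actual training code and storage locations are the ones referenced from #110.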
Issue-Label Bot is automatically applying the labels:
| Label | Probability |
| --- | --- |
| kind/feature | 0.92 |
Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.
Are the GitHub issues with human labels vs. label-bot labels currently scraped and stored, or does this need to be done as part of this issue as well?
@jlewi I would like to work on this issue as a part of GSoC 2020. I have experience with Python, TensorFlow, and Kubernetes.
@NikeNano currently we do not have any code which measures the actual accuracy of the label bot by looking at whether the labels applied by the label bot were later changed by a human. I think that would be interesting data to collect as part of measuring the efficacy of the bot.
I do not think that is strictly necessary for this issue, but if it's something you'd like to work on, feel free to submit a proposal.
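One way to measure that efficacy (a sketch; the per-issue event format below is hypothetical, not what GitHub's API actually returns) is to count a bot-applied label as correct when no human later removed it:

```python
def bot_label_accuracy(events):
    """Estimate label-bot accuracy from per-issue label histories.

    `events` maps issue number -> list of (actor, action, label) tuples,
    a made-up format for illustration; a bot-applied label counts as
    wrong if a human later removed that label from the same issue.
    """
    applied, kept = 0, 0
    for history in events.values():
        for i, (actor, action, label) in enumerate(history):
            if actor == "label-bot" and action == "added":
                applied += 1
                removed_later = any(
                    a == "human" and act == "removed" and lbl == label
                    for a, act, lbl in history[i + 1:]
                )
                if not removed_later:
                    kept += 1
    return kept / applied if applied else None

sample = {
    101: [("label-bot", "added", "kind/feature")],
    102: [("label-bot", "added", "kind/bug"),
          ("human", "removed", "kind/bug"),
          ("human", "added", "kind/feature")],
}
print(bot_label_accuracy(sample))  # 1 of 2 bot labels survived -> 0.5
```

Collecting the real histories would mean scraping issue timeline events, which is part of why the question above about what is already scraped and stored matters.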
@asif001 that's great! If you're interested, I would suggest starting to draft a proposal to submit to GSoC.
I think the key questions to answer would be
- What would the deliverables be?
- e.g. A pipeline, docker images, models, etc...
- How would you go about producing those deliverables?
- What resources will you need?
- e.g. a Kubeflow cluster to train models on etc...
/area gsoc
How much more data do you get in a day? Wouldn't it be prohibitively expensive to train a new model every day? Could a GMM be used instead of a neural network? I'm guessing these are all naive questions and I'll find my answers with a bit more searching (I'm also interested in this as a GSoC project).
@AnkilP you are probably right: retraining every day may not make the most sense with respect to the amount of new data. You could look at the number of issues KF gets each day. The primary reason to start off training with high frequency is to test that it is working reliably. Once we know it is working reliably, we can reduce the frequency to a sensible amount.
Iterating on the model is also a good idea. I think, though, that we would get more benefit in the short term from getting retraining in place, and then, once that's done, iterating on other possible models.