augur
Implement the hdpmodel module from gensim for topic modeling in the clustering_worker
There is significant diversity in the optimal number of topics in a topic model for any given collection of repositories. The hdpmodel module from gensim accommodates this diversity by determining the optimal number of topics for a given collection. Currently, the number of topics is preset, and it could easily be made configurable. However, the number is going to be different for each collection of repositories we do topic modeling for; ergo, an automated way to determine that number is a better approach than what Augur currently does.
Potential solutions: The description in this post makes it clear that the hdpmodel from gensim will be a better approach, though other implementations of this optimization exist: https://www.kaggle.com/akashram/topic-modeling-intro-implementation
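For anyone picking this up, here is a minimal sketch of how the topic count could be read off a gensim `HdpModel`. It is not the existing worker code. The function names `count_active_topics` and `infer_topic_count` and the `min_weight` threshold are assumptions made for illustration, and fetching the messages from the Augur database is not shown. Note that HDP is fit with a fixed upper bound on topics (the `T` truncation parameter), so the "number of topics" here means the topics that documents actually use.

```python
from collections import Counter


def count_active_topics(doc_topics, min_weight=0.1):
    """Count distinct topic ids that carry at least `min_weight` in any document.

    `doc_topics` holds one entry per document, each a list of
    (topic_id, weight) pairs, which is the shape gensim returns for hdp[bow].
    The 0.1 default is an arbitrary illustrative threshold, not a tuned value.
    """
    usage = Counter()
    for topics in doc_topics:
        for topic_id, weight in topics:
            if weight >= min_weight:
                usage[topic_id] += 1
    return len(usage), usage


def infer_topic_count(tokenized_docs, min_weight=0.1, random_state=42):
    """Fit an HDP model and estimate how many topics the corpus uses.

    `tokenized_docs` is a list of token lists (one per message); how the
    messages are loaded and tokenized is left to the caller.
    """
    # Imported here so count_active_topics can be used without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import HdpModel

    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    hdp = HdpModel(corpus=corpus, id2word=dictionary, random_state=random_state)

    # hdp[bow] yields the (topic_id, weight) pairs inferred for one document.
    doc_topics = [hdp[bow] for bow in corpus]
    n_topics, _ = count_active_topics(doc_topics, min_weight)
    return n_topics, hdp
```

The resulting count could then be used in place of the preset number of topics, or the fitted HDP model could be used directly. Which of the two fits the clustering code better is an open question.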
@sarit-adh may be able to do this as the original author of the worker.
The current version of the worker is in the main branch, though it will likely be in the release on Monday, April 8th, 2022.
Here's another useful link: https://towardsdatascience.com/unsupervised-nlp-topic-models-as-a-supervised-learning-input-cf8ee9e5cf28
Another useful link: https://nlpforhackers.io/topic-modeling/
And another: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#15visualizethetopicskeywords
I want to work on this issue, but I have never used topic modelling. Where can I find the data for the model? @sgoggins (sorry for the ping :))
@sgoggins let's discuss this issue!
@WhiteWolf47 : I think the most significant difference in conceptualization of this is our switch from a "worker" model in the software architecture to a "task" model. In the current release of Augur this code is located at augur/tasks/data_analysis/message_insights
@sgoggins okay, so we are looking to replace the current worker model with an hdpmodel (task model). I don't really know how we get the data and what we do with the analysed data. I have searched about hdpmodel and tried its basic implementation, so let's discuss this in depth on Saturday (during the CHAOSS Software meeting).
@WhiteWolf47 : if you are still interested I'll tag this as a first-timers-only issue which gets you direct support and access to the maintainers.
Turns out this is not the best approach we could come up with! #fixed.