
Implement the hdpmodel module from gensim for topic modeling in the clustering_worker

Open sgoggins opened this issue 3 years ago • 3 comments

There is significant diversity in the optimal number of topics in a topic model for any given collection of repositories. The hdpmodel module from gensim accommodates this diversity by determining the optimal number of topics for a given collection. Currently, the number of topics in Augur is preset, and it could easily be made configurable. However, the number is going to be different for each collection of repositories we do topic modeling for; ergo, an automated way to determine that number is a better approach than what Augur currently does.

Potential solutions: The description in this post makes it clear that the hdpmodel from gensim will be a better approach, though other implementations of this optimization exist: https://www.kaggle.com/akashram/topic-modeling-intro-implementation
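For illustration only, here is a minimal sketch of how gensim's `HdpModel` could be used to see how many topics a collection actually uses. It is not Augur's code: the function names `dominant_topic_counts` and `infer_topic_usage` are hypothetical, the input is assumed to be already-tokenized messages, and counting the topics that dominate at least one document is just one possible heuristic for reading a topic count out of an HDP fit.

```python
from collections import Counter


def dominant_topic_counts(doc_topics):
    """Count how many documents each topic dominates.

    doc_topics: one list of (topic_id, weight) pairs per document, the
    shape returned by indexing a gensim model with a bag-of-words vector.
    """
    counts = Counter()
    for topics in doc_topics:
        if topics:  # a document may come back with no topic above the cutoff
            topic_id, _ = max(topics, key=lambda pair: pair[1])
            counts[topic_id] += 1
    return counts


def infer_topic_usage(tokenized_docs):
    """Fit an HDP model and report which topics the documents actually use.

    Hypothetical helper: tokenized_docs is a list of token lists.
    """
    # Imported here so the helper above stays usable without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import HdpModel

    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    hdp = HdpModel(corpus, id2word=dictionary)

    # HdpModel is truncated at a maximum number of topics (parameter T),
    # so the size of its topic matrix is not the inferred topic count.
    # Assumption: a topic "counts" if it is the top topic of some document.
    return dominant_topic_counts(hdp[doc] for doc in corpus)
```

Calling `len(infer_topic_usage(docs))` would then give a rough topic count for that collection, which could be compared against the preset value or passed to an LDA model.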

@sarit-adh may be able to do this as the original author of the worker.

The current version of the worker is in the main branch, though it will likely be in the release on Monday, April 8th, 2022.

sgoggins avatar Mar 20 '21 20:03 sgoggins

Here's another useful link: https://towardsdatascience.com/unsupervised-nlp-topic-models-as-a-supervised-learning-input-cf8ee9e5cf28

sgoggins avatar Mar 20 '21 21:03 sgoggins

Another useful link: https://nlpforhackers.io/topic-modeling/

sgoggins avatar Mar 20 '21 21:03 sgoggins

And another: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#15visualizethetopicskeywords

sgoggins avatar Mar 20 '21 21:03 sgoggins

I want to work on this issue, though I have never used topic modelling. Where can I find the data for the model? @sgoggins (sorry for the ping :))

WhiteWolf47 avatar Dec 25 '22 12:12 WhiteWolf47

@sgoggins let's discuss this issue!

WhiteWolf47 avatar Jan 18 '23 16:01 WhiteWolf47

@WhiteWolf47 : I think the most significant conceptual change here is our switch from a "worker" model in the software architecture to a "task" model. In the current release of Augur this code is located at augur/tasks/data_analysis/message_insights

sgoggins avatar Jan 19 '23 16:01 sgoggins

@sgoggins okay, so we are looking to replace the current worker model with an hdpmodel (task model). I don't really know how we get the data or what we do with the analysed data. I have read up on hdpmodel and tried a basic implementation of it, so let's discuss this in depth on Saturday (during the CHAOSS Software meeting).

WhiteWolf47 avatar Jan 19 '23 16:01 WhiteWolf47

@WhiteWolf47 : if you are still interested I'll tag this as a first-timers-only issue which gets you direct support and access to the maintainers.

sgoggins avatar Apr 07 '23 17:04 sgoggins

Turns out this is not the best approach we could come up with! #fixed.

sgoggins avatar Jul 18 '23 11:07 sgoggins