Repository Discourse Analysis Pipeline Worker
Several workers store machine learning information derived from computational linguistic analysis of the data in the message table. The message table holds messages from issues, pull requests, pull request reviews, and emails. Each message is related to its origin through a bridge table such as pull_request_message_ref. The ML/CL workers all run against every message, regardless of origin.
- Clustering Worker (clusters created and topics modeled)
- Message Analysis Worker (sentiment and novelty analysis)
- Discourse Analysis Worker (speech act classification: question, answer, approval, etc.)
Clustering Worker Notes:
The Clustering Worker uses 2 models.
- Models:
  - Topic modeling, which needs a better way of estimating the number of topics.
    - Tables: repo_topic, topic_words
  - Computational linguistic clustering
    - Tables: repo_cluster_messages
- Key Needs:
  - Add GenSim algorithms to the topic modeling section: https://github.com/chaoss/augur/issues/1199
  - Persist the topics and their associated topic words after each run. At the moment, the topic words are overwritten on each topic modeling run.
  - Describe and optimize the parameters used to create the computational linguistic clusters.
  - Periodically delete models (heuristic: if 3 months pass, or the messages, issues, or PRs in a repo increase by 10%, rebuild the models).
  - Establish some kind of model archiving with appropriate metadata (lower priority).
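The rebuild heuristic in the list above could be sketched roughly as follows. This is only an illustration, not code from the worker: the function name `should_rebuild`, the dictionary keys, and the choice to treat "3 months" as 90 days are all assumptions.

```python
from datetime import datetime, timedelta

# Thresholds from the heuristic in the notes. The names are hypothetical.
MAX_MODEL_AGE = timedelta(days=90)  # "3 months", approximated as 90 days
GROWTH_THRESHOLD = 0.10             # 10% increase in any tracked count


def should_rebuild(model_built_at, now, counts_at_build, counts_now):
    """Return True if a repo's models should be deleted and rebuilt.

    counts_at_build and counts_now are dicts of row counts, for example
    {"messages": 100, "issues": 10, "prs": 10} (hypothetical keys).
    """
    # Condition 1: the model is at least 3 months old.
    if now - model_built_at >= MAX_MODEL_AGE:
        return True

    # Condition 2: any tracked count grew by 10% or more since the build.
    for key, old in counts_at_build.items():
        new = counts_now.get(key, old)
        if old == 0:
            # Any growth from zero counts as exceeding the threshold.
            if new > 0:
                return True
            continue
        if (new - old) / old >= GROWTH_THRESHOLD:
            return True
    return False


built = datetime(2024, 1, 1)
before = {"messages": 100, "issues": 10, "prs": 10}
after = {"messages": 100, "issues": 12, "prs": 10}
print(should_rebuild(built, datetime(2024, 1, 15), before, after))
```

The two conditions are combined with OR, as in the notes, so either an old model or a fast-growing repository is enough to trigger a rebuild.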
Discourse Analysis Worker Notes:
discourse_insights table (select the row with max(data_collection_date) for each msg_id to get the most recent result for each message)
- The message sequence is reassembled from the timestamp in the message table (see msg_timestamp).
- Bridge tables: issues_msg_ref, pull_request_message_ref, pull_request_review_msg_ref
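The lookup described above (latest discourse_insights row per msg_id, ordered by msg_timestamp) might look like the sketch below. It runs against a throwaway in-memory SQLite database rather than Augur's real schema. The tables are reduced to the columns named in these notes, and the `discourse_act` column name, the sample rows, and the function name `latest_discourse_sequence` are assumptions made for illustration.

```python
import sqlite3


def latest_discourse_sequence(conn):
    """Return (msg_id, msg_timestamp, discourse_act) rows, one per message,
    using the most recently collected insight, in message-timestamp order."""
    sql = """
    SELECT m.msg_id, m.msg_timestamp, di.discourse_act
    FROM discourse_insights AS di
    JOIN (
        -- most recent collection date for each message
        SELECT msg_id, MAX(data_collection_date) AS latest
        FROM discourse_insights
        GROUP BY msg_id
    ) AS newest
      ON newest.msg_id = di.msg_id
     AND newest.latest = di.data_collection_date
    JOIN message AS m ON m.msg_id = di.msg_id
    ORDER BY m.msg_timestamp
    """
    return conn.execute(sql).fetchall()


# Simplified stand-in tables with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE message (msg_id INTEGER PRIMARY KEY, msg_timestamp TEXT);
CREATE TABLE discourse_insights (
    msg_id INTEGER,
    discourse_act TEXT,          -- hypothetical column name
    data_collection_date TEXT
);
INSERT INTO message VALUES
    (1, '2021-01-01 10:00:00'),
    (2, '2021-01-01 09:00:00');
INSERT INTO discourse_insights VALUES
    (1, 'question', '2021-02-01'),
    (1, 'answer',   '2021-03-01'),
    (2, 'approval', '2021-02-01');
""")

for row in latest_discourse_sequence(conn):
    print(row)
```

To restrict the sequence to a single issue or pull request, the same query would additionally join through the matching bridge table listed above.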
Message Analysis Worker Notes:
- message_analysis
- message_analysis_summary
