augur Repository Discourse Analysis Pipeline Worker

Open sgoggins opened this issue 3 years ago • 0 comments

We have several workers that store machine learning information derived from computational linguistic analysis of data in the message table. The message table includes messages from issues, pull requests, pull request reviews, and email. Each message is related to its origin through bridge tables like pull_request_message_ref. The ML/CL workers all run against every message, regardless of origin.

- Clustering worker (clusters created and topics modeled)
- Message analysis worker (sentiment and novelty analysis)
- Discourse analysis worker (speech act classification: question, answer, approval, etc.)

Clustering Worker Notes:

Clustering Worker: 2 Models.

Models:
- Topic modeling, but it needs a better way of estimating the number of topics.
  Tables: repo_topic, topic_words
- Computational linguistic clustering
  Tables: repo_cluster_messages
Key Needs:
- Add GenSim algorithms to topic modeling section: https://github.com/chaoss/augur/issues/1199
- The topics and their associated topic words need to be persisted after each run. At the moment, the topic words get overwritten on each topic modeling run.
- Describe and optimize the parameters used to create the computational linguistic clusters.
- Periodic deletion of models (heuristic: if 3 months pass, OR there’s a 10% increase in the messages, issues, or PRs in a repo, rebuild the models).
- Establish some kind of model archiving with appropriate metadata (lower priority).
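
The rebuild heuristic above could be sketched roughly as follows. This is only an illustration, not existing Augur code: `RepoCounts`, `needs_rebuild`, and their parameters are hypothetical names, "3 months" is approximated as 90 days, and the 10% rule is read as "any one of the three counts grew by at least 10% since the model was trained".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RepoCounts:
    """Per-repo counts at a point in time (hypothetical structure)."""
    messages: int
    issues: int
    pull_requests: int


def needs_rebuild(trained_at, counts_at_training, counts_now, now,
                  max_age=timedelta(days=90), growth_threshold=0.10):
    """True if the model is at least max_age old OR any count grew by growth_threshold."""
    # Age rule: "3 months" is approximated as 90 days (assumption).
    if now - trained_at >= max_age:
        return True
    pairs = [
        (counts_at_training.messages, counts_now.messages),
        (counts_at_training.issues, counts_now.issues),
        (counts_at_training.pull_requests, counts_now.pull_requests),
    ]
    for old, new in pairs:
        if old == 0:
            # No baseline to compare against: treat any new data as growth.
            if new > 0:
                return True
            continue
        if (new - old) / old >= growth_threshold:
            return True
    return False


trained = datetime(2021, 7, 7)
baseline = RepoCounts(messages=1000, issues=100, pull_requests=50)
grown = RepoCounts(messages=1000, issues=120, pull_requests=50)
stale = needs_rebuild(trained, baseline, baseline, trained + timedelta(days=91))
fresh = needs_rebuild(trained, baseline, baseline, trained + timedelta(days=10))
busy = needs_rebuild(trained, baseline, grown, trained + timedelta(days=10))
```

Storing the counts alongside the model when it is trained would also give the archiving item some of the metadata it asks for.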

Discourse Analysis Worker Notes:

discourse_insights table: select max(data_collection_date) for each msg_id to get the most recently collected row per message.
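
A minimal sketch of that "latest row per msg_id" lookup, run against an in-memory SQLite table so it is self-contained. Only `discourse_insights`, `msg_id`, and `data_collection_date` come from the notes above; the `discourse_act` column and the sample rows are assumptions for illustration, and the real table lives in Augur's own database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real table; discourse_act is an assumed column.
conn.execute("""
    CREATE TABLE discourse_insights (
        msg_id INTEGER,
        discourse_act TEXT,
        data_collection_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO discourse_insights VALUES (?, ?, ?)",
    [
        (1, "question", "2021-06-01 00:00:00"),
        (1, "answer", "2021-07-01 00:00:00"),  # later run for the same message
        (2, "approval", "2021-06-15 00:00:00"),
    ],
)

# Keep only the row with the newest data_collection_date for each msg_id.
LATEST_PER_MESSAGE = """
    SELECT di.msg_id, di.discourse_act, di.data_collection_date
    FROM discourse_insights AS di
    JOIN (
        SELECT msg_id, MAX(data_collection_date) AS latest
        FROM discourse_insights
        GROUP BY msg_id
    ) AS newest
      ON newest.msg_id = di.msg_id
     AND newest.latest = di.data_collection_date
    ORDER BY di.msg_id
"""
latest_rows = conn.execute(LATEST_PER_MESSAGE).fetchall()
```

If two rows for one message share the same data_collection_date, this query returns both, so a tie-breaker would be needed in that case.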

The message sequence is reassembled from the timestamp in the message table (look at msg_timestamp).
Bridge tables: issues_msg_ref, pull_request_message_ref, pull_request_review_msg_ref
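
One way the reassembly could look, using plain dictionaries in place of database rows. `reassemble_threads` is a hypothetical helper, and the `issue_id` key on the bridge rows is an assumed column name; only `msg_id` and `msg_timestamp` come from the notes above.

```python
from collections import defaultdict
from datetime import datetime


def reassemble_threads(messages, refs, origin_key):
    """Group messages by their origin and order each group by msg_timestamp."""
    by_id = {m["msg_id"]: m for m in messages}
    threads = defaultdict(list)
    for ref in refs:
        msg = by_id.get(ref["msg_id"])
        if msg is not None:  # skip bridge rows that point at a missing message
            threads[ref[origin_key]].append(msg)
    for thread in threads.values():
        thread.sort(key=lambda m: m["msg_timestamp"])
    return dict(threads)


# Rows as they might come back from the message table (sample data).
messages = [
    {"msg_id": 3, "msg_timestamp": datetime(2021, 7, 3)},
    {"msg_id": 1, "msg_timestamp": datetime(2021, 7, 1)},
    {"msg_id": 2, "msg_timestamp": datetime(2021, 7, 2)},
]
# Rows as they might come back from an issue bridge table (sample data).
issue_refs = [
    {"issue_id": 10, "msg_id": 3},
    {"issue_id": 10, "msg_id": 1},
    {"issue_id": 11, "msg_id": 2},
]
threads = reassemble_threads(messages, issue_refs, "issue_id")
issue_10_order = [m["msg_id"] for m in threads[10]]
```

The same helper would apply to the pull request and pull request review bridge tables by passing a different `origin_key`.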

Message Analysis Worker Notes:

Tables: message_analysis, message_analysis_summary

Jul 07 '21 16:07 sgoggins
