feat: Add topic modeling worker with NMF and REPLACE strategy

Open xiaoha-cloud opened this issue 4 weeks ago • 2 comments

Description

  • This PR implements a complete topic modeling system split into worker core and frontend components
  • Implements message-level topic modeling using NMF with REPLACE strategy to prevent data growth
  • Adds 2 frontend routes and 7 API endpoints with proper permission control
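For readers unfamiliar with NMF, the idea can be sketched without any dependencies. This is not the worker's code and not scikit-learn's `NMF`: it is a minimal pure-Python stand-in using multiplicative updates, and the sample messages, the function names, and the topic count are all made up for illustration.

```python
import random
from collections import Counter

def count_matrix(messages):
    """Build a message-by-term count matrix, similar in spirit to a count vectorizer."""
    counts = [Counter(m.lower().split()) for m in messages]
    vocab = sorted(set().union(*counts))
    return [[c[w] for w in vocab] for c in counts], vocab

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def scaled(m, num, den, eps):
    """Element-wise m * num / (den + eps), the multiplicative update step."""
    return [[x * n / (d + eps) for x, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(m, num, den)]

def frobenius_error(v, w, h):
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)) ** 0.5

def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor v (messages x terms) into non-negative w (messages x topics)
    and h (topics x terms) so that w @ h approximates v."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        wt = transpose(w)
        h = scaled(h, matmul(wt, v), matmul(matmul(wt, w), h), eps)
        ht = transpose(h)
        w = scaled(w, matmul(v, ht), matmul(w, matmul(h, ht)), eps)
    return w, h

# Hypothetical commit-style messages, one row per message.
messages = [
    "fix memory leak in worker",
    "fix crash in worker task",
    "update docs for api endpoint",
    "add docs for new api route",
]
v, vocab = count_matrix(messages)
w, h = nmf(v, n_topics=2)

for topic_id, weights in enumerate(h):
    top = sorted(range(len(vocab)), key=lambda j: weights[j], reverse=True)[:3]
    print(topic_id, [vocab[j] for j in top])
```

Each row of `w` gives a message's weight per topic, which is what "message-level" refers to here; each row of `h` gives a topic's weight per vocabulary term.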

Worker Core:

  • Implement message-level topic modeling using NMF (Non-negative Matrix Factorization) with CountVectorizer
  • Add intelligent retraining system based on 4 dimensions (Age, Params, Quality, Data growth)
  • Implement REPLACE strategy to prevent data growth in repo_topic and repo_cluster_messages tables (referenced from insight_worker pattern)
  • Convert raw SQL inserts to ORM bulk_save_objects for better maintainability
  • Store complete vocabulary in visualization_data JSONB field
  • Add model versioning, comparison, and event logging modules
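The four retraining dimensions listed above (Age, Params, Quality, Data growth) could be checked roughly as follows. This is only a sketch: `ModelMeta`, `retrain_reasons`, and every threshold are hypothetical names and values chosen for illustration, not the ones used in the PR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ModelMeta:
    """Hypothetical metadata stored alongside a trained model."""
    trained_at: datetime
    params: dict
    quality: float      # assumed score where higher is better (e.g. coherence)
    n_messages: int     # number of messages seen at training time

def retrain_reasons(meta, current_params, current_n_messages, now,
                    max_age=timedelta(days=30), min_quality=0.4, max_growth=0.2):
    """Return the list of dimensions that call for retraining (empty if none)."""
    reasons = []
    if now - meta.trained_at > max_age:
        reasons.append("age")
    if meta.params != current_params:
        reasons.append("params")
    if meta.quality < min_quality:
        reasons.append("quality")
    growth = (current_n_messages - meta.n_messages) / max(meta.n_messages, 1)
    if growth > max_growth:
        reasons.append("data_growth")
    return reasons

now = datetime(2025, 11, 26)
meta = ModelMeta(trained_at=now - timedelta(days=10),
                 params={"n_topics": 8}, quality=0.6, n_messages=1000)
print(retrain_reasons(meta, {"n_topics": 8}, 1300, now))
```

Returning the list of triggered dimensions, rather than a bare boolean, makes it easy to record why a retrain happened in an event log.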

Frontend:

  • Add 2 frontend routes and 7 API endpoints for topic modeling
  • Implement @login_required for read operations and @admin_only for write operations (train/optimize)
  • Add topic models list and detail page templates
  • Fix dashboard_view to use AugurConfig instead of undefined requestJson
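The read/write permission split can be illustrated with a framework-free sketch. The decorators below are simplified stand-ins that take the user as an explicit argument; the real `@login_required` and `@admin_only` presumably read the user from the web framework's request context, and the view names and the `is_admin` field are assumptions.

```python
from functools import wraps

class PermissionDenied(Exception):
    """Raised when the caller may not run the wrapped view."""

def login_required(view):
    @wraps(view)
    def wrapper(user, *args, **kwargs):
        if user is None:
            raise PermissionDenied("login required")
        return view(user, *args, **kwargs)
    return wrapper

def admin_only(view):
    @wraps(view)
    def wrapper(user, *args, **kwargs):
        if user is None or not user.get("is_admin", False):
            raise PermissionDenied("admin privileges required")
        return view(user, *args, **kwargs)
    return wrapper

@login_required
def list_topic_models(user, repo_id):
    # Read operation: any logged-in user may call it.
    return {"repo_id": repo_id, "models": []}

@admin_only
def train_topic_model(user, repo_id):
    # Write operation: restricted to admins.
    return {"repo_id": repo_id, "status": "queued"}

print(list_topic_models({"name": "alice"}, 1))
print(train_topic_model({"name": "root", "is_admin": True}, 1))
```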

Configuration:

  • Add config.json.example template with environment variable support
  • Fall back to the database configuration if the config file is not present
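The file-then-database lookup might look like the sketch below. The `load_config` and `expand_env` helpers, the `${VAR}` placeholder syntax, and the `db_url` key are assumptions for illustration; the actual `config.json.example` may use a different scheme.

```python
import json
import os

def expand_env(value):
    """Recursively expand $VAR / ${VAR} placeholders in string values."""
    if isinstance(value, str):
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {k: expand_env(v) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v) for v in value]
    return value

def load_config(path, load_from_db):
    """Use the JSON config file when it exists, else call the database loader."""
    if os.path.isfile(path):
        with open(path, encoding="utf-8") as fh:
            return expand_env(json.load(fh))
    return load_from_db()

# Usage: the lambda stands in for whatever reads settings from the database.
config = load_config("config.json", lambda: {"source": "database"})
print(config)
```

Expanding variables after parsing, rather than on the raw file text, keeps environment values containing quotes or backslashes from breaking the JSON.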

The REPLACE strategy deletes old records matching repo_id, tool_source, and tool_version before inserting new data, preventing accumulation of historical topic modeling results.
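A minimal sketch of this delete-then-insert pattern, using an in-memory SQLite database as a stand-in for the real database and ORM. The `replace_topics` helper, the column set of `repo_topic`, and the `tool_source` value are assumptions for illustration.

```python
import sqlite3

def replace_topics(conn, repo_id, tool_source, tool_version, rows):
    """Delete rows matching (repo_id, tool_source, tool_version), then insert
    the new rows, all in one transaction so readers never see a partial state."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "DELETE FROM repo_topic "
            "WHERE repo_id = ? AND tool_source = ? AND tool_version = ?",
            (repo_id, tool_source, tool_version),
        )
        conn.executemany(
            "INSERT INTO repo_topic "
            "(repo_id, topic_id, topic_prob, tool_source, tool_version) "
            "VALUES (?, ?, ?, ?, ?)",
            [(repo_id, t, p, tool_source, tool_version) for t, p in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE repo_topic (repo_id INTEGER, topic_id INTEGER, "
    "topic_prob REAL, tool_source TEXT, tool_version TEXT)"
)

# Running the job twice leaves one set of rows, not two.
replace_topics(conn, 1, "topic_worker", "0.1", [(0, 0.7), (1, 0.3)])
replace_topics(conn, 1, "topic_worker", "0.1", [(0, 0.6), (1, 0.4)])
print(conn.execute("SELECT COUNT(*) FROM repo_topic").fetchone()[0])
```

Because the delete is scoped to the three key columns, rows for other repositories or other tool versions are left untouched.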

This PR fixes #

Notes for Reviewers

REPLACE Strategy Implementation

To solve the data growth issue, I referenced the REPLACE strategy from insight_worker and applied it to:

  • repo_topic table (tasks.py lines 1111-1119)
  • repo_cluster_messages table (tasks.py lines 1149-1157)

The strategy deletes old records matching repo_id, tool_source, and tool_version before inserting new data.

Signed commits

  • [x] Yes, I signed my commits.

xiaoha-cloud avatar Nov 26 '25 18:11 xiaoha-cloud

Is this essentially #3214 but rebased and without the schemas that were already merged?

MoralCode avatar Dec 01 '25 14:12 MoralCode

Actually, this is the worker/core part of #3254, rebased onto the latest main and without the schema changes that were already merged. I intentionally split the original large PR into smaller ones. My plan is to follow up with additional small PRs for the remaining pieces, so they can be reviewed and discussed step by step.

xiaoha-cloud avatar Dec 03 '25 14:12 xiaoha-cloud