feat: Add topic modeling worker with NMF and REPLACE strategy
Description
- This PR implements a complete topic modeling system split into worker core and frontend components
- Implements message-level topic modeling using NMF with REPLACE strategy to prevent data growth
- Adds 2 frontend routes and 7 API endpoints with proper permission control
Worker Core:
- Implement message-level topic modeling using NMF (Non-negative Matrix Factorization) with CountVectorizer
- Add intelligent retraining system based on 4 dimensions (Age, Params, Quality, Data growth)
- Implement REPLACE strategy to prevent data growth in repo_topic and repo_cluster_messages tables (referenced from insight_worker pattern)
- Convert raw SQL inserts to ORM bulk_save_objects for better maintainability
- Store complete vocabulary in visualization_data JSONB field
- Add model versioning, comparison, and event logging modules
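For reviewers, the four-dimension retraining check can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: `ModelSnapshot`, `retrain_reasons`, the `coherence` metric, and every threshold value are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ModelSnapshot:
    """Hypothetical record of the last trained model (not the PR's schema)."""
    trained_at: datetime
    params: dict
    coherence: float      # assumed quality metric, higher is better
    message_count: int    # number of messages the model was trained on


def retrain_reasons(
    snapshot: ModelSnapshot,
    current_params: dict,
    current_message_count: int,
    now: Optional[datetime] = None,
    max_age: timedelta = timedelta(days=30),  # assumed threshold
    min_coherence: float = 0.4,               # assumed threshold
    max_growth: float = 0.2,                  # assumed threshold: 20% more messages
) -> List[str]:
    """Return the dimensions that call for retraining (empty list = keep model)."""
    now = now or datetime.now()
    reasons = []
    # Age: the model is older than the allowed maximum.
    if now - snapshot.trained_at > max_age:
        reasons.append("age")
    # Params: the requested configuration differs from the trained one.
    if snapshot.params != current_params:
        reasons.append("params")
    # Quality: the stored quality score is below the floor.
    if snapshot.coherence < min_coherence:
        reasons.append("quality")
    # Data growth: relative increase in messages since training.
    baseline = max(snapshot.message_count, 1)
    if (current_message_count - snapshot.message_count) / baseline > max_growth:
        reasons.append("data_growth")
    return reasons
```

Returning the list of triggered dimensions, rather than a bare boolean, would make it straightforward to record why a retrain happened in the event log.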
Frontend:
- Add 2 frontend routes and 7 API endpoints for topic modeling
- Implement @login_required for read operations and @admin_only for write operations (train/optimize)
- Add topic models list and detail page templates
- Fix dashboard_view to use AugurConfig instead of undefined requestJson
Configuration:
- Add config.json.example template with environment variable support
- Fall back to database configuration if the config file is not present
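The configuration lookup order described above could look something like this. It is a sketch under assumptions: `load_topic_config`, `expand_env`, and the `db_loader` callback are hypothetical names, and environment variable support is assumed to mean `$VAR` / `${VAR}` expansion inside string values.

```python
import json
import os
from pathlib import Path
from typing import Callable


def expand_env(value):
    """Recursively expand $VAR / ${VAR} references in string values."""
    if isinstance(value, str):
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {k: expand_env(v) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v) for v in value]
    return value


def load_topic_config(path: str, db_loader: Callable[[], dict]) -> dict:
    """Prefer the JSON config file; fall back to the database when it is absent.

    db_loader is a hypothetical callback standing in for the database lookup.
    """
    config_file = Path(path)
    if config_file.is_file():
        with config_file.open(encoding="utf-8") as fh:
            return expand_env(json.load(fh))
    return db_loader()
```

Note that `os.path.expandvars` leaves references to unset variables unchanged, so a missing variable shows up as a literal `${NAME}` in the loaded config rather than raising an error.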
The REPLACE strategy deletes old records matching repo_id, tool_source, and tool_version before inserting new data, preventing accumulation of historical topic modeling results.
This PR fixes #
Notes for Reviewers
REPLACE Strategy Implementation
To solve the data growth issue, I referenced the REPLACE strategy from insight_worker and applied it to:
- repo_topic table (tasks.py lines 1111-1119)
- repo_cluster_messages table (tasks.py lines 1149-1157)
The strategy deletes old records matching repo_id, tool_source, and tool_version before inserting new data.
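To make the delete-then-insert behavior concrete, here is a minimal, self-contained sketch. It uses the standard-library `sqlite3` module and a simplified `repo_topic` table instead of the ORM models and `bulk_save_objects` call used in this PR; the function name `replace_topics`, the column set, and the sample tool name and version are assumptions for illustration only.

```python
import sqlite3


def replace_topics(conn, repo_id, tool_source, tool_version, rows):
    """Delete this tool's previous rows for the repo, then insert the new ones.

    rows: iterable of (topic_id, topic_prob) tuples.
    """
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "DELETE FROM repo_topic "
            "WHERE repo_id = ? AND tool_source = ? AND tool_version = ?",
            (repo_id, tool_source, tool_version),
        )
        conn.executemany(
            "INSERT INTO repo_topic "
            "(repo_id, topic_id, topic_prob, tool_source, tool_version) "
            "VALUES (?, ?, ?, ?, ?)",
            [(repo_id, t, p, tool_source, tool_version) for t, p in rows],
        )


# Simplified stand-in for the real table (assumed columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE repo_topic (repo_id INTEGER, topic_id INTEGER, "
    "topic_prob REAL, tool_source TEXT, tool_version TEXT)"
)

# Two training runs for the same repo and tool version.
replace_topics(conn, 1, "topic_worker", "0.1", [(0, 0.7), (1, 0.3)])
replace_topics(conn, 1, "topic_worker", "0.1", [(0, 0.6), (1, 0.4)])

# Only the rows from the second run remain.
count = conn.execute("SELECT COUNT(*) FROM repo_topic").fetchone()[0]
```

Running the delete and the insert in a single transaction matters here: if the insert fails, the rollback restores the previous results instead of leaving the repo with no topic rows.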
Signed commits
- [x] Yes, I signed my commits.
Is this essentially #3214 but rebased and without the schemas that were already merged?
Actually, this is the worker/core part of #3254, rebased onto the latest main and without the schema changes that were already merged. I intentionally split the original big PR into smaller ones. My plan is to follow up with additional small PRs for the remaining pieces, so they can be reviewed and discussed step by step.