augur feat: Complete Topic Modeling System with NMF and Advanced Features

Description

This PR implements a topic model versioning system for Augur. It replaces LDA with NMF (Non-negative Matrix Factorization) and adds model lifecycle management, parameter optimization, model comparison, and export capabilities.

This PR fixes #

Notes for Reviewers

Database Schema Changes:

Migration 34: Creates topic_model_meta table with 21 fields (all NOT NULL except repo_id, visualization_data)
Migration 35: Creates topic_model_event table for audit logging
Comprehensive model versioning with parameters_hash and data_fingerprint for duplicate detection
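
As a rough illustration of the duplicate detection idea above, the sketch below derives a parameters_hash and a data_fingerprint with SHA-256. The function names, the choice of SHA-256, and fingerprinting by message ID are assumptions made for this example; the PR may compute these fields differently.

```python
import hashlib
import json


def parameters_hash(params: dict) -> str:
    """Hash training parameters so that key order does not affect the result."""
    # sort_keys yields one canonical JSON string per logical parameter set.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def data_fingerprint(message_ids: list) -> str:
    """Fingerprint a training corpus by the set of message IDs it contains.

    Assumption: message IDs identify the corpus well enough for this sketch.
    """
    digest = hashlib.sha256()
    # Deduplicate and sort so the fingerprint ignores ordering and repeats.
    for msg_id in sorted(set(message_ids)):
        digest.update(f"{msg_id}\n".encode("utf-8"))
    return digest.hexdigest()


def is_duplicate(params: dict, message_ids: list, existing: set) -> bool:
    """Return True if a model with the same parameters and data already exists.

    `existing` is a set of (parameters_hash, data_fingerprint) pairs, e.g.
    loaded from topic_model_meta.
    """
    return (parameters_hash(params), data_fingerprint(message_ids)) in existing
```

With this shape, a training request whose pair is already recorded can reuse the stored model instead of fitting a new one.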

Core Algorithm Changes:

Replaced LDA with NMF (Non-negative Matrix Factorization) in clustering worker
Added TF-IDF + CountVectorizer preprocessing pipeline
Implemented NMF interpretability scoring and Gensim coherence calculation
Smart parameter adjustment for small datasets with fallback logic
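
The small-dataset fallback could look something like the sketch below. The class name, the function name, and every cutoff (half the document count, one tenth of the documents, the 20-document limit) are hypothetical values chosen for illustration, not the ones used in the clustering worker. The min_df and max_df semantics follow the scikit-learn vectorizer convention: an integer min_df is a document count and a float max_df is a proportion of documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicParams:
    n_topics: int   # number of NMF components
    min_df: int     # a term must appear in at least this many documents
    max_df: float   # a term may appear in at most this fraction of documents


def adjust_for_small_dataset(requested: TopicParams, n_docs: int) -> TopicParams:
    """Relax the requested parameters when the corpus is small (illustrative cutoffs)."""
    if n_docs < 2:
        raise ValueError("need at least 2 documents to fit a topic model")

    # Keep at least 2 topics, but no more than half the number of documents.
    n_topics = max(2, min(requested.n_topics, n_docs // 2))

    # Lower min_df on small corpora so the vocabulary does not end up empty.
    min_df = min(requested.min_df, max(1, n_docs // 10))

    # On very small corpora, stop pruning terms that appear in most documents.
    max_df = 1.0 if n_docs < 20 else requested.max_df

    return TopicParams(n_topics=n_topics, min_df=min_df, max_df=max_df)
```

For a large corpus the requested values pass through unchanged; only small corpora trigger the fallback.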

API Endpoints:

POST /topic-models/{repo_id}/train - Train new models
GET /topic-models/{repo_id}/status - Check training status
POST /topic-models/{repo_id}/optimize - Data-driven parameter optimization
GET /topic-models/{repo_id}/compare - Compare model performance
GET /topic-models/{repo_id}/visualization/{model_id} - Export visualization data
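
For reviewers who want to exercise these routes, the sketch below builds requests for them with the standard library. The base URL is a placeholder, the helper name is made up for this example, and request bodies and authentication are left out because the description above does not specify them.

```python
from urllib.request import Request

# Method and path template for each endpoint listed above.
ENDPOINTS = {
    "train": ("POST", "/topic-models/{repo_id}/train"),
    "status": ("GET", "/topic-models/{repo_id}/status"),
    "optimize": ("POST", "/topic-models/{repo_id}/optimize"),
    "compare": ("GET", "/topic-models/{repo_id}/compare"),
    "visualization": ("GET", "/topic-models/{repo_id}/visualization/{model_id}"),
}


def build_request(base_url: str, action: str, **path_params) -> Request:
    """Build (but do not send) a request for one of the topic model endpoints."""
    method, template = ENDPOINTS[action]
    url = base_url.rstrip("/") + template.format(**path_params)
    return Request(url, method=method)


# Placeholder base URL; substitute the API root of your Augur instance.
req = build_request("http://localhost:5000/api", "train", repo_id=1)
print(req.get_method(), req.full_url)
```

Passing the resulting object to urllib.request.urlopen would send it, provided the instance is running and the session is authenticated.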

Web Interface:

Complete topic models listing page with model metrics
Detailed model view with NMF quality scores and topic distributions
Interactive model comparison with coherence score analysis
Model export functionality with JSON download
ECharts visualizations for topic distributions and word clouds

Intelligent Automatic Retraining System:

4-Dimensional Retrain Decision Logic: Automatically determines when models need retraining based on:
  Age: Model age vs configurable retrain_days threshold (default: 90 days)
  Parameters: Training parameter changes detected via parameters_hash comparison
  Quality: Model coherence score vs configurable quality_threshold (default: 0.3)
  Data Growth: Message count growth vs configurable retrain_msg_growth threshold (default: 20%)
Automated Workflow: Celery tasks automatically check retrain conditions without manual intervention
Smart Decision Making: Uses existing models when all conditions are satisfied, retrains only when necessary
Configuration-Driven: All thresholds configurable via config.json or database settings
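
The four checks above can be sketched as a single function that returns the reasons a model should be retrained. The threshold names and defaults come from the list above; the ModelMeta fields other than parameters_hash, the function name, and the use of strict comparisons at the thresholds are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ModelMeta:
    # Hypothetical subset of what a topic_model_meta row might hold.
    trained_at: datetime
    parameters_hash: str
    coherence_score: float
    message_count: int


def retrain_reasons(
    model: ModelMeta,
    current_params_hash: str,
    current_message_count: int,
    now: datetime,
    retrain_days: int = 90,
    quality_threshold: float = 0.3,
    retrain_msg_growth: float = 0.20,
) -> list:
    """Return the list of triggered retrain conditions (empty means reuse the model)."""
    reasons = []

    # Age: the model is older than the configured number of days.
    if now - model.trained_at > timedelta(days=retrain_days):
        reasons.append("age")

    # Parameters: the requested training parameters differ from the stored ones.
    if model.parameters_hash != current_params_hash:
        reasons.append("parameters")

    # Quality: the coherence score fell below the configured threshold.
    if model.coherence_score < quality_threshold:
        reasons.append("quality")

    # Data growth: relative increase in message count since training.
    if model.message_count > 0:
        growth = (current_message_count - model.message_count) / model.message_count
    else:
        growth = float("inf") if current_message_count > 0 else 0.0
    if growth > retrain_msg_growth:
        reasons.append("data_growth")

    return reasons


now = datetime(2025, 10, 1, tzinfo=timezone.utc)
model = ModelMeta(now - timedelta(days=10), "abc", 0.5, 1000)
print(retrain_reasons(model, "abc", 1300, now))
```

Returning the reasons rather than a bare boolean makes it straightforward to record why a retrain was triggered, for example in topic_model_event.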

Signed commits

[x] Yes, I signed my commits.

Aug 28 '25 22:08 xiaoha-cloud

just dropping a note that the proposal this is in reference to is https://summerofcode.withgoogle.com/programs/2025/projects/WEvhcxii

Sep 11 '25 20:09 MoralCode

@xiaoha-cloud: We are ready to begin testing. Are you available for questions?

Oct 16 '25 19:10 sgoggins

@xiaoha-cloud How much damage could someone cause if they were able to access the routes that require auth?

For instances that choose to make their augur frontends public, i'm fairly sure that anyone can just register, and then they would have access to these routes that allow adding/deleting/regenerating topic models (seemingly expensive/destructive actions).

Does Augur have a mechanism for more granular permissions or some kind of role-based access control to ensure that only users designated as augur admins can trigger compute-intensive actions, flood the DB, or delete stuff?

Oct 20 '25 20:10 MoralCode

CI end to end test for docker is failing with this error:

augur-db-1      | 2025-10-20 20:39:29.767 UTC [61] ERROR:  invalid input syntax for type bigint: ""
augur-db-1      | 2025-10-20 20:39:29.767 UTC [61] CONTEXT:  COPY subscription_types, line 1, column id: ""
augur-db-1      | 2025-10-20 20:39:29.767 UTC [61] STATEMENT:  COPY augur_operations.subscription_types (id, name) FROM stdin;
augur-db-1      | psql:/docker-entrypoint-initdb.d/augur-new-schema.sql:7384: ERROR:  invalid input syntax for type bigint: ""
augur-db-1      | CONTEXT:  COPY subscription_types, line 1, column id: ""

augur-db-1 exited with code 3

Oct 20 '25 20:10 MoralCode

@xiaoha-cloud How much damage could someone cause if they were able to access the routes that require auth?

For instances that choose to make their augur frontends public, i'm fairly sure that anyone can just register, and then they would have access to these routes that allow adding/deleting/regenerating topic models (seemingly expensive/destructive actions).

Does Augur have a mechanism for more granular permissions or some kind of role-based access control to ensure that only users designated as augur admins can trigger compute-intensive actions, flood the DB, or delete stuff?

I took a look at the code and found a couple of security gaps:

Problem 1: Registration vulnerability
In routes.py line 146, the registration accepts an admin parameter from the form. While the frontend login page doesn't have this field, someone could just modify the HTTP request and add admin=true when registering to get admin privileges.

Problem 2: Topic Modeling routes don't check permissions
Right now these routes only have @login_required, no admin check. So any registered user can train models, delete models, trigger optimization - which could definitely flood the DB or delete stuff.

My solution: I'm planning to fix this in two steps:

Change that registration line to force admin = False so people can't self-promote to admin through the web UI. Admins should only be created via CLI using augur user add --admin.
Create an @admin_required decorator that checks the current_user.admin field, then apply it to the risky routes like training, deleting, and optimizing models. This way only actual admins can trigger compute-intensive operations.

Oct 21 '25 09:10 xiaoha-cloud

After thinking about it, I went with a simpler approach that I think is more practical.

I added an @admin_only decorator that restricts the expensive write operations (training and optimizing models) to admin users only. Read operations like viewing, exporting, and comparing models are still available to any authenticated user. This way we prevent unauthorized users from flooding the database or triggering compute-intensive tasks, while still letting people access and analyze the data.

I decided not to modify the registration system since that's a separate concern and could introduce other issues. This approach keeps things focused and maintainable - if we need more granular permissions later, it's easy to extend.

The code is in commit ed3530bfc, specifically applied to the /topic-models/<repo_id>/train and /topic-models/<repo_id>/optimize routes. Would appreciate review on this approach.
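
For readers following the thread, here is a minimal, framework-agnostic sketch of what an admin-only decorator of this kind could look like. It is not the code from commit ed3530bfc: the User class, the Forbidden exception, the session_state dictionary, and the get_current_user callable are stand-ins invented for this example. In a Flask application the user would typically come from flask_login's current_user, and the rejection would usually be an HTTP 403 response.

```python
from dataclasses import dataclass
from functools import wraps


@dataclass
class User:
    # Stand-in for the logged-in user object; only these two fields are used.
    is_authenticated: bool
    admin: bool


class Forbidden(Exception):
    """Raised when a non-admin user calls an admin-only operation."""


def admin_only(get_current_user):
    """Build a decorator that allows the call only for authenticated admins."""
    def decorator(view):
        @wraps(view)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            user = get_current_user()
            if not (user.is_authenticated and user.admin):
                raise Forbidden("admin privileges required")
            return view(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical session holder so the example can run on its own.
session_state = {"user": User(is_authenticated=True, admin=False)}


@admin_only(lambda: session_state["user"])
def train(repo_id):
    """Stand-in for the expensive training route."""
    return f"training started for repo {repo_id}"
```

Read-only handlers would simply omit the decorator, which matches the split between write and read operations described above.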

Oct 29 '25 11:10 xiaoha-cloud