feat: Complete Topic Modeling System with NMF and Advanced Features

Open xiaoha-cloud opened this issue 3 months ago • 6 comments

Description

  • This PR implements a comprehensive topic modeling versioning system for Augur, replacing LDA with NMF (Non-negative Matrix Factorization) and adding complete model lifecycle management, optimization, comparison, and export capabilities.
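For readers unfamiliar with NMF, the following is a toy, dependency-free sketch of the factorization idea, not the code in this PR. It assumes raw term counts and plain multiplicative updates, whereas the PR describes a TF-IDF + CountVectorizer pipeline in the clustering worker. All function names and the sample documents are illustrative.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    """Squared Frobenius norm of V - W*H."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for rv, rwh in zip(v, wh) for a, b in zip(rv, rwh))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (docs x terms) into W (docs x topics)
    and H (topics x terms) using multiplicative updates."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    # Strictly positive random start keeps every later value non-negative.
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


# Illustrative documents; a real run would use repository messages.
docs = [
    "fix database migration error",
    "database schema migration",
    "train topic model",
    "topic model coherence score",
]
vocab = sorted({term for doc in docs for term in doc.split()})
v = [[doc.split().count(term) for term in vocab] for doc in docs]

w, h = nmf(v, n_topics=2)
for k, row in enumerate(h):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print(f"topic {k}:", [term for _, term in top])
```

Each row of H weights the vocabulary for one topic, and each row of W says how strongly a document loads on each topic. Because all entries stay non-negative, the topics read as additive groups of terms.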

This PR fixes #

Notes for Reviewers

  1. Database Schema Changes:
  • Migration 34: Creates topic_model_meta table with 21 fields (all NOT NULL except repo_id, visualization_data)
  • Migration 35: Creates topic_model_event table for audit logging
  • Comprehensive model versioning with parameters_hash and data_fingerprint for duplicate detection
  2. Core Algorithm Changes:
  • Replaced LDA with NMF (Non-negative Matrix Factorization) in clustering worker
  • Added TF-IDF + CountVectorizer preprocessing pipeline
  • Implemented NMF interpretability scoring and Gensim coherence calculation
  • Smart parameter adjustment for small datasets with fallback logic
  3. API Endpoints:
  • POST /topic-models/{repo_id}/train - Train new models
  • GET /topic-models/{repo_id}/status - Check training status
  • POST /topic-models/{repo_id}/optimize - Data-driven parameter optimization
  • GET /topic-models/{repo_id}/compare - Compare model performance
  • GET /topic-models/{repo_id}/visualization/{model_id} - Export visualization data
  4. Web Interface:
  • Complete topic models listing page with model metrics
  • Detailed model view with NMF quality scores and topic distributions
  • Interactive model comparison with coherence score analysis
  • Model export functionality with JSON download
  • ECharts visualizations for topic distributions and word clouds
  5. Intelligent Automatic Retraining System:
  • 4-Dimensional Retrain Decision Logic: Automatically determines when models need retraining based on:
    • Age: Model age vs. configurable retrain_days threshold (default: 90 days)
    • Parameters: Training parameter changes detected via parameters_hash comparison
    • Quality: Model coherence score vs. configurable quality_threshold (default: 0.3)
    • Data Growth: Message count growth vs. configurable retrain_msg_growth threshold (default: 20%)
  • Automated Workflow: Celery tasks automatically check retrain conditions without manual intervention
  • Smart Decision Making: Uses existing models when all conditions are satisfied, retrains only when necessary
  • Configuration-Driven: All thresholds configurable via config.json or database settings
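
The four retrain checks listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation. The ModelMeta class, the function names, the use of SHA-256 over canonical JSON for parameters_hash, and the strict ">" / "<" comparisons are all hypothetical. Only the threshold names and defaults come from the description.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timedelta


def parameters_hash(params: dict) -> str:
    """Hash training parameters independent of key order (assumed scheme)."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class ModelMeta:
    """Hypothetical subset of the topic_model_meta fields used by the checks."""
    trained_at: datetime
    parameters_hash: str
    coherence_score: float
    message_count: int


def retrain_reasons(meta, current_params, current_message_count, now,
                    retrain_days=90, quality_threshold=0.3,
                    retrain_msg_growth=0.20):
    """Return the list of dimensions that call for retraining (empty = reuse)."""
    reasons = []
    # Age: the model is older than the configured number of days.
    if now - meta.trained_at > timedelta(days=retrain_days):
        reasons.append("age")
    # Parameters: the requested parameters differ from those used in training.
    if parameters_hash(current_params) != meta.parameters_hash:
        reasons.append("parameters")
    # Quality: the coherence score is below the configured threshold.
    if meta.coherence_score < quality_threshold:
        reasons.append("quality")
    # Data growth: relative increase in message count since training.
    if meta.message_count > 0:
        growth = (current_message_count - meta.message_count) / meta.message_count
    else:
        growth = float("inf") if current_message_count > 0 else 0.0
    if growth > retrain_msg_growth:
        reasons.append("data_growth")
    return reasons


params = {"n_topics": 8, "max_features": 1000}  # illustrative values
meta = ModelMeta(datetime(2025, 1, 1), parameters_hash(params), 0.45, 1000)
print(retrain_reasons(meta, params, 1100, now=datetime(2025, 2, 1)))
```

Returning the list of triggered dimensions, rather than a single boolean, would make it straightforward to record why a retrain happened in an audit table such as topic_model_event.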

Signed commits

  • [x] Yes, I signed my commits.

xiaoha-cloud avatar Aug 28 '25 22:08 xiaoha-cloud

just dropping a note that the proposal this is in reference to is https://summerofcode.withgoogle.com/programs/2025/projects/WEvhcxii

MoralCode avatar Sep 11 '25 20:09 MoralCode

@xiaoha-cloud: We are ready to begin testing. Are you available for questions?

sgoggins avatar Oct 16 '25 19:10 sgoggins

@xiaoha-cloud How much damage could someone cause if they were able to access the routes that require auth?

For instances that choose to make their augur frontends public, i'm fairly sure that anyone can just register, and then they would have access to these routes that allow adding/deleting/regenerating topic models (seemingly expensive/destructive actions).

Does Augur have a mechanism for more granular permissions or some kind of role-based access control to ensure that only users designated as augur admins can trigger compute-intensive actions, flood the DB, or delete stuff?

MoralCode avatar Oct 20 '25 20:10 MoralCode

CI end to end test for docker is failing with this error:

augur-db-1      | 2025-10-20 20:39:29.767 UTC [61] ERROR:  invalid input syntax for type bigint: ""
augur-db-1      | 2025-10-20 20:39:29.767 UTC [61] CONTEXT:  COPY subscription_types, line 1, column id: ""
augur-db-1      | 2025-10-20 20:39:29.767 UTC [61] STATEMENT:  COPY augur_operations.subscription_types (id, name) FROM stdin;
augur-db-1      | psql:/docker-entrypoint-initdb.d/augur-new-schema.sql:7384: ERROR:  invalid input syntax for type bigint: ""
augur-db-1      | CONTEXT:  COPY subscription_types, line 1, column id: ""

augur-db-1 exited with code 3

MoralCode avatar Oct 20 '25 20:10 MoralCode

@xiaoha-cloud How much damage could someone cause if they were able to access the routes that require auth?

For instances that choose to make their augur frontends public, i'm fairly sure that anyone can just register, and then they would have access to these routes that allow adding/deleting/regenerating topic models (seemingly expensive/destructive actions).

Does Augur have a mechanism for more granular permissions or some kind of role-based access control to ensure that only users designated as augur admins can trigger compute-intensive actions, flood the DB, or delete stuff?

I took a look at the code and found a couple of security gaps:

Problem 1: Registration vulnerability. In routes.py line 146, the registration accepts an admin parameter from the form. While the frontend login page doesn't have this field, someone could just modify the HTTP request and add admin=true when registering to get admin privileges.

Problem 2: Topic modeling routes don't check permissions. Right now these routes only have @login_required, no admin check. So any registered user can train models, delete models, trigger optimization - which could definitely flood the DB or delete stuff.

My solution: I'm planning to fix this in two steps:

  1. Change that registration line to force admin = False so people can't self-promote to admin through the web UI. Admins should only be created via CLI using augur user add --admin.
  2. Create an @admin_required decorator that checks the current_user.admin field, then apply it to the risky routes like training, deleting, and optimizing models. This way only actual admins can trigger compute-intensive operations.
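
As a rough illustration of the first step, registration could copy only an allow-list of form fields and set admin itself, so a forged admin=true is ignored. The helper name and the field names below are hypothetical and are not taken from Augur's routes.py.

```python
# Hypothetical allow-list; the real registration form fields may differ.
ALLOWED_FIELDS = ("username", "email", "password")


def sanitized_registration(form: dict) -> dict:
    """Keep only expected fields and never trust a client-supplied admin flag."""
    fields = {name: form[name] for name in ALLOWED_FIELDS if name in form}
    fields["admin"] = False  # admins are created out of band (e.g. via the CLI)
    return fields


# A tampered request that tries to self-promote.
forged = {"username": "eve", "password": "s3cret", "admin": "true"}
print(sanitized_registration(forged))
```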

xiaoha-cloud avatar Oct 21 '25 09:10 xiaoha-cloud

After thinking about it, I went with a simpler approach that I think is more practical.

I added an @admin_only decorator that restricts the expensive write operations (training and optimizing models) to admin users only. Read operations like viewing, exporting, and comparing models are still available to any authenticated user. This way we prevent unauthorized users from flooding the database or triggering compute-intensive tasks, while still letting people access and analyze the data.

I decided not to modify the registration system since that's a separate concern and could introduce other issues. This approach keeps things focused and maintainable - if we need more granular permissions later, it's easy to extend.

The code is in commit ed3530bfc, specifically applied to the /topic-models/<repo_id>/train and /topic-models/<repo_id>/optimize routes. Would appreciate review on this approach.
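
For reviewers who want the general shape of such a guard, here is a framework-free sketch. It is not the decorator from commit ed3530bfc. The Forbidden exception, the get_current_user callable, and the view function are stand-ins. In a Flask app the check would more likely read flask_login's current_user and respond with abort(403).

```python
from functools import wraps


class Forbidden(Exception):
    """Stand-in for an HTTP 403 response."""


def admin_only(get_current_user):
    """Build a decorator that allows the call only when the user has admin=True."""
    def decorator(view):
        @wraps(view)
        def wrapper(*args, **kwargs):
            user = get_current_user()
            if user is None or not getattr(user, "admin", False):
                raise Forbidden("admin privileges required")
            return view(*args, **kwargs)
        return wrapper
    return decorator


class User:
    def __init__(self, name, admin=False):
        self.name, self.admin = name, admin


session = {"user": User("alice", admin=True)}  # hypothetical session store


@admin_only(lambda: session["user"])
def train_model(repo_id):
    # Placeholder for the expensive write operation.
    return f"training queued for repo {repo_id}"


print(train_model(1))
```

Read-only views would simply omit the decorator, which matches the split between write and read operations described above.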

xiaoha-cloud avatar Oct 29 '25 11:10 xiaoha-cloud