Add mlforgex

Open dhgefergfefruiwefhjhcduc opened this issue 4 months ago • 1 comments

What is mlforgex?

mlforgex is an end-to-end machine learning automation package for Python. It automates the ML workflow, from data preprocessing to model deployment, so you can train, evaluate, and make predictions with little manual setup.

Main Features

Automatic Data Preprocessing: Handles missing values, outliers, duplicate removal, categorical encoding, scaling, and multicollinearity (VIF) detection automatically.
Automatic Problem Detection: Detects whether your task is classification (binary or multiclass), regression, or NLP, with no manual configuration needed.
NLP Pipeline: Full text preprocessing pipeline with tokenization, stopword removal, lemmatization, Word2Vec vectorization, and saved artifacts for reproducible inference.
Imbalanced Data Handling: Auto-detects class imbalance and applies SMOTE or under-sampling inside cross-validation folds to prevent data leakage.
Model Training & Selection: Trains a curated pool of candidate models and automatically selects the best performer using composite scoring (customizable F1/RMSE weights).
Hyperparameter Tuning: RandomizedSearchCV with configurable iterations and cross-validation folds; optional fast mode skips tuning for rapid prototyping.
Interactive Dashboard: A single HTML dashboard (Dashboard.html) aggregates all Plotly visualizations, metrics, the model comparison table, and the run configuration.
Reproducible Artifacts: Saves model, preprocessing pipeline, encoders, Word2Vec models, and metrics for production deployment and full reproducibility.
Visualizations: Correlation heatmap, confusion matrix, ROC/Precision-Recall curves, learning curves, residual plots, feature importance, and word clouds.
CLI & Python API: Two interfaces. Use the command line for quick training or the Python API for programmatic control.

What's the difference between mlforgex and similar AutoML tools?

Unlike generic AutoML packages, mlforgex offers:

1. Unified NLP Support

Most AutoML tools lack native NLP pipelines. mlforgex includes Word2Vec vectorization, text preprocessing, and saved models ready for production inference.

2. Single Interactive Dashboard

Instead of scattered plot files, mlforgex generates a single Dashboard.html with responsive Plotly charts, a model comparison table, and run metadata, so no manual aggregation is needed.

3. Smart Problem Auto-Detection

mlforgex distinguishes classification from regression without user hints. It also detects whether a classification target is binary or multiclass and adjusts the metrics accordingly.
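To make the idea concrete, here is a minimal sketch of how a target column could be classified into a task type. The function name `detect_problem_type`, the `max_classes` threshold, and the rule itself are assumptions for illustration; the issue does not describe the heuristic mlforgex actually uses.

```python
from typing import Sequence


def detect_problem_type(target: Sequence, max_classes: int = 20) -> str:
    """Guess the task type from the target column (hypothetical heuristic)."""
    values = [v for v in target if v is not None]
    distinct = set(values)
    if len(distinct) < 2:
        raise ValueError("target needs at least two distinct values")

    # bool is a subclass of int, so exclude it from the numeric check.
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    has_fraction = numeric and any(not float(v).is_integer() for v in values)

    # Fractional or high-cardinality numeric targets are treated as regression.
    if numeric and (has_fraction or len(distinct) > max_classes):
        return "regression"
    return "binary" if len(distinct) == 2 else "multiclass"


print(detect_problem_type([0, 1, 1, 0]))
print(detect_problem_type(["cat", "dog", "bird"]))
print(detect_problem_type([1.5, 2.25, 3.0]))
```

A cardinality threshold like this is easy to reason about but can misread integer-coded targets with many levels, which is one reason an override option is useful in practice.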

4. Leak-Free Resampling

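Resampling (SMOTE or under-sampling) is applied only to the training portion of each cross-validation fold, so synthetic or removed samples never affect the validation data. Resampling before the split is a common source of data leakage in AutoML pipelines.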
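The fold-level ordering can be sketched with the standard library alone. Note the substitution: plain random oversampling stands in for SMOTE here, and the helper names `k_fold_indices` and `oversample` are hypothetical, not part of mlforgex. The point is only the order of operations: split first, then resample the training indices.

```python
import random
from collections import Counter


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_indices, validation_indices) for k shuffled folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


def oversample(train_indices, labels, seed=0):
    """Duplicate minority rows until each class matches the largest class.

    Random oversampling is used as a simple stand-in for SMOTE.
    """
    rng = random.Random(seed)
    by_class = {}
    for i in train_indices:
        by_class.setdefault(labels[i], []).append(i)
    largest = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=largest - len(members)))
    return balanced


labels = [0] * 90 + [1] * 10  # toy imbalanced target
fold_reports = []
for train_idx, val_idx in k_fold_indices(len(labels), k=5):
    # Resample AFTER the split, using training indices only.
    balanced_idx = oversample(train_idx, labels)
    fold_reports.append({
        "train_counts": Counter(labels[i] for i in balanced_idx),
        "val_counts": Counter(labels[i] for i in val_idx),
        "leak_free": set(balanced_idx).isdisjoint(val_idx),
    })

for report in fold_reports:
    print(report)
```

Each validation fold keeps its original class ratio, which is what makes the resulting scores comparable to performance on unseen data.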

5. Customizable Composite Scoring

Rank models by weighted combinations of metrics (f1_prob, rmse_prob), not just a single metric. Fine-tune model selection to your use case.
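As an illustration of weighted ranking, the sketch below combines two metrics into one score. The formula (a weighted sum of min-max normalized metrics, with RMSE inverted so lower is better) is an assumption; the issue does not state how mlforgex combines its weights, and `composite_rank` and the candidate numbers are made up for the example.

```python
def composite_rank(candidates, weights, lower_is_better=("rmse",)):
    """Rank models by a weighted sum of min-max normalized metrics.

    Hypothetical scoring rule, not the formula mlforgex uses.
    """
    metrics = list(weights)
    lo = {m: min(c[m] for c in candidates.values()) for m in metrics}
    hi = {m: max(c[m] for c in candidates.values()) for m in metrics}

    def normalized(metric, value):
        span = hi[metric] - lo[metric]
        if span == 0:
            return 1.0  # all candidates tie on this metric
        scaled = (value - lo[metric]) / span
        # Flip error-style metrics so that 1.0 always means "best".
        return 1.0 - scaled if metric in lower_is_better else scaled

    scores = {
        name: sum(weights[m] * normalized(m, vals[m]) for m in metrics)
        for name, vals in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up metric values for three hypothetical candidates.
candidates = {
    "model_a": {"f1": 0.80, "rmse": 0.30},
    "model_b": {"f1": 0.90, "rmse": 0.50},
    "model_c": {"f1": 0.70, "rmse": 0.20},
}

f1_heavy = composite_rank(candidates, {"f1": 0.7, "rmse": 0.3})
rmse_heavy = composite_rank(candidates, {"f1": 0.2, "rmse": 0.8})
print([name for name, _ in f1_heavy])
print([name for name, _ in rmse_heavy])
```

Changing the weights reorders the candidates, which is the practical effect a configurable composite score is meant to give.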

6. Fast Mode

The optional --fast flag skips hyperparameter tuning and uses robust defaults, which suits rapid iteration when compute is limited.

7. Complete Artifact Reproducibility

Saves the preprocessing pipeline, encoders, Word2Vec models, and metrics so that predictions on new data go through exactly the same pipeline as training.
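The principle behind saved artifacts can be shown in a few lines: fit preprocessing parameters on training data, persist them, and apply the restored parameters at inference time instead of refitting. This is a generic sketch using `pickle`; the file name `artifacts.pkl`, the helper functions, and the storage format are assumptions and do not reflect mlforgex's actual artifact layout.

```python
import pickle
import statistics
import tempfile
from pathlib import Path


def fit_scaler(rows):
    """Learn per-column mean and standard deviation from training rows."""
    columns = list(zip(*rows))
    return {
        "mean": [statistics.fmean(col) for col in columns],
        # Fall back to 1.0 for constant columns to avoid division by zero.
        "std": [statistics.pstdev(col) or 1.0 for col in columns],
    }


def transform(rows, scaler):
    """Standardize rows with previously fitted parameters."""
    return [
        [(x - m) / s for x, m, s in zip(row, scaler["mean"], scaler["std"])]
        for row in rows
    ]


train_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
artifacts = {"scaler": fit_scaler(train_rows)}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "artifacts.pkl"  # hypothetical file name
    path.write_bytes(pickle.dumps(artifacts))
    restored = pickle.loads(path.read_bytes())

# Inference reuses the stored parameters rather than refitting on new data.
new_rows = [[2.0, 40.0]]
print(transform(new_rows, restored["scaler"]))
```

Refitting the scaler on new data would silently change the feature scale the model was trained on, which is why the fitted parameters are stored alongside the model.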

8. Minimal Configuration, Maximum Control

Sensible defaults work out of the box (one command trains a full pipeline), but advanced users can adjust preprocessing, tuning iterations, cross-validation folds, and NLP settings via flags.

9. No Hidden Magic

The preprocessing steps and model selection logic are clearly documented, so you know exactly what the package is doing at each stage.

10. Production-Ready Output

Generates serialized models, metrics reports, and dashboards that can be deployed to production environments or used directly in data science presentations.

Aug 22 '25 14:08 dhgefergfefruiwefhjhcduc