Machine Learning Practice
Practice projects applying statistical machine learning techniques to various datasets, with notes and from-scratch implementations.
For more details and examples on deep learning, check out my Deep Learning repository.
Because GitHub does not render LaTeX for now, you can use the Google Chrome extension TeX All the Things (GitHub) to read the notes.
Environment
- Using Python 3
(Most relative-path links are relative to the repository root.)
Dependencies
- numpy: For low-level math operations
- pandas: For data manipulation
- sklearn - Scikit Learn: For evaluation metrics, some data preprocessing
For comparison purposes
- sklearn: For machine learning models
- cvxopt: For convex optimization problems (for SVM)
- For gradient boosting
For visualization
- Mlxtend
- matplotlib
- matplotlib.pyplot
- mpl_toolkits.mplot3d
For evaluation
- surprise: A Python scikit for building and analyzing recommender systems
NLP related
- gensim: Topic Modelling
- hmmlearn: Hidden Markov Models in Python, with scikit-learn like API
- jieba: Chinese text segmentation library
- pyHanLP: Chinese NLP library (Python API)
- nltk: Natural Language Toolkit
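To see which of the packages listed above are available in the current Python 3 environment, something like the following sketch may help. The import names in `DEPENDENCIES` are assumptions (for example, Scikit Learn is imported as `sklearn` and pyHanLP as `pyhanlp`), and `check_dependencies` is a hypothetical helper, not part of this repository.

```python
from importlib.util import find_spec

# Assumed import names for the packages listed above; adjust if your
# installation exposes them under different module names.
DEPENDENCIES = [
    "numpy", "pandas", "sklearn", "cvxopt", "mlxtend", "matplotlib",
    "surprise", "gensim", "hmmlearn", "jieba", "pyhanlp", "nltk",
]


def check_dependencies(names=DEPENDENCIES):
    """Return {import_name: found} without actually importing the packages."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

Anything reported as missing is only needed for the projects that use it, so a partial installation is usually enough to run a single notebook.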
Projects
Subject | Technique / Task | Dataset | Solution | Notes |
---|---|---|---|---|
Letter Recognition | kNN / Classification | Letter Recognition Datasets (File) | kNN From Scratch, kNN Scikit Learn | Notes |
Page Blocks Classification | Decision Tree / Classification | Page Blocks Classification Data Set (File) | Decision Tree (CART) From Scratch, Decision Tree Scikit Learn | Notes |
CSM | Linear Regression / Regression | CSM Dataset (2014 and 2015) (File) | Linear Regression From Scratch, Linear Regression Scikit Learn, Linear Regression PyTorch NN | Notes |
Nursery | Naive Bayes / Classification | Nursery Data Set (File) | Gaussian Naive Bayes From Scratch, Gaussian Naive Bayes Scikit Learn | Notes |
Post-Operative Patient | SVM (cvxopt) / Binary Classification | Post-Operative Patient Data Set (File, Simplified) | SVM From Scratch (using cvxopt and simplified dataset), SVM Scikit Learn | Notes |
Student Performance | AdaBoost / Classification | Student Performance Data Set (File) | AdaBoost From Scratch, AdaBoost Scikit Learn | Notes |
Sales Transactions | k-Means / Clustering | Sales Transactions Dataset Weekly (File) | k-Means From Scratch, k-Means Scikit Learn | Notes |
Frequent Itemset Mining | FP-Growth / Frequent Itemsets Mining | Retail Market Basket Data Set (File) | FP-Growth From Scratch | Notes |
Automobile | PCA / Dimensionality Reduction | Automobile Data Set (File) | PCA From Scratch, PCA Scikit Learn | Notes |
Anonymous Microsoft Web Data | SVD / Recommendation System | Anonymous Microsoft Web Data Data Set (File, Ratings Matrix (by R)) | SVD From Scratch, R Notebook - IBCF Recommender System | Notes |
Handwritten Digit | SVM (SMO) / Binary & Multi-class Classification | MNIST (File) | Binary SVM From Scratch, Multi-class (OVR) SVM From Scratch | Notes |
Chinese Text Segmentation | HMM (EM) / Text Segmentation & POS Tagging | File | HMM From Scratch, HMM hmmlearn, Compare with Jieba and HanLP | - |
Document Similarity and LSI | VSM, SVD / LSI | Corpus of the People's Daily (File) | VSM From Scratch, VSM Gensim, SVD/LSI Gensim | Notes |
Click and Conversion Prediction | Logistic Regression / Recommendation System | Ali-CCP (file too large, about 20 GB) | Notes | |
LightGBM & XGBoost & CatBoost Practice | Boosting Tree / Classification | Social Network Ads (File) | LightGBM, XGBoost | Notes |
Kaggle Elo | LightGBM / Feature Engineering | Elo Merchant Category Recommendation | LightGBM Project | |
DCIC 2019 | XGBoost / Feature Engineering | Failure Prediction of Concrete Piston for Concrete Pump Vehicles | XGBoost Project | |
Epinions CLiMF | Collaborative Filtering / Recommendation System | Epinions | CLiMF From Scratch, CLiMF TensorFlow | Notes, PaperResearch |
Iris EM | EM Algorithm / Clustering | Iris Data Set | EM From Scratch | Notes |
Iris Logistic | Logistic Regression / Classification | Iris Data Set | Logistic Regression From Scratch, Logistic Regression Scikit Learn, SVM (used for comparison) | Notes |
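To give a feel for what a "From Scratch" solution in the table above involves, here is a minimal kNN classifier written with numpy only. It is an illustrative sketch, not the code used in the Letter Recognition project; the function name `knn_predict` and the toy data are made up for this example.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k=3):
    """Classify each query row by majority vote among its k nearest
    training points, using Euclidean distance."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_query = np.asarray(X_query, dtype=float)

    # Pairwise distances with shape (n_query, n_train)
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for every query row
    nearest = np.argsort(dists, axis=1)[:, :k]

    predictions = []
    for row in nearest:
        labels, counts = np.unique(y_train[row], return_counts=True)
        # On a tie, argmax picks the first (smallest) label
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)


# Toy data (hypothetical): two well-separated clusters
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]
print(knn_predict(X, y, [[0.2, 0.1], [5.5, 5.5]], k=3))  # expected: [0 1]
```

The Scikit Learn counterparts listed in the table serve as a reference to compare such hand-written versions against.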
Machine Learning Categories
Consider the learning task
- Supervised Learning
  - Classification - Discrete
  - Regression - Continuous
- Unsupervised Learning
  - Clustering - Discrete
  - Dimensionality Reduction - Continuous
  - Association Rule Learning
- Semi-supervised Learning
  - Semi-Clustering
  - Semi-Classification
- Reinforcement Learning
Consider the learning model
- Discriminative Model
  - Discriminative Function
  - Probabilistic Discriminative Model
- Generative Model
Consider the desired output of an ML system
- Classification
  - Logistic Regression (optimization algo.)
    - Multinomial/Softmax Regression (SMR)
  - k-Nearest Neighbors (kNN)
  - Support Vector Machine (SVM) - Derivation (optimization algo.)
  - Naive Bayes
  - Decision Tree (ID3, C4.5, CART)
- Regression
  - Linear Regression - Derivation (optimization algo.)
  - Tree (CART)
- Clustering
  - k-Means
  - Hierarchical Clustering
  - DBSCAN
- Association Rule Learning
  - Apriori
  - Eclat
  - FP-growth - Frequent itemsets mining
- Dimensionality Reduction
  - Principal Component Analysis (PCA)
  - Singular Value Decomposition (SVD) - LSA, LSI, Recommendation System
  - Canonical Correlation Analysis (CCA)
  - Isomap (nonlinear)
  - Locally Linear Embedding (LLE) (nonlinear)
  - Laplacian Eigenmaps (nonlinear)
- Ensemble Method (Meta-algorithm)
  - Bagging
    - Random Forests
  - Boosting
    - AdaBoost <- With some basic boosting notes
    - Gradient Boosting
      - Gradient Boosting Decision Tree (GBDT) (aka. Multiple Additive Regression Tree (MART))
    - XGBoost
    - LightGBM
- NLP Related
  - Hidden Markov Model (HMM) - Sequential Labeling Problem
  - Conditional Random Field (CRF) - Classification Problem (e.g. Sentiment Analysis)
Backbone
- Maximum Entropy Model (MEM)
- Bayesian Network (aka. Probabilistic Directed Acyclic Graphical Model)
Others
- Probabilistic Latent Semantic Analysis (PLSA)
- Latent Dirichlet Allocation (LDA)
- Vector Space Model (VSM)
- Radial Basis Function (RBF) Network
- Isolation Forest
- One-Class SVM
Heuristic Algorithm (Optimization Method)
- SMO --> SVM
- EM --> HMM, etc.
- GIS == improved ==> IIS --> MEM
Machine Learning Concepts
General Case
- Data Preprocessing
  - Normalization
  - Training and Test Sets - Splitting Data
  - Missing Value
  - Dimensionality Reduction
  - Feature Scaling
- Model Expansion
  - Binary to Multi-class
- Fitting and Model Complexity
  - Overfitting
  - Underfitting
  - Generalization
  - Regularization
- Reducing Loss
  - Learning Rate
  - Gradient Descent
- Other Learning Methods
  - Cost-sensitive Learning
  - Lazy Learning
  - Incremental Learning (Online Learning)
  - Multi-label Classification
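The "Reducing Loss" items above (learning rate and gradient descent) can be illustrated with a small sketch: batch gradient descent minimizing mean squared error for a linear model. The function name, default hyperparameters, and toy data are assumptions chosen for this example, not taken from the repository notes.

```python
import numpy as np


def gradient_descent_linear(X, y, learning_rate=0.1, n_iters=1000):
    """Fit y ~ w0 + w1*x1 + ... by batch gradient descent on the MSE loss.

    Returns the weight vector with the bias term first.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xb = np.c_[np.ones(len(X)), X]   # prepend a bias column of ones
    w = np.zeros(Xb.shape[1])

    for _ in range(n_iters):
        residual = Xb @ w - y
        grad = 2.0 / len(y) * Xb.T @ residual  # gradient of the MSE w.r.t. w
        w -= learning_rate * grad              # step against the gradient
    return w


# Toy data (hypothetical) generated from y = 1 + 2x
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [1.0, 3.0, 5.0, 7.0]
print(gradient_descent_linear(X_toy, y_toy))  # approaches [1, 2]
```

The learning rate controls the step size: too large a value makes the updates diverge, while too small a value needs many more iterations to converge.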
Categorized
- Classification
  - Data Preprocessing
    - Label Encoding
  - Real-world Problem
    - Cost-sensitive Learning
    - Classification Imbalance
  - Evaluation Metrics
    - Classification Metrics
  - Binary to Multi-class Expansion
- Regression
  - Evaluation Metrics
    - Regression Metrics
- Clustering
  - Evaluation Metrics
    - Clustering Metrics
Specific Field
- Data Mining - Knowledge Discovery
- Feature Engineering
  - Training optimization
    - Memory usage
    - Evaluation time complexity
- Recommendation System
  - Collaborative Filtering (CF)
- Information Retrieval - Topic Modelling
  - Latent Semantic Analysis (LSA/LSI/SVD)
  - Latent Dirichlet Allocation (LDA)
  - Random Projections (RP)
  - Hierarchical Dirichlet Process (HDP)
  - word2vec
Machine Learning Mathematics
Topic
- Kernel Usages
- Convex Optimization
- Distance/Similarity Measurement - basis of clustering and recommendation system
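As a concrete instance of the distance and similarity measurements mentioned in the last item, here is a minimal sketch of two common choices, Euclidean distance and cosine similarity. The function names are illustrative and not taken from the repository notes.

```python
import numpy as np


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (smaller = more similar)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(a @ b / denom)


print(euclidean_distance([0, 0], [3, 4]))  # 5.0
print(cosine_similarity([1, 0], [0, 1]))   # 0.0
```

Clustering methods such as k-Means typically rely on a distance, while recommendation and document-similarity tasks often use cosine similarity because it ignores vector length.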
Categories
- Linear Algebra
  - Orthogonality
  - Eigenvalues
  - Hessian Matrix
  - Quadratic Form
  - Markov Chain - HMM
- Calculus
  - Multivariable Derivatives
    - Quadratic Approximations
    - Lagrange Multipliers and Constrained Optimization - SVM SMO
  - Lagrange Duality
- Probability and Statistics
  - Statistical Estimation
    - Maximum Likelihood Estimation (MLE)
Basics
- Algebra
- Trigonometry
Application
(from A to Z)
- Decision Tree
  - Entropy
- HMM
  - Markov Chain
- Naive Bayes
  - Bayes' Theorem
- PCA
  - Orthogonal Transformations
  - Eigenvalues
- SVD
  - Eigenvalues
- SVM
  - Convex Optimization
  - Constrained Optimization
  - Lagrange Multipliers
  - Kernel
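To make one of the pairings above concrete (Decision Tree and Entropy), the sketch below computes Shannon entropy and the information gain of a binary split. This is the criterion used by ID3-style trees; the function names and toy labels are assumptions for illustration only.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def information_gain(labels, mask):
    """Entropy reduction from splitting `labels` into mask / not-mask parts."""
    labels = np.asarray(labels)
    mask = np.asarray(mask, dtype=bool)
    n = len(labels)
    parts = (labels[mask], labels[~mask])
    # Weighted average entropy of the non-empty child nodes
    children = sum(len(part) / n * entropy(part) for part in parts if len(part))
    return entropy(labels) - children


toy_labels = [0, 0, 1, 1]  # hypothetical class labels
print(entropy(toy_labels))                                         # 1.0
print(information_gain(toy_labels, [True, True, False, False]))    # 1.0
```

A split that separates the classes perfectly recovers the full parent entropy as gain, while a split that leaves both children as mixed as the parent gains nothing.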
Books Recommendation
Machine Learning
- Machine Learning in Action
- 統計學習方法 (李航)
- 機器學習 (周志華) (alias 西瓜書)
- Python Machine Learning
- Introduction to Machine Learning 3rd
  - Solution Manual
  - Previous version: 1st, 2nd
- Automated Machine Learning: Methods, Systems, Challenges (AutoML)
Mathematics
- Linear Algebra with Applications (Steven Leon)
- Convex Optimization (Stephen Boyd & Lieven Vandenberghe)
- Numerical Linear Algebra (L. Trefethen & D. Bau III)
Resources
Tutorial
Videos
- Google - Machine Learning Recipes with Josh Gordon
- YouTube - Machine Learning Fun and Easy
- Siraj Raval - The Math of Intelligence
- bilibili - 機器學習 - 白板推導系列
- bilibili - 機器學習升級版
Documentations
Interactive Learning
- Google Machine Learning Crash Course
- Learn with Google AI
- Kaggle Learn Machine Learning
- Microsoft Professional Program - Artificial Intelligence track
- Intel AI Developer Program - AI Courses
MOOC
Github
- Machine Learning from Scratch (eriklindernoren/ML-From-Scratch)
- Avik-Jain/100-Days-Of-ML-Code - 100 Days of ML Coding
- ddbourgin/numpy-ml - Machine learning, in numpy
- Machine learning Resources
- microsoft/recommenders - Best Practices on Recommendation Systems
- dformoso/machine-learning-mindmap
Textbook Implementation
- Machine Learning in Action
- Learning From Data (林軒田)
- 統計學習方法 (李航)
- Stanford Andrew Ng CS229
- NTU Hung-Yi Lee
Datasets
- UCI Machine Learning Repository
- Awesome Public Datasets
- Kaggle Datasets
- The MNIST Database of handwritten digits
- 資料集平台 Data Market
- AI Challenger Datasets
- Peking University Open Research Data
- Open Images Dataset
- Alibaba Cloud Tianchi Data Lab
- biendata
Competition
Global
Taiwan
China
Machine Learning Platform
Machine Learning Tool
- AutoML
- Optuna - A hyperparameter optimization framework
- Hyperopt - Distributed Asynchronous Hyper-parameter Optimization
(Online) Development Environment
- Extension plugin - `pip install jupyter_contrib_nbextensions`
  - VIM binding
  - Codefolding
  - ExecuteTime
  - Notify
- Jupyter Theme - `pip install --upgrade jupyterthemes`