Machine Learning Practice

Practice exercises applying statistical machine learning techniques to various datasets, with notes and from-scratch implementations.

For more details and examples on deep learning, see my Deep Learning repository.

Because GitHub does not render LaTeX for now, you can use the Google Chrome extension TeX All the Things (github) to read the notes.

Environment

  • Using Python 3

(most relative links are relative to the repository root)

Dependencies

  • numpy: For low-level math operations
  • pandas: For data manipulation
  • sklearn - Scikit Learn: For evaluation metrics and some data preprocessing

For comparison purposes

For visualization

  • Mlxtend
  • matplotlib
    • matplotlib.pyplot
    • mpl_toolkits.mplot3d

For evaluation

  • surprise: A Python scikit for building and analyzing recommender systems

NLP related

  • gensim: Topic Modelling
  • hmmlearn: Hidden Markov Models in Python, with scikit-learn like API
  • jieba: Chinese text segmentation library
  • pyHanLP: Chinese NLP library (Python API)
  • nltk: Natural Language Toolkit

Projects

| Subject | Technique / Task | Dataset | Solution | Notes |
|---|---|---|---|---|
| Letter Recognition | kNN / Classification | Letter Recognition Datasets (File) | kNN From Scratch, kNN Scikit Learn | Notes |
| Page Blocks Classification | Decision Tree / Classification | Page Blocks Classification Data Set (File) | Decision Tree (CART) From Scratch, Decision Tree Scikit Learn | Notes |
| CSM | Linear Regression / Regression | CSM Dataset (2014 and 2015) (File) | Linear Regression From Scratch, Linear Regression Scikit Learn, Linear Regression PyTorch NN | Notes |
| Nursery | Naive Bayes / Classification | Nursery Data Set (File) | Gaussian Naive Bayes From Scratch, Gaussian Naive Bayes Scikit Learn | Notes |
| Post-Operative Patient | SVM (cvxopt) / Binary Classification | Post-Operative Patient Data Set (File, Simplified) | SVM From Scratch (using cvxopt and simplified dataset), SVM Scikit Learn | Notes |
| Student Performance | AdaBoost / Classification | Student Performance Data Set (File) | AdaBoost From Scratch, AdaBoost Scikit Learn | Notes |
| Sales Transactions | k-Means / Clustering | Sales Transactions Dataset Weekly (File) | k-Means From Scratch, k-Means Scikit Learn | Notes |
| Frequent Itemset Mining | FP-Growth / Frequent Itemsets Mining | Retail Market Basket Data Set (File) | FP-Growth From Scratch | Notes |
| Automobile | PCA / Dimensionality Reduction | Automobile Data Set (File) | PCA From Scratch, PCA Scikit Learn | Notes |
| Anonymous Microsoft Web Data | SVD / Recommendation System | Anonymous Microsoft Web Data Data Set (File, Ratings Matrix (by R)) | SVD From Scratch, R Notebook - IBCF Recommender System | Notes |
| Handwriting Digit | SVM (SMO) / Binary & Multi-class Classification | MNIST (File) | Binary SVM From Scratch, Multi-class (OVR) SVM From Scratch | Notes |
| Chinese Text Segmentation | HMM (EM) / Text Segmentation & POS Tagging | File | HMM From Scratch, HMM hmmlearn, Compare with Jieba and HanLP | - |
| Document Similarity and LSI | VSM, SVD / LSI | Corpus of the People's Daily (File) | VSM From Scratch, VSM Gensim, SVD/LSI Gensim | Notes |
| Click and Conversion Prediction | Logistic Regression / Recommendation System | Ali-CCP (File too large, about 20GB) | | Notes |
| LightGBM & XGBoost & CatBoost Practice | Boosting Tree / Classification | Social Network Ads (File) | LightGBM, XGBoost | Notes |
| Kaggle Elo | LightGBM / Feature Engineering | Elo Merchant Category Recommendation | LightGBM | Project |
| DCIC 2019 | XGBoost / Feature Engineering | Failure Prediction of Concrete Piston for Concrete Pump Vehicles | XGBoost | Project |
| Epinions CLiMF | Collaborative Filtering / Recommendation System | Epinions | CLiMF From Scratch, CLiMF TensorFlow | Notes, PaperResearch |
| Iris EM | EM Algorithm / Clustering | Iris Data Set | EM From Scratch | Notes |
| Iris Logistic | Logistic Regression / Classification | Iris Data Set | Logistic Regression From Scratch, Logistic Regression Scikit Learn, SVM (used for comparison) | Notes |
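
To give a flavor of what a "From Scratch" solution can look like, here is a minimal kNN classifier sketch using only the standard library. It is not the code from the Letter Recognition project; the function name `knn_predict` and the toy data are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training sample
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    ]
    # Keep the k closest samples and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical, not the Letter Recognition dataset)
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_X, train_y, (1.1, 1.0)))
print(knn_predict(train_X, train_y, (5.0, 5.2)))
```

kNN is a lazy learner: there is no training step, and all the work happens at prediction time, which is why the whole training set is passed to every call.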

Machine Learning Categories

Consider the learning task

  • Supervised Learning
    • Classification - Discrete
    • Regression - Continuous
  • Unsupervised Learning
    • Clustering - Discrete
    • Dimensionality Reduction - Continuous
    • Association Rule Learning
  • Semi-supervised Learning
    • Semi-Clustering
    • Semi-Classification
  • Reinforcement Learning

Consider the learning model

  • Discriminative Model
    • Discriminative Function
    • Probabilistic Discriminative Model
  • Generative Model

Consider the desired output of an ML system

  • Classification
    • Logistic Regression (optimization algo.)
      • Multinomial/Softmax Regression (SMR)
    • k-Nearest Neighbors (kNN)
    • Support Vector Machine (SVM) - Derivation (optimization algo.)
    • Naive Bayes
    • Decision Tree (ID3, C4.5, CART)
  • Regression
    • Linear Regression - Derivation (optimization algo.)
    • Tree (CART)
  • Clustering
    • k-Means
    • Hierarchical Clustering
    • DBSCAN
  • Association Rule Learning
    • Apriori
    • Eclat
    • FP-growth - Frequent itemsets mining
  • Dimensionality Reduction
    • Principal Component Analysis (PCA)
    • Singular Value Decomposition (SVD) - LSA, LSI, Recommendation System
    • Canonical Correlation Analysis (CCA)
    • Isomap (nonlinear)
    • Locally Linear Embedding (LLE) (nonlinear)
    • Laplacian Eigenmaps (nonlinear)
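
As an illustration of the dimensionality reduction entry, the following is a rough PCA sketch based on the eigen-decomposition of the covariance matrix, assuming numpy is available. The function name `pca` and the toy matrix are hypothetical and are not taken from the Automobile project.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its top principal components.

    Returns the projected data and the variance along each kept component.
    """
    # Center each feature at zero
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (columns)
    cov = np.cov(X_centered, rowvar=False)
    # eigh suits symmetric matrices; it returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Pick the indices of the largest eigenvalues first
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return X_centered @ components, eigvals[order]


# Toy data lying exactly on the line y = 2x, so one component explains everything
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
projected, variances = pca(X, n_components=2)
print(projected.shape, variances)
```

Because the toy points are perfectly collinear, the second eigenvalue is numerically zero, which is the signal that the second dimension can be dropped.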

Ensemble Method (Meta-algorithm)

  • Bagging
    • Random Forests
  • Boosting
    • AdaBoost <- With some basic boosting notes
    • Gradient Boosting
      • Gradient Boosting Decision Tree (GBDT) (aka. Multiple Additive Regression Tree (MART))
    • XGBoost
    • LightGBM

NLP Related

  • Hidden Markov Model (HMM) - Sequential Labeling Problem
  • Conditional Random Field (CRF) - Classification Problem (e.g. Sentiment Analysis)

Backbone

  • Maximum Entropy Model (MEM)
  • Bayesian Network (aka. Probabilistic Directed Acyclic Graphical Model)

Others

  • Probabilistic Latent Semantic Analysis (PLSA)
  • Latent Dirichlet Allocation (LDA)
  • Vector Space Model (VSM)
  • Radial Basis Function (RBF) Network
  • Isolation Forest
  • One-Class SVM

Heuristic Algorithm (Optimization Method)

  • SMO --> SVM
  • EM --> HMM, etc.
  • GIS == improved ==> IIS --> MEM

Machine Learning Concepts

General Case

  • Data Preprocessing
    • Normalization
    • Training and Test Sets - Splitting Data
    • Missing Value
    • Dimensionality Reduction
    • Feature Scaling
  • Model Expansion
    • Binary to Multi-class
  • Fitting and Model Complexity
    • Overfitting
    • Underfitting
    • Generalization
    • Regularization
  • Reducing Loss
    • Learning Rate
    • Gradient Descent
  • Other Learning Method
    • Cost-sensitive Learning
    • Lazy Learning
    • Incremental Learning (Online Learning)
    • Multi-label Classification

Categorized

  • Classification
    • Data Preprocessing
      • Label Encoding
    • Real-world Problem
      • Cost-sensitive Learning
      • Classification Imbalance
    • Evaluation Metrics
      • Classification Metrics
    • Binary to Multi-class Expansion
  • Regression
    • Evaluation Metrics
      • Regression Metrics
  • Clustering
    • Evaluation Metrics
      • Clustering Metrics

Specific Field

  • Data Mining - Knowledge Discovery
  • Feature Engineering
    • Training optimization
      • Memory usage
      • Evaluation time complexity
  • Recommendation System
    • Collaborative Filtering (CF)
  • Information Retrieval - Topic Modelling
    • Latent Semantic Analysis (LSA/LSI/SVD)
    • Latent Dirichlet Allocation (LDA)
    • Random Projections (RP)
    • Hierarchical Dirichlet Process (HDP)
    • word2vec

Machine Learning Mathematics

Topic

  • Kernel Usages
  • Convex Optimization
  • Distance/Similarity Measurement - basis of clustering and recommendation system

Categories

  • Linear Algebra
    • Orthogonality
    • Eigenvalues
    • Hessian Matrix
    • Quadratic Form
    • Markov Chain - HMM
  • Calculus
    • Multivariable Derivatives
      • Quadratic Approximations
      • Lagrange Multipliers and Constrained Optimization - SVM SMO
      • Lagrange Duality
  • Probability and Statistics
    • Statistical Estimation
      • Maximum Likelihood Estimation (MLE)

Basics

  • Algebra
  • Trigonometry

Application

(from A to Z)

  • Decision Tree
    • Entropy
  • HMM
    • Markov Chain
  • Naive Bayes
    • Bayes' Theorem
  • PCA
    • Orthogonal Transformations
    • Eigenvalues
  • SVD
    • Eigenvalues
  • SVM
    • Convex Optimization
    • Constrained Optimization
    • Lagrange Multipliers
    • Kernel
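
To make the Decision Tree and Entropy pairing above concrete, here is a small sketch of Shannon entropy and the information gain of a split, in the style used by ID3. The helper names `entropy` and `information_gain` are illustrative; note that the CART solution in this repository may use a different impurity measure.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, groups):
    """Entropy reduction obtained by splitting `labels` into `groups`."""
    total = len(labels)
    # Weighted average entropy of the child groups after the split
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted


labels = ["a", "a", "b", "b"]
print(entropy(labels))
print(information_gain(labels, [["a", "a"], ["b", "b"]]))
```

A split that separates the classes perfectly removes all the uncertainty, so its information gain equals the entropy of the parent node.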

Books Recommendation

Machine Learning

Mathematics

  • Linear Algebra with Applications (Steven Leon)
  • Convex Optimization (Stephen Boyd & Lieven Vandenberghe)
  • Numerical Linear Algebra (L. Trefethen & D. Bau III)

Resources

Tutorial

Videos

Documentations

Interactive Learning

MOOC

GitHub

Textbook Implementation

Datasets

Competition

Global

Taiwan

China

Machine Learning Platform

Machine Learning Tool

(Online) Development Environment

Jupyter Notebook

  • Extension plugin - pip install jupyter_contrib_nbextensions
    • VIM binding
    • Codefolding
    • ExecuteTime
    • Notify
  • Jupyter Theme - pip install --upgrade jupyterthemes