daturkel/learning-papers: Landmark Papers in Machine Learning

Landmark Papers in Machine Learning

This document attempts to collect the papers which developed important techniques in machine learning. Research is a collaborative process, discoveries are made independently, and the difference between the original version and a precursor can be subtle, but I’ve done my best to select the papers that I think are novel or significant.

My opinions are by no means the final word on these topics. Please create an issue or pull request if you have a suggestion.

Landmark Papers in Machine Learning
- Key
- Association Rule Learning
- Datasets
  - Enron
  - ImageNet
- Decision Trees
- Deep Learning
  - AlexNet (image classification CNN)
  - Convolutional Neural Network
  - DeepFace (facial recognition)
  - Generative Adversarial Network
  - GPT
  - Inception (classification/detection CNN)
  - Long Short-Term Memory (LSTM)
  - Residual Neural Network (ResNet)
  - Transformer (sequence to sequence modeling)
  - U-Net (image segmentation CNN)
  - VGG (image recognition CNN)
- Ensemble Methods
  - AdaBoost
  - Bagging
  - Gradient Boosting
  - Random Forest
- Games
  - AlphaGo
  - Deep Blue
- Optimization
  - Adam
  - Expectation Maximization
  - Stochastic Gradient Descent
- Miscellaneous
  - Non-negative Matrix Factorization
  - PageRank
  - DeepQA (Watson)
- Natural Language Processing
  - Latent Dirichlet Allocation
  - Latent Semantic Analysis
  - Word2Vec
- Neural Network Components
  - Autograd
  - Back-propagation
  - Batch Normalization
  - Dropout
  - Gated Recurrent Unit
  - Perceptron
- Recommender Systems
  - Collaborative Filtering
  - Matrix Factorization
  - Implicit Matrix Factorization
- Regression
  - Elastic Net
  - Lasso
- Software
  - MapReduce
  - TensorFlow
  - Torch
- Supervised Learning
  - k-Nearest Neighbors
  - Support Vector Machine
- Statistics
  - The Bootstrap
Credits

Key

Icon
🔒	Paper behind paywall. In some cases, I provide an alternative link to the paper if it comes directly from one of the authors.
🔑	Freely available version of paywalled paper, directly from the author.
💽	Code associated with the paper.
🏛️	Precursor or historically relevant paper. This may be a fundamental breakthrough that paved the way for the concept in question to be developed.
🔬	Iteration, advancement, elaboration, or major popularization of a technique.
📔	Blog post or something other than a formal publication.
🌐	Website associated with the paper.
🎥	Video associated with the paper.
📊	Slides or images associated with the paper.

Papers preceded by “See also” provide additional historical context or represent major developments, breakthroughs, or applications.

Association Rule Learning

Mining Association Rules between Sets of Items in Large Databases (1993), Agrawal, Imielinski, and Swami, @CiteSeerX.
See also: The GUHA method of automatic hypotheses determination (1966), Hájek, Havel, and Chytil, @Springer 🔒 🏛️.

Datasets

Enron

The Enron Corpus: A New Dataset for Email Classification Research (2004), Klimt and Yang, @Springer 🔒 / @author 🔑.
See also: Introducing the Enron Corpus (2004), Klimt and Yang, @author.

ImageNet

ImageNet: A large-scale hierarchical image database (2009), Deng et al., @IEEE 🔒 / @author 🔑.
See also: ImageNet Large Scale Visual Recognition Challenge (2015), Russakovsky et al., @Springer 🔒 / @arXiv 🔑 + @author 🌐.

Decision Trees

Induction of Decision Trees (1986), Quinlan, @Springer.

Deep Learning

AlexNet (image classification CNN)

ImageNet Classification with Deep Convolutional Neural Networks (2012), Krizhevsky, Sutskever, and Hinton, @NIPS.

Convolutional Neural Network

Gradient-based learning applied to document recognition (1998), LeCun, Bottou, Bengio, and Haffner, @IEEE 🔒 / @author 🔑.
See also: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position (1980), Fukushima, @Springer 🏛️.
See also: Phoneme recognition using time-delay neural networks (1989), Waibel, Hanazawa, Hinton, Shikano, and Lang, @IEEE 🏛️.
See also: Fully Convolutional Networks for Semantic Segmentation (2014), Long, Shelhamer, and Darrell, @arXiv.

DeepFace (facial recognition)

DeepFace: Closing the Gap to Human-Level Performance in Face Verification (2014), Taigman, Yang, Ranzato, and Wolf, Facebook Research.

Generative Adversarial Network

Generative Adversarial Nets (2014), Goodfellow et al., @NIPS + @Github 💽.

GPT

Improving Language Understanding by Generative Pre-Training (2018) aka GPT, Radford, Narasimhan, Salimans, and Sutskever, @OpenAI + @Github 💽 + @OpenAI 📔.
See also: Language Models are Unsupervised Multitask Learners (2019) aka GPT-2, Radford, Wu, Child, Luan, Amodei, and Sutskever, @OpenAI 🔬 + @Github 💽 + @OpenAI 📔.
See also: Language Models are Few-Shot Learners (2020) aka GPT-3, Brown et al., @arXiv + @OpenAI 📔.

Inception (classification/detection CNN)

Going Deeper with Convolutions (2014), Szegedy et al., @ai.google + @Github 💽.
See also: Rethinking the Inception Architecture for Computer Vision (2016), Szegedy, Vanhoucke, Ioffe, Shlens, and Wojna, @ai.google 🔬.
See also: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning (2016), Szegedy, Ioffe, Vanhoucke, and Alemi, @ai.google 🔬.

Long Short-Term Memory (LSTM)

Long Short-term Memory (1995), Hochreiter and Schmidhuber, @CiteSeerX.

Residual Neural Network (ResNet)

Deep Residual Learning for Image Recognition (2015), He, Zhang, Ren, and Sun, @arXiv.

Transformer (sequence to sequence modeling)

Attention Is All You Need (2017), Vaswani et al., @NIPS.

U-Net (image segmentation CNN)

U-Net: Convolutional Networks for Biomedical Image Segmentation (2015), Ronneberger, Fischer, and Brox, @Springer 🔒 / @arXiv 🔑.

VGG (image recognition CNN)

Very Deep Convolutional Networks for Large-Scale Image Recognition (2015), Simonyan and Zisserman, @arXiv + @author 🌐 + @ICLR 📊 + @YouTube 🎥.

Ensemble Methods

AdaBoost

A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting (1997—published as abstract in 1995), Freund and Schapire, @CiteSeerX.
See also: Experiments with a New Boosting Algorithm (1996), Freund and Schapire, @CiteSeerX 🔬.

Bagging

Bagging Predictors (1996), Breiman, @Springer.

Gradient Boosting

Greedy function approximation: A gradient boosting machine (2001), Friedman, @Project Euclid.
See also: XGBoost: A Scalable Tree Boosting System (2016), Chen and Guestrin, @arXiv 🔬 + @GitHub 💽.

Random Forest

Random Forests (2001), Breiman, @CiteSeerX.

Games

AlphaGo

Mastering the game of Go with deep neural networks and tree search (2016), Silver et al., @Nature.

Deep Blue

IBM's Deep Blue Chess Grandmaster Chips (1999), Hsu, @IEEE 🔒.
See also: Deep Blue (2002), Campbell, Hoane, and Hsu, @ScienceDirect 🔒.

Optimization

Adam

Adam: A Method for Stochastic Optimization (2015), Kingma and Ba, @arXiv.

Expectation Maximization

Maximum likelihood from incomplete data via the EM algorithm (1977), Dempster, Laird, and Rubin, @CiteSeerX.

Stochastic Gradient Descent

Stochastic Estimation of the Maximum of a Regression Function (1952), Kiefer and Wolfowitz, @ProjectEuclid.
See also: A Stochastic Approximation Method (1951), Robbins and Monro, @ProjectEuclid 🏛️.

Miscellaneous

Non-negative Matrix Factorization

Learning the parts of objects by non-negative matrix factorization (1999), Lee and Seung, @Nature 🔒.

PageRank

The PageRank Citation Ranking: Bringing Order to the Web (1998), Page, Brin, Motwani, and Winograd, @CiteSeerX.

DeepQA (Watson)

Building Watson: An Overview of the DeepQA Project (2010), Ferrucci et al., @AAAI.

Natural Language Processing

Latent Dirichlet Allocation

Latent Dirichlet Allocation (2003), Blei, Ng, and Jordan, @JMLR.

Latent Semantic Analysis

Indexing by latent semantic analysis (1990), Deerwester, Dumais, Furnas, Landauer, and Harshman, @CiteSeerX.

Word2Vec

Efficient Estimation of Word Representations in Vector Space (2013), Mikolov, Chen, Corrado, and Dean, @arXiv + @Google Code 💽.

Neural Network Components

Autograd

Autograd: Effortless Gradients in Numpy (2015), Maclaurin, Duvenaud, and Adams, @ICML + @ICML 📊 + @Github 💽.

Back-propagation

Learning representations by back-propagating errors (1986), Rumelhart, Hinton, and Williams, @Nature 🔒.
See also: Backpropagation Applied to Handwritten Zip Code Recognition (1989), LeCun et al., @IEEE 🔒 🔬 / @author 🔑.

Batch Normalization

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015), Ioffe and Szegedy, @ICML via PMLR.

Dropout

Dropout: A Simple Way to Prevent Neural Networks from Overfitting (2014), Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov, @JMLR.

Gated Recurrent Unit

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (2014), Cho et al., @arXiv.

Perceptron

The Perceptron: A Probabilistic Model for Information Storage and Organization in The Brain (1958), Rosenblatt, @CiteSeerX.

Recommender Systems

Collaborative Filtering

Using collaborative filtering to weave an information tapestry (1992), Goldberg, Nichols, Oki, and Terry, @CiteSeerX.

Matrix Factorization

Application of Dimensionality Reduction in Recommender System - A Case Study (2000), Sarwar, Karypis, Konstan, and Riedl, @CiteSeerX.
See also: Learning Collaborative Information Filters (1998), Billsus and Pazzani, @CiteSeerX 🏛️.
See also: Netflix Update: Try This at Home (2006), Funk, @author 📔 🔬.

Implicit Matrix Factorization

Collaborative Filtering for Implicit Feedback Datasets (2008), Hu, Koren, and Volinsky, @IEEE 🔒 / @author 🔑.

Regression

Elastic Net

Regularization and variable selection via the Elastic Net (2005), Zou and Hastie, @CiteSeer.

Lasso

Regression Shrinkage and Selection Via the Lasso (1994), Tibshirani, @CiteSeerX.
See also: Linear Inversion of Band-Limited Reflection Seismograms (1986), Santosa and Symes, @SIAM 🏛️.

Software

MapReduce

MapReduce: Simplified Data Processing on Large Clusters (2004), Dean and Ghemawat, @ai.google.

TensorFlow

TensorFlow: A system for large-scale machine learning (2016), Abadi et al., @ai.google + @author 🌐.

Torch

Torch: A Modular Machine Learning Software Library (2002), Collobert, Bengio, and Mariéthoz, @Idiap + @author 🌐.
See also: Automatic differentiation in PyTorch (2017), Paszke et al., @OpenReview 🔬 + @Github 💽.

Supervised Learning

k-Nearest Neighbors

Nearest neighbor pattern classification (1967), Cover and Hart, @IEEE 🔒.
See also: E. Fix and J.L. Hodges (1951): An Important Contribution to Nonparametric Discriminant Analysis and Density Estimation (1989), Silverman and Jones, @JSTOR 🔒.

Support Vector Machine

Support Vector Networks (1995), Cortes and Vapnik, @Springer.

Statistics

The Bootstrap

Bootstrap Methods: Another Look at the Jackknife (1979), Efron, @Project Euclid.
See also: Problems in Plane Sampling (1949), Quenouille, @Project Euclid 🏛️.
See also: Notes on Bias in Estimation (1956), Quenouille, @JSTOR 🏛️.
See also: Bias and Confidence in Not-quite Large Samples (1958), Tukey, @Project Euclid 🔬.

Credits

Special thanks to Alexandre Passos for his comment on this Reddit thread, as well as the responders to this Quora post. They suggested many of the papers that got this list off to a great start.
