Landmark Papers in Machine Learning
This document attempts to collect the papers which developed important techniques in machine learning. Research is a collaborative process, discoveries are made independently, and the difference between the original version and a precursor can be subtle, but I've done my best to select the papers that I think are novel or significant.
My opinions are by no means the final word on these topics. Please create an issue or pull request if you have a suggestion.
- Landmark Papers in Machine Learning
  - Key
  - Association Rule Learning
  - Datasets
    - Enron
    - ImageNet
  - Decision Trees
  - Deep Learning
    - AlexNet (image classification CNN)
    - Convolutional Neural Network
    - DeepFace (facial recognition)
    - Generative Adversarial Network
    - GPT
    - Inception (classification/detection CNN)
    - Long Short-Term Memory (LSTM)
    - Residual Neural Network (ResNet)
    - Transformer (sequence to sequence modeling)
    - U-Net (image segmentation CNN)
    - VGG (image recognition CNN)
  - Ensemble Methods
    - AdaBoost
    - Bagging
    - Gradient Boosting
    - Random Forest
  - Games
    - AlphaGo
    - Deep Blue
  - Optimization
    - Adam
    - Expectation Maximization
    - Stochastic Gradient Descent
  - Miscellaneous
    - Non-negative Matrix Factorization
    - PageRank
    - DeepQA (Watson)
  - Natural Language Processing
    - Latent Dirichlet Allocation
    - Latent Semantic Analysis
    - Word2Vec
  - Neural Network Components
    - Autograd
    - Back-propagation
    - Batch Normalization
    - Dropout
    - Gated Recurrent Unit
    - Perceptron
  - Recommender Systems
    - Collaborative Filtering
    - Matrix Factorization
    - Implicit Matrix Factorization
  - Regression
    - Elastic Net
    - Lasso
  - Software
    - MapReduce
    - TensorFlow
    - Torch
  - Supervised Learning
    - k-Nearest Neighbors
    - Support Vector Machine
  - Statistics
    - The Bootstrap
  - Credits
Key
Icon | Meaning |
---|---|
🔒 | Paper behind paywall. In some cases, I provide an alternative link to the paper if it comes directly from one of the authors. |
🔑 | Freely available version of paywalled paper, directly from the author. |
💽 | Code associated with the paper. |
🏛️ | Precursor or historically relevant paper. This may be a fundamental breakthrough that paved the way for the concept in question to be developed. |
🔬 | Iteration, advancement, elaboration, or major popularization of a technique. |
📔 | Blog post or something other than a formal publication. |
🌐 | Website associated with the paper. |
🎥 | Video associated with the paper. |
📊 | Slides or images associated with the paper. |
Papers preceded by "See also" indicate either additional historical context or else major developments, breakthroughs, or applications.
Association Rule Learning
- Mining Association Rules between Sets of Items in Large Databases (1993), Agrawal, Imielinski, and Swami, @CiteSeerX.
- See also: The GUHA method of automatic hypotheses determination (1966), Hájek, Havel, and Chytil, @Springer 🔒 🏛️.
Datasets
Enron
- The Enron Corpus: A New Dataset for Email Classification Research (2004), Klimt and Yang, @Springer 🔒 / @author 🔑.
- See also: Introducing the Enron Corpus (2004), Klimt and Yang, @author.
ImageNet
- ImageNet: A large-scale hierarchical image database (2009), Deng et al., @IEEE 🔒 / @author 🔑.
- See also: ImageNet Large Scale Visual Recognition Challenge (2015), Russakovsky et al., @Springer 🔒 / @arXiv 🔑 + @author 🌐.
Decision Trees
- Induction of Decision Trees (1986), Quinlan, @Springer.
Deep Learning
AlexNet (image classification CNN)
- ImageNet Classification with Deep Convolutional Neural Networks (2012), Krizhevsky, Sutskever, and Hinton, @NIPS.
Convolutional Neural Network
- Gradient-based learning applied to document recognition (1998), LeCun, Bottou, Bengio, and Haffner, @IEEE 🔒 / @author 🔑.
- See also: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position (1980), Fukushima, @Springer 🏛️.
- See also: Phoneme recognition using time-delay neural networks (1989), Waibel, Hanazawa, Hinton, Shikano, and Lang, @IEEE 🏛️.
- See also: Fully Convolutional Networks for Semantic Segmentation (2014), Long, Shelhamer, and Darrell, @arXiv.
DeepFace (facial recognition)
- DeepFace: Closing the Gap to Human-Level Performance in Face Verification (2014), Taigman, Yang, Ranzato, and Wolf, Facebook Research.
Generative Adversarial Network
GPT
- Improving Language Understanding by Generative Pre-Training (2018) aka GPT, Radford, Narasimhan, Salimans, and Sutskever, @OpenAI + @GitHub 💽 + @OpenAI 📔.
- See also: Language Models are Unsupervised Multitask Learners (2019) aka GPT-2, Radford, Wu, Child, Luan, Amodei, and Sutskever, @OpenAI 🔬 + @GitHub 💽 + @OpenAI 📔.
- See also: Language Models are Few-Shot Learners (2020) aka GPT-3, Brown et al., @arXiv + @OpenAI 📔.
Inception (classification/detection CNN)
- Going Deeper with Convolutions (2014), Szegedy et al., @ai.google + @GitHub 💽.
- See also: Rethinking the Inception Architecture for Computer Vision (2016), Szegedy, Vanhoucke, Ioffe, Shlens, and Wojna, @ai.google 🔬.
- See also: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning (2016), Szegedy, Ioffe, Vanhoucke, and Alemi, @ai.google 🔬.
Long Short-Term Memory (LSTM)
- Long Short-term Memory (1995), Hochreiter and Schmidhuber, @CiteSeerX.
Residual Neural Network (ResNet)
- Deep Residual Learning for Image Recognition (2015), He, Zhang, Ren, and Sun, @arXiv.
Transformer (sequence to sequence modeling)
- Attention Is All You Need (2017), Vaswani et al., @NIPS.
U-Net (image segmentation CNN)
- U-Net: Convolutional Networks for Biomedical Image Segmentation (2015), Ronneberger, Fischer, and Brox, @Springer 🔒 / @arXiv 🔑.
VGG (image recognition CNN)
- Very Deep Convolutional Networks for Large-Scale Image Recognition (2015), Simonyan and Zisserman, @arXiv + @author 🌐 + @ICLR 📊 + @YouTube 🎥.
Ensemble Methods
AdaBoost
- A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting (1997; published as abstract in 1995), Freund and Schapire, @CiteSeerX.
- See also: Experiments with a New Boosting Algorithm (1996), Freund and Schapire, @CiteSeerX 🔬.
Bagging
- Bagging Predictors (1996), Breiman, @Springer.
Gradient Boosting
- Greedy function approximation: A gradient boosting machine (2001), Friedman, @Project Euclid.
- See also: XGBoost: A Scalable Tree Boosting System (2016), Chen and Guestrin, @arXiv 🔬 + @GitHub 💽.
Random Forest
- Random Forests (2001), Breiman, @CiteSeerX.
Games
AlphaGo
- Mastering the game of Go with deep neural networks and tree search (2016), Silver et al., @Nature.
Deep Blue
- IBM's deep blue chess grandmaster chips (1999), Hsu, @IEEE 🔒.
- See also: Deep Blue (2002), Campbell, Hoane, and Hsu, @ScienceDirect 🔒.
Optimization
Adam
- Adam: A Method for Stochastic Optimization (2015), Kingma and Ba, @arXiv.
Expectation Maximization
- Maximum likelihood from incomplete data via the EM algorithm (1977), Dempster, Laird, and Rubin, @CiteSeerX.
Stochastic Gradient Descent
- Stochastic Estimation of the Maximum of a Regression Function (1952), Kiefer and Wolfowitz, @ProjectEuclid.
- See also: A Stochastic Approximation Method (1951), Robbins and Monro, @ProjectEuclid 🏛️.
Miscellaneous
Non-negative Matrix Factorization
- Learning the parts of objects by non-negative matrix factorization (1999), Lee and Seung, @Nature 🔒.
PageRank
- The PageRank Citation Ranking: Bringing Order to the Web (1998), Page, Brin, Motwani, and Winograd, @CiteSeerX.
DeepQA (Watson)
- Building Watson: An Overview of the DeepQA Project (2010), Ferrucci et al., @AAAI.
Natural Language Processing
Latent Dirichlet Allocation
- Latent Dirichlet Allocation (2003), Blei, Ng, and Jordan, @JMLR.
Latent Semantic Analysis
- Indexing by latent semantic analysis (1990), Deerwester, Dumais, Furnas, Landauer, and Harshman, @CiteSeerX.
Word2Vec
- Efficient Estimation of Word Representations in Vector Space (2013), Mikolov, Chen, Corrado, and Dean, @arXiv + @Google Code 💽.
Neural Network Components
Autograd
Back-propagation
- Learning representations by back-propagating errors (1986), Rumelhart, Hinton, and Williams, @Nature 🔒.
- See also: Backpropagation Applied to Handwritten Zip Code Recognition (1989), LeCun et al., @IEEE 🔒 🔬 / @author 🔑.
Batch Normalization
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015), Ioffe and Szegedy, @ICML via PMLR.
Dropout
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting (2014), Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov, @JMLR.
Gated Recurrent Unit
- Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (2014), Cho et al., @arXiv.
Perceptron
- The Perceptron: A Probabilistic Model for Information Storage and Organization in The Brain (1958), Rosenblatt, @CiteSeerX.
Recommender Systems
Collaborative Filtering
- Using collaborative filtering to weave an information tapestry (1992), Goldberg, Nichols, Oki, and Terry, @CiteSeerX.
Matrix Factorization
- Application of Dimensionality Reduction in Recommender System - A Case Study (2000), Sarwar, Karypis, Konstan, and Riedl, @CiteSeerX.
- See also: Learning Collaborative Information Filters (1998), Billsus and Pazzani, @CiteSeerX 🏛️.
- See also: Netflix Update: Try This at Home (2006), Funk, @author 📔 🔬.
Implicit Matrix Factorization
- Collaborative Filtering for Implicit Feedback Datasets (2008), Hu, Koren, and Volinsky, @IEEE 🔒 / @author 🔑.
Regression
Elastic Net
- Regularization and variable selection via the Elastic Net (2005), Zou and Hastie, @CiteSeer.
Lasso
- Regression Shrinkage and Selection Via the Lasso (1994), Tibshirani, @CiteSeerX.
- See also: Linear Inversion of Band-Limited Reflection Seismograms (1986), Santosa and Symes, @SIAM 🏛️.
Software
MapReduce
- MapReduce: Simplified Data Processing on Large Clusters (2004), Dean and Ghemawat, @ai.google.
TensorFlow
- TensorFlow: A system for large-scale machine learning (2016), Abadi et al., @ai.google + @author 🌐.
Torch
- Torch: A Modular Machine Learning Software Library (2002), Collobert, Bengio and Mariéthoz, @Idiap + @author 🌐.
- See also: Automatic differentiation in PyTorch (2017), Paszke et al., @OpenReview 🔬 + @GitHub 💽.
Supervised Learning
k-Nearest Neighbors
- Nearest neighbor pattern classification (1967), Cover and Hart, @IEEE 🔒.
- See also: E. Fix and J.L. Hodges (1951): An Important Contribution to Nonparametric Discriminant Analysis and Density Estimation (1989), Silverman and Jones, @JSTOR 🔒.
Support Vector Machine
- Support Vector Networks (1995), Cortes and Vapnik, @Springer.
Statistics
The Bootstrap
- Bootstrap Methods: Another Look at the Jackknife (1979), Efron, @Project Euclid.
- See also: Problems in Plane Sampling (1949), Quenouille, @Project Euclid 🏛️.
- See also: Notes on Bias Estimation (1958), Quenouille, @JSTOR 🏛️.
- See also: Bias and Confidence in Not-quite Large Samples (1958), Tukey, @Project Euclid 🔬.
Credits
Special thanks to Alexandre Passos for his comment on this Reddit thread, and to the responders to this Quora post. They suggested many of the papers that got this list off to a strong start.