
Paper List for Machine Learning Systems

Paper list for broad topics in machine learning systems

NOTE: Survey papers are annotated with the [Survey 🔍] prefix.

Table of Contents

  • Paper List for Machine Learning Systems
    • 1. Data Processing
      • 1.1 Data pipeline optimization
        • 1.1.1 General
        • 1.1.2 Prep stalls
        • 1.1.3 Fetch stalls (I/O)
        • 1.1.4 Specific workloads (GNN, DLRM)
      • 1.2 Caching and Distributed storage for ML training
      • 1.3 Data formats
      • 1.4 Data pipeline fairness and correctness
      • 1.5 Data labeling automation
    • 2. Training System
      • 2.1 Empirical Study on ML Jobs
      • 2.2 DL scheduling
      • 2.3 GPU sharing
      • 2.4 GPU memory management and optimization
      • 2.5 GPU memory usage estimate
      • 2.6 Distributed training (Parallelism)
        • 2024
        • 2023
        • 2022
        • 2021
        • 2020
        • ~2019
        • Survey Papers
      • 2.7 DL job failures
      • 2.8 Model checkpointing
      • 2.9 AutoML
      • 2.10 Communication optimization
      • 2.11 Energy-efficient DNN training (carbon-aware)
      • 2.12 DNN compiler
      • 2.13 Model pruning and compression
      • 2.14 GNN training system
      • 2.15 Congestion control for DNN training
    • 3. Inference System
    • 4. Federated Learning
    • 5. Privacy-Preserving ML
    • 6. ML APIs & Application-side Optimization
    • 7. ML for Systems
    • Others
  • References

1. Data Processing

1.1 Data pipeline optimization

1.1.1 General

  • [arxiv'24] cedar: Composable and Optimized Machine Learning Input Data Pipelines
  • [MLSys'22] Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines
  • [ISCA'22] Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
  • [SIGMOD'22] Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines
  • [VLDB'21] Analyzing and Mitigating Data Stalls in DNN Training
  • [VLDB'21] tf.data: A Machine Learning Data Processing Framework
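
As context for the entries above (Plumber, tf.data, and the data-stall analyses), the sketch below shows a minimal tf.data input pipeline with the knobs these systems reason about: parallel preprocessing and prefetching. The file pattern and decode function are placeholders, not taken from any of the papers.

```python
# Minimal tf.data input pipeline sketch (hypothetical file layout and preprocessing).
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

def decode_and_preprocess(path):
    # Placeholder preprocessing: read a JPEG and resize it.
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(image, [224, 224])

dataset = (
    tf.data.Dataset.list_files("data/train/*.jpg")              # hypothetical path
    .map(decode_and_preprocess, num_parallel_calls=AUTOTUNE)    # parallel decode/transform (prep stalls)
    .batch(128)
    .prefetch(AUTOTUNE)                                          # overlap input pipeline with the training step
)
```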

1.1.2 Prep stalls

1.1.3 Fetch stalls (I/O)

1.1.4 Specific workloads (GNN, DLRM)

1.2 Caching and Distributed storage for ML training

1.3 Data formats

  • [ECCV'22] L3: Accelerator-Friendly Lossless Image Format for High-Resolution, High-Throughput DNN Training
  • [VLDB'21] Progressive compressed records: Taking a byte out of deep learning data

1.4 Data pipeline fairness and correctness

  • [CIDR'21] Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines

1.5 Data labeling automation

  • [VLDB'18] Snorkel: Rapid Training Data Creation with Weak Supervision
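
The sketch below illustrates the weak-supervision idea behind Snorkel in plain Python: several noisy labeling functions vote on each unlabeled example. The task, labels, and labeling functions are made up for illustration, and the majority vote stands in for Snorkel's learned label model; this is not the Snorkel API.

```python
# Toy weak supervision in plain Python (not the Snorkel API): noisy labeling
# functions vote on each example; Snorkel replaces this majority vote with a
# learned generative label model.
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_contains_link(text):                  # heuristic labeling function
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) < 5 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]

def weak_label(text):
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]   # majority vote (ties broken arbitrarily)

print(weak_label("Claim your prize at https://example.com"))    # -> 1 (SPAM)
```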

2. Training System

2.1 Empirical Study on ML Jobs

2.2 DL scheduling

2.3 GPU sharing

  • [EuroSys'24] Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications
  • [ATC'23] Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent
  • [NSDI'23] Transparent GPU Sharing in Container Clouds for Deep Learning Workloads
  • [ICPP'23] FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference
  • [arxiv'23] MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters
  • [SoCC'22] MISO: exploiting multi-instance GPU capability on multi-tenant GPU clusters
  • [PACT'22] GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud
  • [ATC'21] Zico: Efficient GPU Memory Sharing for Concurrent DNN Training
  • [MLSys'20] Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications
  • [OSDI'20] AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
  • [OSDI'20] PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications

2.4 GPU memory management and optimization

2.5 GPU memory usage estimate

  • [ESEC/FSE'20] Estimating GPU memory consumption of deep learning models
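
As a point of reference (and not the estimation method of the paper above), a common back-of-envelope rule for fp32 training with Adam is roughly 16 bytes per parameter: weights, gradients, and two optimizer moment buffers, ignoring activations and framework overheads. A sketch:

```python
# Rough training-memory estimate: 4 B weights + 4 B grads + 8 B Adam moments
# per parameter; activations are ignored (they depend on batch size and architecture).
import torch.nn as nn

def rough_training_memory_gib(model: nn.Module, bytes_per_param: int = 16) -> float:
    num_params = sum(p.numel() for p in model.parameters())
    return num_params * bytes_per_param / 1024**3

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
print(f"{rough_training_memory_gib(model):.2f} GiB (weights + grads + Adam states only)")
```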

2.6 Distributed training (Parallelism)

2024

2023

2022

2021

  • [arxiv'21] Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training
  • [arxiv'21] GSPMD: General and Scalable Parallelization for ML Computation Graphs
  • [JMLR'21] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
  • [TPDS'21] TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism
  • [ATC'21] Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism
  • [SIGMOD'21] Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce [also in 2.10]
  • [MLSys'21] PipeMare: Asynchronous Pipeline Parallel DNN Training
  • [ICLR'21] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
  • [NeurIPS'21] Piper: Multidimensional Planner for DNN Parallelization
  • [ICML'21] Memory-Efficient Pipeline-Parallel DNN Training
  • [ICML'21] TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
  • [ICML'21] PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
  • [SC'21] Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
  • [SC'21] Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (PTD-P or Megatron-LM v2)
  • [FAST'21] Behemoth: A Flash-centric Training Accelerator for Extreme-scale DNNs
  • [PPoPP'21] DAPPLE: a pipelined data parallel approach for training large models
  • [VLDB'21] Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches
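
Several of the 2021 entries (DAPPLE, Chimera, TeraPipe, PipeMare, Memory-Efficient Pipeline-Parallel DNN Training) are about scheduling micro-batches across model stages. The sketch below shows the naive baseline they improve on: a model split into two stages with micro-batches pushed through sequentially and no overlap between stages. Device names, layer sizes, and micro-batch count are placeholders.

```python
# Naive two-stage pipeline sketch: micro-batches flow through stage0 then stage1
# with no overlap between stages (the scheduling problem the papers address).
import torch

dev0, dev1 = "cpu", "cpu"                 # e.g. "cuda:0", "cuda:1" on a 2-GPU machine
stage0 = torch.nn.Linear(512, 512).to(dev0)
stage1 = torch.nn.Linear(512, 10).to(dev1)
optimizer = torch.optim.SGD(
    list(stage0.parameters()) + list(stage1.parameters()), lr=0.1
)

batch = torch.randn(64, 512)
micro_batches = batch.chunk(4)            # 4 micro-batches of 16 samples each

optimizer.zero_grad()
for mb in micro_batches:
    act = stage0(mb.to(dev0))             # stage 0 forward
    out = stage1(act.to(dev1))            # stage 1 forward (activation copied across devices)
    loss = out.sum() / len(micro_batches)
    loss.backward()                       # gradients accumulate across micro-batches
optimizer.step()
```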

2020

  • [HPCA'20] AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators
  • [NeurIPS'20] Efficient Algorithms for Device Placement of DNN Graph Operators
  • [arxiv'20] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
  • [KDD'20 Tutorial] DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters
  • [VLDB'20] PyTorch Distributed: Experiences on Accelerating Data Parallel Training
  • [OSDI'20] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (BytePS)
  • [SOSP'19] PipeDream: Generalized Pipeline Parallelism for DNN Training
  • [NeurIPS'20] Language Models are Few-Shot Learners [From OpenAI]
  • [arxiv'20] Scaling Laws for Neural Language Models [From OpenAI]
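
The PyTorch Distributed paper above describes the DistributedDataParallel (DDP) module; a minimal data-parallel script using it looks roughly like the sketch below. It assumes one process per GPU launched with torchrun (e.g. `torchrun --nproc_per_node=4 train_ddp.py`), and the model and data are toy placeholders.

```python
# Minimal DDP training sketch; torchrun sets RANK, WORLD_SIZE, LOCAL_RANK
# and the rendezvous environment variables.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")       # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    model = DDP(torch.nn.Linear(1024, 1024).to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(10):                           # toy loop over random data
        x = torch.randn(32, 1024, device=device)
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()                           # DDP all-reduces gradients here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```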

~2019

Survey Papers

  • [Survey 🔍] [IJCAI'22] Survey on Efficient Training of Large Neural Networks
  • [Survey 🔍] [ACM CSUR'19] Demystifying Parallel and Distributed Deep Learning
  • [Survey 🔍] [ACM CSUR'19] Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques, and Tools

2.7 DL job failures

  • [ATC'22] Sibylla: To Retry or Not To Retry on Deep Learning Job Failure
  • [ICSE'20] An Empirical Study on Program Failures of Deep Learning Jobs

2.8 Model checkpointing

  • [FAST'21] CheckFreq: Frequent, Fine-Grained DNN Checkpointing
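
The sketch below shows the simple periodic checkpointing baseline that CheckFreq refines: every fixed number of iterations, snapshot model and optimizer state to a temporary file and rename it into place so a crash never leaves a torn checkpoint. The model, path, and frequency are placeholders; choosing that frequency adaptively is what CheckFreq automates.

```python
# Periodic checkpointing sketch (placeholder model, path, and frequency).
import os
import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    tmp_path = path + ".tmp"
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        tmp_path,
    )
    os.replace(tmp_path, path)            # atomic rename: old checkpoint survives a crash mid-write

model = torch.nn.Linear(128, 128)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
checkpoint_every = 100                    # fixed frequency; CheckFreq tunes this automatically

for step in range(1, 1001):
    optimizer.zero_grad()
    loss = model(torch.randn(32, 128)).sum()
    loss.backward()
    optimizer.step()
    if step % checkpoint_every == 0:
        save_checkpoint(model, optimizer, step)
```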

2.9 AutoML

  • [OSDI'23] Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
  • [NSDI'23] ModelKeeper: Accelerating DNN Training via Automated Training Warmup
  • [OSDI'20] Retiarii: A Deep Learning Exploratory-Training Framework
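
As a reminder of the basic workload these tuning services schedule at scale, the sketch below runs a toy random search over two hyperparameters. The objective function is a stand-in for a real training trial and is not taken from any of the papers above.

```python
# Toy random search over a two-dimensional hyperparameter space.
import random

def train_and_evaluate(lr: float, batch_size: int) -> float:
    # Placeholder objective; a real trial would train a model and
    # return validation accuracy.
    return 1.0 - abs(lr - 0.01) - 0.0001 * batch_size

best_score, best_config = float("-inf"), None
for _ in range(20):                                  # 20 independent trials
    config = {
        "lr": 10 ** random.uniform(-4, -1),          # log-uniform learning rate
        "batch_size": random.choice([32, 64, 128, 256]),
    }
    score = train_and_evaluate(**config)
    if score > best_score:
        best_score, best_config = score, config

print("best config:", best_config)
```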

2.10 Communication optimization

2.11 Energy-efficient DNN training (carbon-aware)

2.12 DNN compiler

2.13 Model pruning and compression

2.14 GNN training system

For a comprehensive list of GNN systems papers, refer to https://github.com/chwan1016/awesome-gnn-systems.

2.15 Congestion control for DNN training

3. Inference System

4. Federated Learning

5. Privacy-Preserving ML

6. ML APIs & Application-side Optimization

7. ML for Systems

Others

References

This repository is inspired by:

  • https://github.com/HuaizhengZhang/Awesome-System-for-Machine-Learning
  • https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers
  • https://github.com/ganler/ResearchReading
  • https://jeongseob.github.io/readings_mlsys.html
  • https://github.com/chwan1016/awesome-gnn-systems
  • https://github.com/ConnollyLeon/awesome-Auto-Parallelism