ModelKeeper

A Cluster-Wide Model Manager to Accelerate DNN Training via Automated Training Warmup

This repository contains the evaluation artifacts of our NSDI '23 paper "ModelKeeper: Accelerating DNN Training via Automated Training Warmup".

ModelKeeper is being merged into FedScale and is actively maintained there. Please try it out!

Overview

  • Getting Started
  • Run Experiments
  • Repo Structure
  • Contact

Getting Started

Our install.sh will install the following automatically:

  • Anaconda Package Manager
  • CUDA 10.2

Note: If you prefer different versions of conda and CUDA, please check the comments in install.sh for details.

Run the following commands to install ModelKeeper.

source install.sh 
pip install -e .
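
To sanity-check the setup, the import below should succeed once the editable install finishes. This is only a quick optional check suggested here, assuming the package name matches the modelkeeper directory in this repo:

# Optional sanity check: the package should be importable after `pip install -e .`.
import modelkeeper
print("ModelKeeper is importable")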

Run Experiments

Repo Structure

Repo Root
|---- modelkeeper   # Core implementation (e.g., the Matcher).
|---- evals         # ModelKeeper support for different training backends.
    |---- ray_tune      # Ray Tune experiments.
    |---- nni           # NNI Retiarii experiments.
|---- examples      # Toy experiments of model transformation.
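
The examples directory holds toy experiments of model transformation. Conceptually, ModelKeeper accelerates training by warm-starting a new model from the weights of a similar, already-trained model. The snippet below is only an illustrative PyTorch sketch of that warmup idea, not ModelKeeper's actual matcher or API; the parent/child models and the name-and-shape matching rule are stand-ins for illustration:

# Illustrative sketch only (not the ModelKeeper API): warm-start a new
# "child" model by copying weights from a trained "parent" model wherever
# parameter names and tensor shapes match, then train as usual.
import torchvision.models as models

parent = models.resnet18(num_classes=10)  # stand-in for a trained model from the zoo
child = models.resnet34(num_classes=10)   # new model to be warm-started

parent_state = parent.state_dict()
child_state = child.state_dict()

transferred = 0
for name, tensor in child_state.items():
    if name in parent_state and parent_state[name].shape == tensor.shape:
        child_state[name] = parent_state[name].clone()
        transferred += 1

child.load_state_dict(child_state)
print(f"Warm-started {transferred}/{len(child_state)} tensors from the parent model")
# Training then proceeds normally; ModelKeeper's matcher decides which parent
# to use and how to map layers, which this toy example does not capture.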

Notes

Please consider citing our paper if you use the code or data in your research project.

@inproceedings{modelkeeper-nsdi23,
  title={ModelKeeper: Accelerating DNN Training via Automated Training Warmup},
  author={Fan Lai and Yinwei Dai and Harsha V. Madhyastha and Mosharaf Chowdhury},
  booktitle={USENIX Symposium on Networked Systems Design and Implementation (NSDI)},
  year={2023}
}

Contact

Fan Lai ([email protected]) and Yinwei Dai ([email protected]).