ModelKeeper
                                
                        A Cluster-Wide Model Manager to Accelerate DNN Training via Automated Training Warmup
This repository contains the evaluation artifacts of our NSDI '23 paper "ModelKeeper: Accelerating DNN Training via Automated Training Warmup".
ModelKeeper is being merged as part of FedScale and is actively maintained there. Please try it!
Overview
- Getting Started
- Run Experiments
- Repo Structure
- Contact
Getting Started
Our install.sh will install the following automatically:
- Anaconda Package Manager
- CUDA 10.2
Note: if you prefer different versions of conda or CUDA, please check the comments in install.sh for details.
Run the following commands to install ModelKeeper.
source install.sh 
pip install -e .
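After the two commands above finish, a quick sanity check can confirm that the setup is visible from the current shell. The sketch below is not part of the repository. It assumes the editable install exposes a top-level package named `modelkeeper` (inferred from the directory name in the repo layout), and that `conda` and `nvcc` are on `PATH` after sourcing install.sh, which may not hold for every setup. The function name `check_environment` is hypothetical.

```python
import importlib.util
import shutil


def check_environment(package="modelkeeper"):
    """Report which pieces of the setup are visible from this interpreter.

    The default package name is an assumption based on the repo's
    `modelkeeper` directory; pass another name if the install differs.
    """
    return {
        # Assumption: sourcing install.sh leaves `conda` on PATH.
        "conda_on_path": shutil.which("conda") is not None,
        # Assumption: the CUDA toolkit's bin directory (with `nvcc`) is on PATH.
        "nvcc_on_path": shutil.which("nvcc") is not None,
        # find_spec locates a top-level package without importing it.
        "package_found": importlib.util.find_spec(package) is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

A "no" for `nvcc_on_path` does not necessarily mean CUDA is missing; it may only mean the toolkit's bin directory has not been added to `PATH`.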
Run Experiments
Repo Structure
Repo Root
|---- modelkeeper   # Core implementation (e.g., Matcher).
|---- evals         # MK support for different training backends
    |---- ray_tune      # Ray experiments
    |---- nni           # Retiarii experiments
|---- examples      # Toy experiments of model transformation
Notes
Please consider citing our paper if you use the code or data in your research project.
@inproceedings{modelkeeper-nsdi23,
  title={ModelKeeper: Accelerating DNN Training via Automated Training Warmup},
  author={Fan Lai and Yinwei Dai and Harsha V. Madhyastha and Mosharaf Chowdhury},
  booktitle={USENIX Symposium on Networked Systems Design and Implementation (NSDI)},
  year={2023}
}
Contact
Fan Lai ([email protected]) and Yinwei Dai ([email protected]).