distributed-training topic
HandyRL
HandyRL is a simple, handy framework for distributed reinforcement learning, built on Python and PyTorch and applicable to your own environments.
pytorch-sync-batchnorm-example
How to use Cross-Replica / Synchronized BatchNorm in PyTorch
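For context, PyTorch now ships this conversion natively as nn.SyncBatchNorm. A minimal sketch of wiring it into a DistributedDataParallel model (the toy model and the torchrun launch are assumptions for illustration):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

# expects to be launched with torchrun, which sets LOCAL_RANK and friends
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# replace every BatchNorm layer with SyncBatchNorm so batch statistics
# are computed across all replicas rather than per GPU
model = nn.SyncBatchNorm.convert_sync_batchnorm(model).cuda(local_rank)
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```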
libai
LiBai (李白): A Toolbox for Large-Scale Distributed Parallel Training
deep-gradient-compression
[ICLR 2018] Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
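DGC itself sparsifies gradients with momentum correction. As a simpler, related illustration of cutting gradient-communication bandwidth, here is a sketch using PyTorch DDP's built-in FP16 compression hook (the toy model and torchrun launch are assumptions, and this hook is not the paper's method):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
from torch.nn.parallel import DistributedDataParallel as DDP

# expects to be launched with torchrun
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(nn.Linear(1024, 1024).cuda(local_rank), device_ids=[local_rank])

# cast gradients to FP16 before the all-reduce, halving communication
# volume; far simpler than DGC's momentum-corrected sparsification
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```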
EasyParallelLibrary
Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.
OpenKS
OpenKS - a domain-generalizable knowledge learning and computing engine
skypilot
SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
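A minimal sketch of launching a job through SkyPilot's Python API; the setup/run commands, accelerator choice, and cluster name here are all illustrative assumptions:

```python
import sky

# define the job: install dependencies, then run the training script
task = sky.Task(
    setup="pip install torch",
    run="python train.py",
)
task.set_resources(sky.Resources(accelerators="V100:1"))

# provision a cluster on whichever configured infra is available and run
sky.launch(task, cluster_name="my-train-cluster")
```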
dynamic-training-with-apache-mxnet-on-aws
Dynamic training with Apache MXNet reduces the cost and time of training deep neural networks by leveraging AWS cloud elasticity and scale: the system dynamically updates the size of the training cluster during training.
distributed-pytorch
Distributed, mixed-precision training with PyTorch
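A minimal sketch of what that combination looks like with PyTorch's native AMP and DistributedDataParallel (the toy model, stand-in data, and torchrun launch are assumptions for illustration):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# expects to be launched with torchrun
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(nn.Linear(512, 10).cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()  # scales losses to avoid FP16 underflow

# stand-in data; a real job would use a DataLoader with a DistributedSampler
batches = [(torch.randn(8, 512), torch.randint(0, 10, (8,))) for _ in range(4)]

for inputs, targets in batches:
    inputs, targets = inputs.cuda(local_rank), targets.cuda(local_rank)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # forward pass runs in mixed precision
        loss = nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps
    scaler.update()
```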
pytorch-model-parallel
A memory-balanced and communication-efficient model-parallel implementation of a fully connected layer with CrossEntropyLoss in PyTorch
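A toy, single-process sketch of the underlying idea, sharding a classifier's output classes across two devices; the class name and two-way split are hypothetical, and the repo's distributed version additionally keeps logits sharded to balance memory and minimize communication:

```python
import torch
import torch.nn as nn

class ShardedClassifier(nn.Module):
    """Hypothetical class-sharded FC layer (single process, two devices)."""

    def __init__(self, in_features, num_classes,
                 devices=("cuda:0", "cuda:1")):
        super().__init__()
        half = num_classes // 2
        self.devices = devices
        # each device holds the weights for one slice of the classes
        self.fc0 = nn.Linear(in_features, half).to(devices[0])
        self.fc1 = nn.Linear(in_features, num_classes - half).to(devices[1])

    def forward(self, x):
        # each shard computes logits for its own slice of the classes
        logits0 = self.fc0(x.to(self.devices[0]))
        logits1 = self.fc1(x.to(self.devices[1]))
        # gather logits on one device to compute the loss; the repo's
        # approach avoids this concat by exchanging softmax statistics
        return torch.cat([logits0, logits1.to(self.devices[0])], dim=1)

model = ShardedClassifier(in_features=512, num_classes=100_000)
x = torch.randn(8, 512)
targets = torch.randint(0, 100_000, (8,)).to("cuda:0")
loss = nn.functional.cross_entropy(model(x), targets)
```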