distributed-training topic

List distributed-training repositories

HandyRL (282 stars, 41 forks)

HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.

pytorch-sync-batchnorm-example (247 stars, 24 forks)

How to use cross-replica / synchronized batchnorm in PyTorch
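The core idea behind synchronized batchnorm is that batch statistics are computed over *all* replicas (via an all-reduce of per-device sums) rather than per device, which matters when per-GPU batches are small. A minimal pure-Python sketch of that aggregation (the function name is illustrative, not from the repo):

```python
def sync_batch_stats(replica_batches):
    """Compute the global mean/variance across all replicas' batches,
    as synchronized batchnorm does via an all-reduce.

    replica_batches: list of per-device lists of scalar activations.
    """
    # Each replica contributes its local sum, sum of squares, and count;
    # these three numbers are all that must be communicated.
    total, total_sq, count = 0.0, 0.0, 0
    for batch in replica_batches:
        total += sum(batch)
        total_sq += sum(x * x for x in batch)
        count += len(batch)
    mean = total / count
    var = total_sq / count - mean * mean  # E[x^2] - E[x]^2
    return mean, var
```

In PyTorch itself, this is typically enabled by converting a model with `torch.nn.SyncBatchNorm.convert_sync_batchnorm` before wrapping it in `DistributedDataParallel`.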

libai (377 stars, 55 forks)

LiBai (李白): A Toolbox for Large-Scale Distributed Parallel Training

deep-gradient-compression (206 stars, 43 forks)

[ICLR 2018] Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
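Deep Gradient Compression cuts communication by sending only the largest-magnitude gradient entries each step and accumulating the rest locally until they grow large enough to matter. A hedged pure-Python sketch of that top-k sparsification with a residual buffer (names are illustrative, not the repo's API):

```python
def sparsify_gradients(grads, residual, k):
    """Keep the k largest-magnitude gradients (after folding in the
    locally accumulated residual); the rest stay in the residual
    buffer, as in Deep Gradient Compression's local accumulation.

    Returns (sparse, new_residual): `sparse` maps index -> value and is
    what would be communicated; `new_residual` is kept on the worker.
    """
    # Fold in residual left over from previous steps.
    corrected = [g + r for g, r in zip(grads, residual)]
    # Indices of the k largest-magnitude entries.
    top = sorted(range(len(corrected)),
                 key=lambda i: abs(corrected[i]), reverse=True)[:k]
    sparse = {i: corrected[i] for i in top}
    new_residual = [0.0 if i in sparse else corrected[i]
                    for i in range(len(corrected))]
    return sparse, new_residual
```

The paper's full recipe also adds momentum correction and gradient clipping on top of this sparsification; this sketch shows only the communication-reducing core.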

EasyParallelLibrary (252 stars, 49 forks)

Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.

OpenKS (155 stars, 67 forks)

OpenKS: a domain-generalizable knowledge learning and computing engine

skypilot (7.8k stars, 624 forks)

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.

Dynamic training with Apache MXNet reduces the cost and time of training deep neural networks by leveraging AWS cloud elasticity and scale. The system does so by dynamically updati...

distributed-pytorch (89 stars, 24 forks)

Distributed, mixed-precision training with PyTorch
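Mixed-precision training multiplies the loss by a large scale factor before backprop so that small fp16 gradients do not underflow to zero, then divides the gradients by the same factor (and checks for overflow) before the optimizer step. A minimal pure-Python sketch of that unscale-and-check step (the function name is illustrative; PyTorch's `torch.cuda.amp.GradScaler` performs the real version):

```python
import math

def unscale_and_check(scaled_grads, scale):
    """Undo loss scaling and flag non-finite gradients -- the check a
    gradient scaler performs before allowing the optimizer update.
    If `found_inf` is True, the step is skipped and the scale lowered.
    """
    grads = [g / scale for g in scaled_grads]
    found_inf = any(not math.isfinite(g) for g in grads)
    return grads, found_inf
```

Dynamic loss scaling then grows the scale after a run of clean steps and halves it whenever `found_inf` trips, keeping gradients in fp16's representable range.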

pytorch-model-parallel (76 stars, 19 forks)

A memory-balanced and communication-efficient model-parallel implementation of a fully connected layer with cross-entropy loss in PyTorch
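The usual layout for a model-parallel classification head is to split the final FC weight matrix by class across devices, so each device holds the weights (and computes the logits) for only its slice of classes; cross-entropy then only needs a small all-reduce of per-shard max and sum terms for the softmax normalizer. A pure-Python sketch of the sharding and per-device logit computation (function names are illustrative, not from this repo):

```python
def shard_classes(weights, num_devices):
    """Split per-class weight rows into contiguous shards, one per
    device, so each device stores only its slice of the FC layer."""
    per = (len(weights) + num_devices - 1) // num_devices  # ceil division
    return [weights[d * per:(d + 1) * per] for d in range(num_devices)]

def parallel_logits(x, shards):
    """Each 'device' computes logits only for its own class shard;
    concatenating the shard outputs recovers the full logit vector."""
    def dot(w, v):
        return sum(wi * vi for wi, vi in zip(w, v))
    return [[dot(w, x) for w in shard] for shard in shards]
```

This keeps memory balanced because no single device ever materializes the full `num_classes x hidden` matrix, which is the dominant cost for very large label spaces (e.g. face-recognition identity heads).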