
PyTorch elastic training

Results 14 elastic issues

## Description Please add more torch elastic training examples, such as BERT model training for natural language processing. ## Motivation/Background We cannot find other torch elastic examples. ## Alternatives ## Additional...

## 🐛 Bug Component (check all that apply): * [ ] `state api` * [ ] `train_step api` * [ ] `train_loop` * [x] `rendezvous` * [ ] `checkpoint` *...

## 🐛 Bug I specify ttlSecondsAfterFinished in my ElasticJob spec like so: ```apiVersion: elastic.pytorch.org/v1alpha1 kind: ElasticJob metadata: name: job1 namespace: username spec: rdzvEndpoint: "etcd0.elastic-job:2379" minReplicas: 1 maxReplicas: 1 ttlSecondsAfterFinished: 10...

## Description Add CPU ImageNet or MNIST example ## Motivation/Background Now we only have GPU ImageNet example, which requires multiple GPUs to test the feature. It is better to support...

cc @Jeffwan I think pytorch-operator and elasticjob-operator share a similar scope. I am wondering whether we can collaborate to support both PyTorch and PyTorch elastic well.

Fixes https://github.com/pytorch/elastic/issues/158

cla signed

This is useful because it removes the need to think about avoiding the submission of unnecessary code modifications to AWS.

If this repo is deprecated, it may make sense to move the cloud scripts out to a non-deprecated one. Having the simplest working automated cloud setup is very useful.

## 🐛 Bug The submodule `docs/src/pytorch-sphinx-theme` seems to be broken. This prevents the following command from working: ```console $ kubectl kustomize https://github.com/pytorch/elastic.git/kubernetes/config/default?ref=v0.2.2 ``` (workaround is to clone the repo first,...

Hi, I have the following error when I try to run my code with torchelastic: ``` Creating EtcdStore as the c10d::Store implementation Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/torchelastic/distributed/launch.py", line...