elastic icon indicating copy to clipboard operation
elastic copied to clipboard

PyTorch elastic training

Results 14 elastic issues
Sort by recently updated
recently updated
newest added

### Question I followed the tutorial and used the following command to launch the torchelastic: ``` export NUM_TRAINERS=2 python -m torchelastic.distributed.launch \ --nnodes=1:4 \ --nproc_per_node=$NUM_TRAINERS \ --rdzv_id=1 \ --rdzv_backend=etcd \...

## Description Currently, when the cluster membership change occurs, the agent will kill all the workers that run users script, perform rank redistribution and spawn them again. Since there is...

## ❓ Questions and Help ### Please note that this issue tracker is not a help form and this issue will be closed. Before submitting, please ensure you have gone...