elastic
PyTorch elastic training
### Question

I followed the tutorial and used the following command to launch torchelastic:

```
export NUM_TRAINERS=2
python -m torchelastic.distributed.launch \
    --nnodes=1:4 \
    --nproc_per_node=$NUM_TRAINERS \
    --rdzv_id=1 \
    --rdzv_backend=etcd \
    ...
```
## Description Currently, when cluster membership changes, the agent kills all the workers running the user's script, redistributes ranks, and spawns the workers again. Since there is...
## ❓ Questions and Help ### Please note that this issue tracker is not a help form and this issue will be closed. Before submitting, please ensure you have gone...