Unable to create DiskSaver when program launched with torch.distributed.launcher
🐛 Bug description
As mentioned in this issue in MONAI, I tried to run this tutorial code with torch.distributed.launch. However, the program froze when instantiating the CheckpointSaver. The reason is that ignite's DiskSaver cannot be created when the program is launched with torch.distributed.launch (I am using SLURM). I also noticed that it is likely caused by the get_rank() call in the one_rank_only decorator, which is used in the definition of DiskSaver: https://github.com/pytorch/ignite/blob/d16d15efbbbfc476702e91f3ab2bc757b839be26/ignite/distributed/utils.py#L595
I also did a simple experiment to verify this. I launched the following script with srun python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 script.py and found that the program froze at creating the DiskSaver.
import torch.distributed as dist
from ignite.handlers import DiskSaver
from argparse import ArgumentParser


def create_disk_saver(args):
    dist.init_process_group(backend='nccl', init_method='env://')
    if dist.get_rank() == 0:
        print('building DiskSaver')
        disk_saver = DiskSaver(dirname='./runs/')
        print('DiskSaver built')
    dist.destroy_process_group()


def main():
    parser = ArgumentParser()
    parser.add_argument('--local_rank', type=int)
    args = parser.parse_args()
    create_disk_saver(args)


if __name__ == '__main__':
    main()
I would much appreciate it if you could fix this. I prefer launching the program with torch.distributed.launch over the ignite.distributed.Parallel context manager, as it has fewer issues with the SLURM environment.
Environment
- PyTorch Version (e.g., 1.4): 1.8
- Ignite Version (e.g., 0.3.0): 0.4.4
- OS (e.g., Linux): Linux
- How you installed Ignite (conda, pip, source): pip
- Python version: 3.8
- Any other relevant information:
@sandylaker Thank you for opening this issue. I will have a look asap and try to reproduce on my side.
Moreover, could you explain a bit more about the issues you faced using SLURM and ignite.distributed.Parallel?
Just looking at your code, it can't work if you create the DiskSaver in an if section restricted to only one process. It seems that DiskSaver needs a collective __init__ call.
Thank you very much. The SLURM issue is already mentioned here: one should not set nproc_per_node and nnodes when using SLURM and idist.Parallel. That issue was actually fixed by a recent PR. I tested multi-node training and it worked. But when I tested single-node, multi-GPU training with a command like srun -N1 -n4 python train.py, the job kept waiting in the queue and could not be launched.
Back to this issue, I did an additional experiment, where I re-implemented one_rank_only and DiskSaver by replacing get_rank() with torch.distributed.get_rank(), and then launched the program with torch.distributed.launch. It worked fine, which in turn confirmed my hypothesis.
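A rough sketch of such a modified decorator (not the exact code, just the idea of using torch.distributed.get_rank() directly so that no ignite comp-model initialization is triggered):

import functools

import torch.distributed as dist


def one_rank_only(rank=0, with_barrier=False):
    # run the wrapped function on `rank` only, using torch.distributed directly
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            ret = None
            if dist.get_rank() == rank:
                ret = func(*args, **kwargs)
            if with_barrier:
                dist.barrier()
            return ret
        return wrapper
    return decorator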
My bad. I had a look and it seemed that my answer was enough. https://github.com/pytorch/ignite/issues/1968#issuecomment-828425106
Don't you agree?
I will try to test that today on my side and let you know.
@sandylaker could you please test this code with the nightly version: pip install --pre pytorch-ignite?
I think it should raise this runtime error:
https://github.com/pytorch/ignite/blob/d16d15efbbbfc476702e91f3ab2bc757b839be26/ignite/distributed/comp_models/native.py#L218-L222
In general, I think calling srun python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 script.py is incorrect, as srun creates a new job with 1 proc per node while torch.distributed.launch spawns 4 procs per node. What do you think?
- Yes, it threw the error as expected. But when I switched to the ignite.distributed.Parallel context manager and launched the script with srun -N=1 -n=4 --ntasks-per-node=4 script.py, the job got stuck in the queue waiting for resources to be allocated, even though the 4-GPU node was free the whole time. That's why I used torch.distributed.launch to launch DDP training.
- I would like to run the script on a single node which contains 4 GPUs.
If the script doesn't run and stays in the queue, IMO it's not a problem with ignite. Did you try with another script?
@sdesrozis Yeah, it might be caused by the SLURM settings. I did not run another script with ignite.distributed.Parallel, but I have run multiple scripts with torch.distributed.launch and they worked quite well. So I would like to know whether ignite allows DDP training with torch.distributed.launch, or whether I have to stick to ignite.distributed.Parallel.
You can do as you prefer, but using ignite.distributed.Parallel you would be able to use torch.distributed.launch, torch.multiprocessing.spawn, SLURM, XLA and Horovod as well, with a single code base.
Please have a look here https://github.com/sdesrozis/why-ignite/tree/main/basics
We are currently finishing writing a blog article explaining how ignite can help about parallel computing.
HTH
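For reference, a minimal sketch of such a single script (a hypothetical helloworld-style main_fn; this is not the exact helloworld.py from the repo above):

import argparse

import ignite.distributed as idist


def main_fn(local_rank, args):
    # idist gives a backend-agnostic view of the distributed configuration
    print(f"hello from process {idist.get_rank()}/{idist.get_world_size()} "
          f"(backend: {idist.backend()})")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--backend", type=str, default="nccl")
    args = parser.parse_args()

    # The same script can be launched with torch.distributed.launch (--use_env),
    # with srun, or even run serially; Parallel detects the configuration.
    with idist.Parallel(backend=args.backend) as parallel:
        parallel.run(main_fn, args)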
I will try tomorrow on my lab's cluster and let you know. Sorry for the delay, I can't access it so easily.
@sdesrozis @sandylaker I think I now understand the issue with DiskSaver and the above code snippet.
Yes, it won't work and will freeze because DiskSaver uses the one_rank_only decorator.
What happens is the following:
dist.init_process_group(backend='nccl', init_method='env://')
# idist is not aware of this distributed group, as no idist methods were
# called to set up the comp model from the context
if dist.get_rank() == 0:
    print('building DiskSaver')
    disk_saver = DiskSaver(dirname='./runs/')
    # DiskSaver has a few methods decorated with one_rank_only.
    # Inside them we call idist.get_rank(), which is dispatched to
    # _NativeDistModel._init_from_context() and especially to
    # _NativeDistModel._compute_nproc_per_node(), where we set up an attribute
    # with a collective op: dist.all_reduce -> cannot work on a single rank.
To make the code snippet work, we have to remove the if dist.get_rank() == 0 guard.
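In other words, a sketch of the function above with the guard removed (imports as in the original snippet):

def create_disk_saver(args):
    dist.init_process_group(backend='nccl', init_method='env://')

    # DiskSaver is created collectively on every rank; it internally restricts
    # actual disk writes to rank 0, so no rank guard is needed around it
    disk_saver = DiskSaver(dirname='./runs/')

    if dist.get_rank() == 0:
        print('DiskSaver built')

    dist.destroy_process_group()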
Another way to make the previous code work:
import torch.distributed as dist
import ignite.distributed as idist
from ignite.handlers import DiskSaver
from argparse import ArgumentParser


def create_disk_saver(args):
    dist.init_process_group(backend='nccl', init_method='env://')

    # sync distributed context
    idist.sync()

    if dist.get_rank() == 0:
        print('building DiskSaver')
        disk_saver = DiskSaver(dirname='./runs/')
        print('DiskSaver built')

    dist.destroy_process_group()


def main():
    parser = ArgumentParser()
    parser.add_argument('--local_rank', type=int)
    args = parser.parse_args()
    create_disk_saver(args)


if __name__ == '__main__':
    main()
On the other hand, we say in our docs about Checkpoint that:
This class is distributed configuration-friendly: it is not required to instantiate the class in rank 0 only process. This class supports automatically distributed configuration and if used with DiskSaver, checkpoint is stored by rank 0 process.
but it does not clearly say that, for native torch DDP, DiskSaver should be created on all ranks.
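For illustration, a minimal sketch of the intended pattern (the model and trainer here are placeholders; the point is that Checkpoint and DiskSaver are created on all ranks, without a rank guard):

import torch.nn as nn

from ignite.engine import Engine, Events
from ignite.handlers import Checkpoint, DiskSaver


def train_step(engine, batch):
    # placeholder training step
    return 0.0


net = nn.Linear(10, 2)      # placeholder model
trainer = Engine(train_step)

# Checkpoint/DiskSaver are constructed on ALL ranks; DiskSaver takes care of
# writing only from rank 0.
to_save = {"model": net}
handler = Checkpoint(to_save, DiskSaver("./runs/", require_empty=False), n_saved=2)
trainer.add_event_handler(Events.EPOCH_COMPLETED, handler)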
PS: @sandylaker thanks a lot for raising this issue again!
@vfdev-5 Yes, that's what I mentioned when looking at the code a few days ago. However, you explained it better 😊
Parallel/sequential sections remain a tricky (and classical) thing in parallel computing. Having to manage the two behaviours (a collective call similar to a reduction, or a per-process guard) makes the code more complicated. One idea would be to define handlers only collectively; we would avoid the if clauses and it would be simpler.
Although I don't know if the bug label should be added to this issue.
Last thing, I didn't understand how idist.sync() would help; it doesn't remove the collective code section, does it?
@sdesrozis Thank you for your explanation. I removed the if block and ran with srun python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 script.py, but got the following error:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
@sdesrozis, in this particular code, as I noted in the comments, idist is not aware of the dist config until we define DiskSaver, and one_rank_only implicitly calls _NativeDistModel._compute_nproc_per_node() from _NativeDistModel._init_from_context().
If we sync with idist.sync(), then _NativeDistModel._init_from_context() is called by all procs and everything should be OK.
@sandylaker maybe you have to specify torch.cuda.set_device(local_rank) to make NCCL happy?
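i.e. something along these lines (a sketch of where the call would go in the snippet above; it also needs an extra import torch):

def create_disk_saver(args):
    # bind this process to its own GPU before the first NCCL collective,
    # otherwise several processes may end up using GPU 0
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend='nccl', init_method='env://')

    disk_saver = DiskSaver(dirname='./runs/')

    dist.destroy_process_group()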
@vfdev-5 I tried, but got the same error.
Well, it is very strange. Meanwhile, you can try using "gloo" as the backend. Which PyTorch and CUDA versions do you have?
@vfdev-5 pytorch 1.8.1
@sandylaker which CUDA version?
I cannot reproduce the NCCL issue on my setup with 1.8.1, CUDA 11.1 and 2 GPUs.
CUDA 11.0
Yes, but the if clause in your example should freeze the code, shouldn't it?
EDIT:
It only blocks if with_barrier=True is passed to one_rank_only; I assumed that was the default!! My bad.
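For reference, a small usage sketch of the two behaviours (assuming a process group is already initialized):

import ignite.distributed as idist


@idist.one_rank_only()                    # with_barrier=False is the default:
def log_something():                      # other ranks simply skip the call
    print("only rank 0 runs this")


@idist.one_rank_only(with_barrier=True)   # all ranks wait at a barrier after
def save_something():                     # rank 0 has executed the function
    print("rank 0 runs this, everyone synchronizes")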
srun is a launcher similar to torch.distributed.launch. The interaction between these two tools can be quite weird (but explainable). See for instance https://github.com/sdesrozis/why-ignite/tree/main/basics/1_torchdistributed#more-advanced--limited-compatibility-with-slurm
I don't know what your default srun configuration is. I suppose it's one node, but maybe it could lead to this result with torch.distributed.launch. Did you try setting a one-node, one-task configuration for srun?
@sdesrozis So it is like this:
(a) srun -N1 -n4 --ntasks_per_node=4: the job gets stuck in the queue;
(b) srun -N4 -n4 --ntasks_per_node=1: works for multi-node training.
But as I said, I have successfully run programs written with the native PyTorch distributed module. They worked well in both single-node and multi-node settings.
@sandylaker I tried a few runs on the cluster of my company.
1 - using srun and torch.distributed.launch without ignite.distributed.Parallel
srun -N1 -n1 python -m torch.distributed.launch --nproc_per_node 2 helloworld.py
> [http://127.0.0.1:29500] hello from [ener021:nccl] process 0/2
> [http://127.0.0.1:29500] hello from [ener021:nccl] process 1/2
NOTE: I removed some mandatory options like -J, -p, --mem, etc., which are related to my cluster's own configuration.
srun -N1 -n1 python -m torch.distributed.launch --nproc_per_node 8 helloworld.py --backend="gloo"
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 0/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 1/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 2/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 3/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 4/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 5/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 6/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 7/8
2 - using srun, without torch.distributed.launch and without ignite.distributed.Parallel
srun -N1 -n2 python helloworld.py
> [http://ener021:22163] hello from [ener021:nccl] process 0/2
> [http://ener021:22163] hello from [ener021:nccl] process 1/2
srun -N1 -n8 python helloworld.py --backend="gloo"
> [http://ener021:22165] hello from [ener021:gloo] process 0/8
> [http://ener021:22165] hello from [ener021:gloo] process 1/8
> [http://ener021:22165] hello from [ener021:gloo] process 2/8
> [http://ener021:22165] hello from [ener021:gloo] process 3/8
> [http://ener021:22165] hello from [ener021:gloo] process 4/8
> [http://ener021:22165] hello from [ener021:gloo] process 5/8
> [http://ener021:22165] hello from [ener021:gloo] process 6/8
> [http://ener021:22165] hello from [ener021:gloo] process 7/8
3 - using srun and torch.distributed.launch with ignite.distributed.Parallel
One script, both usages.
On a computing node, use torch.distributed.launch
python -m torch.distributed.launch --nproc_per_node 2 --use_env helloworld.py
> 2021-06-08 08:57:28,548 ignite.distributed.launcher.Parallel INFO: Initialized processing group with backend: 'nccl'
> 2021-06-08 08:57:28,548 ignite.distributed.launcher.Parallel INFO: - Run '<function main_fn at 0x2aac7e5bf4c0>' in 2 processes
> [http://127.0.0.1:29500] hello from [ener021:nccl] process 0/2
> [http://127.0.0.1:29500] hello from [ener021:nccl] process 1/2
> 2021-06-08 08:57:28,548 ignite.distributed.launcher.Parallel INFO: End of run
> 2021-06-08 08:57:28,548 ignite.distributed.launcher.Parallel INFO: Finalized processing group with backend: 'nccl'
python -m torch.distributed.launch --nproc_per_node 8 --use_env helloworld.py --backend="gloo"
> 2021-06-08 08:58:22,682 ignite.distributed.launcher.Parallel INFO: Initialized processing group with backend: 'gloo'
> 2021-06-08 08:58:22,683 ignite.distributed.launcher.Parallel INFO: - Run '<function main_fn at 0x2b0ae40ec4c0>' in 8 processes
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 0/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 1/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 2/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 3/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 4/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 5/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 6/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 7/8
> 2021-06-08 08:58:22,685 ignite.distributed.launcher.Parallel INFO: End of run
> 2021-06-08 08:58:22,685 ignite.distributed.launcher.Parallel INFO: Finalized processing group with backend: 'gloo'
On the frontend, use srun (or sbatch)
srun -N1 -n2 python helloworld.py
> 2021-06-08 09:00:56,121 ignite.distributed.launcher.Parallel INFO: Initialized processing group with backend: 'nccl'
> 2021-06-08 09:00:56,121 ignite.distributed.launcher.Parallel INFO: - Run '<function main_fn at 0x2b10e3ce34c0>' in 2 processes
> [http://ener021:22182] hello from [ener021:nccl] process 0/2
> [http://ener021:22182] hello from [ener021:nccl] process 1/2
> 2021-06-08 09:00:56,132 ignite.distributed.launcher.Parallel INFO: End of run
> 2021-06-08 09:00:56,132 ignite.distributed.launcher.Parallel INFO: Finalized processing group with backend: 'nccl'
srun -N1 -n8 python helloworld.py --backend="gloo"
> 2021-06-08 09:02:26,940 ignite.distributed.launcher.Parallel INFO: Initialized processing group with backend: 'gloo'
> 2021-06-08 09:02:26,941 ignite.distributed.launcher.Parallel INFO: - Run '<function main_fn at 0x2b3c2f58f4c0>' in 8 processes
> [http://ener021:22185] hello from [ener021:gloo] process 0/8
> [http://ener021:22185] hello from [ener021:gloo] process 1/8
> [http://ener021:22185] hello from [ener021:gloo] process 2/8
> [http://ener021:22185] hello from [ener021:gloo] process 3/8
> [http://ener021:22185] hello from [ener021:gloo] process 4/8
> [http://ener021:22185] hello from [ener021:gloo] process 5/8
> [http://ener021:22185] hello from [ener021:gloo] process 6/8
> [http://ener021:22185] hello from [ener021:gloo] process 7/8
> 2021-06-08 09:02:26,946 ignite.distributed.launcher.Parallel INFO: End of run
> 2021-06-08 09:02:26,947 ignite.distributed.launcher.Parallel INFO: Finalized processing group with backend: 'gloo'
HTH
@sdesrozis Thank you very much for the detailed experiments. I suppose there might be some incorrect SLURM configuration on my server.
The useful options on my side:
- --wckey to attach the runs to our projects
- -J to give an authorized job name
- -p to select the right partition
- --gres=gpu:2 to select the GPUs
- --mem to add some memory, because a low-memory Linux is used