
Unable to create DiskSaver when program launched with torch.distributed.launcher

Open sandylaker opened this issue 4 years ago • 27 comments

🐛 Bug description

As mentioned in this issue in MONAI, I tried to run this tutorial code with torch.distributed.launcher. However, the program froze when instantiating the CheckpointSaver. The reason is that ignite's DiskSaver cannot be created when the program is launched with torch.distributed.launcher (I am using SLURM). I also noticed that it might be caused by calling get_rank() in the one_rank_only decorator, which is used in the definition of DiskSaver: https://github.com/pytorch/ignite/blob/d16d15efbbbfc476702e91f3ab2bc757b839be26/ignite/distributed/utils.py#L595

I also did a simple experiment to verify this. I launched the following script with srun python -m torch.distributed.launcher --nproc_per_node=4 --nnodes=1 script.py, and found that the program froze when creating the DiskSaver.

import torch.distributed as dist
from ignite.handlers import DiskSaver
from argparse import ArgumentParser


def create_disk_saver(args):
	dist.init_process_group(backend='nccl', init_method='env://')

	if dist.get_rank() == 0:
		print('building DiskSaver')
		disk_saver = DiskSaver(dirname='./runs/')
		print('DiskSaver built')

	dist.destroy_process_group()


def main():
	parser = ArgumentParser()
	parser.add_argument('--local_rank', type=int)
	args = parser.parse_args()
	create_disk_saver(args)


if __name__ == '__main__':
	main()

I would much appreciate it if you could fix this. I prefer launching the program with torch.distributed.launcher to the ignite.distributed.Parallel context manager, as it has fewer issues with the SLURM environment.

Environment

  • PyTorch Version (e.g., 1.4): 1.8
  • Ignite Version (e.g., 0.3.0): 0.4.4
  • OS (e.g., Linux): Linux
  • How you installed Ignite (conda, pip, source): pip
  • Python version: 3.8
  • Any other relevant information:

sandylaker avatar Jun 05 '21 03:06 sandylaker

@sandylaker Thank you for opening this issue. I will have a look asap and try to reproduce on my side.

Moreover, could you explain in a bit more detail the issues you faced using slurm and ignite.distributed.Parallel?

sdesrozis avatar Jun 05 '21 08:06 sdesrozis

Just looking at your code, it can't work if you create the DiskSaver in an if section restricted to one process only. It seems that DiskSaver needs a collective __init__ call.

sdesrozis avatar Jun 05 '21 09:06 sdesrozis

@sandylaker Thank you for opening this issue. I will have a look asap and try to reproduce on my side.

Moreover, could you explain in a bit more detail the issues you faced using slurm and ignite.distributed.Parallel?

Thank you very much. The slurm issue is already mentioned here: one should not set nproc_per_node and nnodes when using SLURM and idist.Parallel. That issue was actually fixed by a recent PR. I tested multi-node training and it worked. But when I tested single-node multi-GPU training, with a command like srun -N1 -n4 python train.py, the job kept waiting in the queue and could not be launched.

Back to this issue: I did an additional experiment where I re-implemented one_rank_only and DiskSaver, replacing get_rank() with torch.distributed.get_rank(), and then launched the program with torch.distributed.launcher. It worked fine, which in turn verified my hypothesis.
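
For illustration, a rough sketch of the idea (not the exact code I used; the real one_rank_only in ignite also supports an optional barrier): the rank check goes through torch.distributed directly, so no ignite distributed context gets initialized as a side effect.

import functools
import torch.distributed as dist

def one_rank_only_patched(rank=0):
    # Same idea as ignite's one_rank_only, but the rank check uses
    # torch.distributed directly instead of ignite.distributed.get_rank().
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if dist.get_rank() == rank:
                return func(*args, **kwargs)
        return wrapper
    return decorator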

sandylaker avatar Jun 05 '21 09:06 sandylaker

My bad. I had a look and it seemed that my answer was enough. https://github.com/pytorch/ignite/issues/1968#issuecomment-828425106

Don't you agree?

sdesrozis avatar Jun 05 '21 12:06 sdesrozis

Back to this issue: I did an additional experiment where I re-implemented one_rank_only and DiskSaver, replacing get_rank() with torch.distributed.get_rank(), and then launched the program with torch.distributed.launcher. It worked fine, which in turn verified my hypothesis.

I will try to test that today on my side and let you know.

sdesrozis avatar Jun 05 '21 12:06 sdesrozis

@sandylaker could you please test this code with the nightly version: pip install --pre pytorch-ignite? I think it should raise this runtime error: https://github.com/pytorch/ignite/blob/d16d15efbbbfc476702e91f3ab2bc757b839be26/ignite/distributed/comp_models/native.py#L218-L222

In general, I think calling srun python -m torch.distributed.launcher --nproc_per_node=4 --nnodes=1 script.py is incorrect, as srun creates a new job with 1 proc per node while torch.distributed.launcher spawns 4 procs per node. What do you think?

vfdev-5 avatar Jun 06 '21 08:06 vfdev-5

@sandylaker could you please test this code with the nightly version: pip install --pre pytorch-ignite? I think it should raise this runtime error:

https://github.com/pytorch/ignite/blob/d16d15efbbbfc476702e91f3ab2bc757b839be26/ignite/distributed/comp_models/native.py#L218-L222

In general, I think calling srun python -m torch.distributed.launcher --nproc_per_node=4 --nnodes=1 script.py is incorrect, as srun creates a new job with 1 proc per node while torch.distributed.launcher spawns 4 procs per node. What do you think?

  1. Yes, it threw the error as expected. But when I changed to the ignite.distributed.Parallel context manager and launched the script with srun -N=1 -n=4 --ntasks-per-node=4 script.py, the job got stuck in the queue, waiting for resources to be allocated, even though the 4-GPU node was free the whole time. That's the reason why I used torch.distributed.launcher to launch DDP training.
  2. I would like to run the script on a single node that contains 4 GPUs.

sandylaker avatar Jun 06 '21 09:06 sandylaker

  1. Yes, it threw the error as expected. But when I changed to the ignite.distributed.Parallel context manager and launched the script with srun -N=1 -n=4 --ntasks-per-node=4 script.py, the job got stuck in the queue, waiting for resources to be allocated, even though the 4-GPU node was free the whole time. That's the reason why I used torch.distributed.launcher to launch DDP training.

If the script doesn't run and stays in the queue, IMO it's not a problem with ignite. Did you try with another script?

sdesrozis avatar Jun 06 '21 11:06 sdesrozis

@sdesrozis Yeah, it might be caused by the SLURM settings. I did not run another script with ignite.distributed.Parallel. But I have run multiple scripts with torch.distributed.launcher, and they worked quite well. Thus, I would like to know whether ignite allows DDP training with torch.distributed.launcher, or whether I have to stick to ignite.distributed.Parallel?

sandylaker avatar Jun 06 '21 11:06 sandylaker

You can do as you prefer, but using ignite.distributed.Parallel, you would be able to use torch.distributed.launch, torch.distributed.spawn, slurm, xla and horovod as well, with a single code base.
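
For reference, the usual pattern looks roughly like this (a minimal sketch, with the backend name as an example):

import ignite.distributed as idist

def training(local_rank):
    # local_rank is injected by Parallel; global rank and device come from idist
    print(idist.get_rank(), idist.device())

if __name__ == '__main__':
    with idist.Parallel(backend='nccl') as parallel:
        parallel.run(training)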

Please have a look here https://github.com/sdesrozis/why-ignite/tree/main/basics

We are currently finishing writing a blog article explaining how ignite can help with parallel computing.

HTH

sdesrozis avatar Jun 06 '21 12:06 sdesrozis

I will try tomorrow on my lab's cluster and let you know. Sorry for the delay, I can't access it so easily.

sdesrozis avatar Jun 06 '21 16:06 sdesrozis

@sdesrozis @sandylaker I think I now understand the issue with DiskSaver and the above code snippet. Yes, it won't work and will freeze because DiskSaver uses the one_rank_only decorator. What happens is the following:

    dist.init_process_group(backend='nccl', init_method='env://')
    # idist is not aware of this distributed group, as no idist method was
    # called to set up the comp model from the context

    if dist.get_rank() == 0:
        print('building DiskSaver')
        disk_saver = DiskSaver(dirname='./runs/')
        # DiskSaver has a few methods decorated with one_rank_only.
        # Inside these calls, idist.get_rank() is invoked, which is dispatched to
        # _NativeDistModel._init_from_context() and in particular
        # _NativeDistModel._compute_nproc_per_node, where an attribute is set up
        # with a collective op: dist.all_reduce -> cannot work on a single rank.

To make the code snippet work, we have to remove the if dist.get_rank() == 0 guard.
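
That is, roughly (a minimal sketch of this first fix):

import torch.distributed as dist
from ignite.handlers import DiskSaver

dist.init_process_group(backend='nccl', init_method='env://')
# collective construction: every rank creates the DiskSaver, so the collective
# op (dist.all_reduce) triggered via idist.get_rank() can complete on all ranks
disk_saver = DiskSaver(dirname='./runs/')
dist.destroy_process_group()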

Another way to make the previous code work:

import torch.distributed as dist
import ignite.distributed as idist
from ignite.handlers import DiskSaver
from argparse import ArgumentParser


def create_disk_saver(args):
    dist.init_process_group(backend='nccl', init_method='env://')

    # sync the distributed context
    idist.sync()

    if dist.get_rank() == 0:
        print('building DiskSaver')
        disk_saver = DiskSaver(dirname='./runs/')
        print('DiskSaver built')

    dist.destroy_process_group()


def main():
    parser = ArgumentParser()
    parser.add_argument('--local_rank', type=int)
    args = parser.parse_args()
    create_disk_saver(args)


if __name__ == '__main__':
    main()

On the other hand, we say in our docs about Checkpoint that:

This class is distributed configuration-friendly: it is not required to instantiate the class in rank 0 only process. This class supports automatically distributed configuration and if used with DiskSaver, checkpoint is stored by rank 0 process.

but this does not clearly say that DiskSaver should be created on all ranks for torch native ddp.
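
In other words, the documented usage is roughly the following (a sketch; the model and trainer below are placeholders standing in for the user's own objects):

import torch.nn as nn
from ignite.engine import Engine, Events
from ignite.handlers import Checkpoint, DiskSaver

model = nn.Linear(2, 2)                        # placeholder model for illustration
trainer = Engine(lambda engine, batch: None)   # placeholder trainer for illustration

# Checkpoint/DiskSaver are created on every rank; DiskSaver writes from rank 0 only
handler = Checkpoint({"model": model}, DiskSaver("./runs/", require_empty=False), n_saved=2)
trainer.add_event_handler(Events.EPOCH_COMPLETED, handler)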

PS: @sandylaker thanks a lot for raising this issue again!

vfdev-5 avatar Jun 06 '21 22:06 vfdev-5

Just looking at your code, it can't work if you create the DiskSaver in an if section restricted to one process only. It seems that DiskSaver needs a collective __init__ call.

@vfdev-5 Yes, that's what I mentioned when looking at the code a few days ago. However, you explained it better 😊

Parallel / sequential sections remain a tricky (and classical) thing in parallel computing. Having to manage the two behaviours (a collective call similar to a reduction, or a per-process guard) makes the code more complicated. An idea would be to define handlers only collectively; we would avoid the if clauses and it would be simpler.

Although I don't know if the bug label should be added to this issue.

One last thing: I didn't understand how idist.sync() would help; it doesn't remove the collective code section, does it?

sdesrozis avatar Jun 07 '21 05:06 sdesrozis

@sdesrozis Thank you for your explanation. I removed the if block and ran with srun python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 script.py, but got the following error:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

sandylaker avatar Jun 07 '21 06:06 sandylaker

One last thing: I didn't understand how idist.sync() would help; it doesn't remove the collective code section, does it?

@sdesrozis, in this particular code, as I added notes in the code, idist is not aware of the dist config until we define DiskSaver. And one_rank_only implicitly calls _NativeDistModel._compute_nproc_per_node() from _NativeDistModel._init_from_context(). If we sync with idist.sync() then _NativeDistModel._init_from_context() is called by all procs and everything should be OK.

vfdev-5 avatar Jun 07 '21 06:06 vfdev-5

@sdesrozis Thank you for your explanation. I removed the if block and ran with srun python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 script.py, but got the following error:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

@sandylaker maybe you have to specify torch.cuda.set_device(local_rank) to make NCCL happy?
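
e.g. in create_disk_saver above, before the process group is initialized (a sketch):

import torch

# pin this process to its GPU before any NCCL collective runs
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')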

vfdev-5 avatar Jun 07 '21 07:06 vfdev-5

@sdesrozis Thank you for your explanation. I removed the if block and ran with srun python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 script.py, but got the following error:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

@sandylaker maybe you have to specify torch.cuda.set_device(local_rank) to make NCCL happy?

@vfdev-5 I tried, but got the same error.

sandylaker avatar Jun 07 '21 07:06 sandylaker

Well, it is very strange. Meanwhile, you can try to use "gloo" as a backend. Which PyTorch and CUDA versions do you have?

vfdev-5 avatar Jun 07 '21 07:06 vfdev-5

@vfdev-5 pytorch 1.8.1

sandylaker avatar Jun 07 '21 07:06 sandylaker

@sandylaker which CUDA version?

I cannot reproduce the NCCL issue with my setup using 1.8.1, CUDA 11.1 and 2 GPUs.

vfdev-5 avatar Jun 07 '21 07:06 vfdev-5

@sandylaker which CUDA version?

I cannot reproduce the NCCL issue with my setup using 1.8.1, CUDA 11.1 and 2 GPUs.

CUDA 11.0

sandylaker avatar Jun 07 '21 08:06 sandylaker

@sdesrozis, in this particular code, as I added notes in the code, idist is not aware of the dist config until we define DiskSaver. And one_rank_only implicitly calls _NativeDistModel._compute_nproc_per_node() from _NativeDistModel._init_from_context().

If we sync with idist.sync() then _NativeDistModel._init_from_context() is called by all procs and everything should be OK.

Yes, but the if clause in your example should freeze the code, shouldn't it?

EDIT: It blocks if with_barrier=True is passed to one_rank_only; I guessed it was the default! My bad.
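
For reference, the two modes look roughly like this (a sketch; the decorator lives in ignite.distributed.utils, as linked at the top of this issue):

from ignite.distributed.utils import one_rank_only

@one_rank_only()                   # runs on rank 0 only, no barrier (the default)
def log_something():
    pass

@one_rank_only(with_barrier=True)  # all ranks synchronize around the call
def save_something():
    pass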

sdesrozis avatar Jun 07 '21 10:06 sdesrozis

@sdesrozis Thank you for your explanation. I removed the if block and ran with srun python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 script.py, but got the following error:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

srun is a launcher similar to torch.distributed.launch. The interaction between these two tools can be quite weird (but explainable). See for instance https://github.com/sdesrozis/why-ignite/tree/main/basics/1_torchdistributed#more-advanced--limited-compatibility-with-slurm

I don't know what the default srun configuration is. I suppose it's one node, but maybe it could lead to this result with torch.distributed.launch. Did you try to set a one-node, one-task configuration for srun?

sdesrozis avatar Jun 07 '21 10:06 sdesrozis

@sdesrozis So it is like this: (a) srun -N1 -n4 --ntasks-per-node=4: the job gets stuck in the queue; (b) srun -N4 -n4 --ntasks-per-node=1: works for multi-node.

But as I said, I have successfully run programs written with the native PyTorch distributed module. They worked well in both single-node and multi-node settings.

sandylaker avatar Jun 07 '21 11:06 sandylaker

@sandylaker I tried a few runs on the cluster of my company.

1 - using srun and torch.distributed.launch without ignite.distributed.Parallel

script, README

srun -N1 -n1 python -m torch.distributed.launch --nproc_per_node 2 helloworld.py

> [http://127.0.0.1:29500] hello from [ener021:nccl] process 0/2
> [http://127.0.0.1:29500] hello from [ener021:nccl] process 1/2

NOTE: I removed some mandatory options like -J, -p, --mem, etc. related to my cluster's own configuration.

srun -N1 -n1 python -m torch.distributed.launch --nproc_per_node 8 helloworld.py --backend="gloo"

> [http://127.0.0.1:29500] hello from [ener021:gloo] process 0/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 1/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 2/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 3/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 4/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 5/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 6/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 7/8

2 - using srun without torch.distributed.launch and ignite.distributed.Parallel

script, README

srun -N1 -n2 python helloworld.py

> [http://ener021:22163] hello from [ener021:nccl] process 0/2
> [http://ener021:22163] hello from [ener021:nccl] process 1/2
srun -N1 -n8 python helloworld.py --backend="gloo"

> [http://ener021:22165] hello from [ener021:gloo] process 0/8
> [http://ener021:22165] hello from [ener021:gloo] process 1/8
> [http://ener021:22165] hello from [ener021:gloo] process 2/8
> [http://ener021:22165] hello from [ener021:gloo] process 3/8
> [http://ener021:22165] hello from [ener021:gloo] process 4/8
> [http://ener021:22165] hello from [ener021:gloo] process 5/8
> [http://ener021:22165] hello from [ener021:gloo] process 6/8
> [http://ener021:22165] hello from [ener021:gloo] process 7/8

3 - using srun and torch.distributed.launch with ignite.distributed.Parallel

script, README

One script, both usages.

On a computing node, use torch.distributed.launch

python -m torch.distributed.launch --nproc_per_node 2 --use_env helloworld.py

> 2021-06-08 08:57:28,548 ignite.distributed.launcher.Parallel INFO: Initialized processing group with backend: 'nccl'
> 2021-06-08 08:57:28,548 ignite.distributed.launcher.Parallel INFO: - Run '<function main_fn at 0x2aac7e5bf4c0>' in 2 processes
> [http://127.0.0.1:29500] hello from [ener021:nccl] process 0/2
> [http://127.0.0.1:29500] hello from [ener021:nccl] process 1/2
> 2021-06-08 08:57:28,548 ignite.distributed.launcher.Parallel INFO: End of run
> 2021-06-08 08:57:28,548 ignite.distributed.launcher.Parallel INFO: Finalized processing group with backend: 'nccl'
python -m torch.distributed.launch --nproc_per_node 8 --use_env helloworld.py --backend="gloo"

> 2021-06-08 08:58:22,682 ignite.distributed.launcher.Parallel INFO: Initialized processing group with backend: 'gloo'
> 2021-06-08 08:58:22,683 ignite.distributed.launcher.Parallel INFO: - Run '<function main_fn at 0x2b0ae40ec4c0>' in 8 processes
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 0/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 1/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 2/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 3/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 4/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 5/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 6/8
> [http://127.0.0.1:29500] hello from [ener021:gloo] process 7/8
> 2021-06-08 08:58:22,685 ignite.distributed.launcher.Parallel INFO: End of run
> 2021-06-08 08:58:22,685 ignite.distributed.launcher.Parallel INFO: Finalized processing group with backend: 'gloo'

On the frontend, use srun (or sbatch)

srun -N1 -n2 python helloworld.py

> 2021-06-08 09:00:56,121 ignite.distributed.launcher.Parallel INFO: Initialized processing group with backend: 'nccl'
> 2021-06-08 09:00:56,121 ignite.distributed.launcher.Parallel INFO: - Run '<function main_fn at 0x2b10e3ce34c0>' in 2 processes
> [http://ener021:22182] hello from [ener021:nccl] process 0/2
> [http://ener021:22182] hello from [ener021:nccl] process 1/2
> 2021-06-08 09:00:56,132 ignite.distributed.launcher.Parallel INFO: End of run
> 2021-06-08 09:00:56,132 ignite.distributed.launcher.Parallel INFO: Finalized processing group with backend: 'nccl'
srun -N1 -n8 python helloworld.py --backend="gloo"

> 2021-06-08 09:02:26,940 ignite.distributed.launcher.Parallel INFO: Initialized processing group with backend: 'gloo'
> 2021-06-08 09:02:26,941 ignite.distributed.launcher.Parallel INFO: - Run '<function main_fn at 0x2b3c2f58f4c0>' in 8 processes
> [http://ener021:22185] hello from [ener021:gloo] process 0/8
> [http://ener021:22185] hello from [ener021:gloo] process 1/8
> [http://ener021:22185] hello from [ener021:gloo] process 2/8
> [http://ener021:22185] hello from [ener021:gloo] process 3/8
> [http://ener021:22185] hello from [ener021:gloo] process 4/8
> [http://ener021:22185] hello from [ener021:gloo] process 5/8
> [http://ener021:22185] hello from [ener021:gloo] process 6/8
> [http://ener021:22185] hello from [ener021:gloo] process 7/8
> 2021-06-08 09:02:26,946 ignite.distributed.launcher.Parallel INFO: End of run
> 2021-06-08 09:02:26,947 ignite.distributed.launcher.Parallel INFO: Finalized processing group with backend: 'gloo'

HTH

sdesrozis avatar Jun 08 '21 07:06 sdesrozis

@sdesrozis Thank you very much for the detailed experiments. I suppose there might be some incorrect SLURM configuration on my server.

sandylaker avatar Jun 08 '21 07:06 sandylaker

The useful options from my side (an example command follows the list):

  • --wckey to tie the runs to our projects
  • -J to give an authorized job name
  • -p to select the right partition
  • --gres=gpu:2 to select the GPUs
  • --mem to request more memory, because a low-memory Linux is used
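
For instance, an illustrative command combining these options (the wckey, job name, partition and memory values below are placeholders, not my actual settings):

srun --wckey=my_project -J helloworld -p gpu_partition --gres=gpu:2 --mem=16G -N1 -n2 python helloworld.py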

sdesrozis avatar Jun 08 '21 07:06 sdesrozis