chainerkfac

Error: mpirun noticed that process rank 0 with PID 0 on node NODE56 exited on signal 11 (Segmentation fault)

Open windwm opened this issue 5 years ago • 4 comments

Hello,

I can train MNIST and CIFAR-10 successfully on a single GPU using chainerkfac. But when I use chainerkfac to train MNIST and CIFAR-10 on multiple GPUs, I run into this problem:

mpirun noticed that process rank 0 with PID 0 on node NODE56 exited on signal 11 (Segmentation fault).

The command I used is as follows: mpirun -np 4 python train.py --distributed

windwm avatar Jul 15 '19 11:07 windwm

Hi @windwm, thank you for trying our K-FAC implementation.

In my environment, I can train MNIST with two GPUs using chainerkfac.

$ pip freeze
chainer==7.0.0b1
chainerkfac==0.1
cupy-cuda101==6.1.0
fastrlock==0.4
filelock==3.0.12
mpi4py==3.0.2
numpy==1.17.0rc2
protobuf==3.7.1
six==1.12.0
typing==3.6.6
typing-extensions==3.6.6

First of all, can you train MNIST on multiple GPUs without K-FAC, like this?

$ cd chainer/examples/chainermn/mnist
$ mpirun -np 2 python train_mnist.py --communicator pure_nccl --gpu

y1r avatar Jul 18 '19 01:07 y1r

Thank you for your answer. Following your advice, I tried this command:

cd chainerkfac/examples/mnist
mpirun -np 2 python train.py --communicator pure_nccl --gpu

Then I got the error:

usage: train.py [-h] [--batch_size BATCH_SIZE] [--num_epochs NUM_EPOCHS]
                [--snapshot_interval SNAPSHOT_INTERVAL] [--no_cuda]
                [--out OUT] [--resume RESUME] [--optimizer OPTIMIZER]
                [--arch {mlp,cnn}] [--plot] [--distributed]
train.py: error: unrecognized arguments: --communicator pure_nccl --gpu

My environment:

chainer==7.0.0b1
chainerkfac==0.1
cupy-cuda101==6.1.0
fastrlock==0.4
filelock==3.0.12
mpi4py==3.0.2
numpy==1.17.0rc1
protobuf==3.7.1
six==1.12.0
typing==3.6.6
typing-extensions==3.6.6

windwm avatar Jul 19 '19 06:07 windwm
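
The two `pip freeze` listings posted so far can be compared mechanically. The sketch below does that; the helper names `parse_freeze` and `diff_freeze` are hypothetical, and the package lists are copied from the two comments above. Note that `pip freeze` only covers Python packages, so the MPI, CUDA, and NCCL versions, which are not given in this thread, are outside what this comparison can show.

```python
def parse_freeze(text):
    """Map package name -> version for lines of the form name==version."""
    versions = {}
    for line in text.strip().splitlines():
        name, sep, version = line.strip().partition("==")
        if sep:  # skip anything that is not a pinned requirement
            versions[name.lower()] = version
    return versions


def diff_freeze(a, b):
    """Return {name: (version_in_a, version_in_b)} for packages that differ."""
    pa, pb = parse_freeze(a), parse_freeze(b)
    return {
        name: (pa.get(name), pb.get(name))
        for name in sorted(set(pa) | set(pb))
        if pa.get(name) != pb.get(name)
    }


# Listings copied from the two comments in this thread.
Y1R_ENV = """
chainer==7.0.0b1
chainerkfac==0.1
cupy-cuda101==6.1.0
fastrlock==0.4
filelock==3.0.12
mpi4py==3.0.2
numpy==1.17.0rc2
protobuf==3.7.1
six==1.12.0
typing==3.6.6
typing-extensions==3.6.6
"""

WINDWM_ENV = """
chainer==7.0.0b1
chainerkfac==0.1
cupy-cuda101==6.1.0
fastrlock==0.4
filelock==3.0.12
mpi4py==3.0.2
numpy==1.17.0rc1
protobuf==3.7.1
six==1.12.0
typing==3.6.6
typing-extensions==3.6.6
"""

print(diff_freeze(Y1R_ENV, WINDWM_ENV))
# {'numpy': ('1.17.0rc2', '1.17.0rc1')}
```

The only difference at the Python package level is the numpy release candidate (rc2 versus rc1), so the Python environments are close to identical.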

Please try without chainerkfac. I mean, you can try the multi-GPU MNIST training example provided by Chainer.

y1r avatar Jul 19 '19 06:07 y1r

Yes, I can train MNIST successfully using 2 GPUs without K-FAC. But when I try to train MNIST, CIFAR-10, and ImageNet with chainerkfac, I still hit this problem: mpirun noticed that process rank 0 with PID 0 on node NODE56 exited on signal 11 (Segmentation fault).

windwm avatar Jul 19 '19 08:07 windwm
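
The mpirun message only reports that rank 0 died on signal 11; it does not say where. One way to narrow this down, offered as a sketch and not as something tried in this thread, is Python's standard `faulthandler` module, which dumps the Python-level stack when the process receives SIGSEGV. The helper name `enable_crash_traceback` and its `log_dir` parameter are hypothetical; the call would go near the top of train.py, before training starts.

```python
import faulthandler
import os


def enable_crash_traceback(log_dir=None):
    """Enable faulthandler so a segmentation fault prints the Python stack.

    log_dir is a hypothetical option: when given, each process writes its
    traceback to its own file instead of sharing stderr with other ranks.
    Returns the open log file, or None when writing to stderr.
    """
    if log_dir is None:
        # Default target is stderr; all_threads=True dumps every thread.
        faulthandler.enable(all_threads=True)
        return None
    os.makedirs(log_dir, exist_ok=True)
    # One file per process; the PID keeps ranks from overwriting each other.
    path = os.path.join(log_dir, "segv_{}.log".format(os.getpid()))
    # The file must stay open for as long as faulthandler is enabled.
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

The same handler can be turned on without editing the script by passing the interpreter option, for example `mpirun -np 4 python -X faulthandler train.py --distributed`. The resulting traceback shows which Python call was active when the crash happened, which helps tell a failure inside communication code from one inside a GPU kernel launch.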