chainerkfac
Error: mpirun noticed that process rank 0 with PID 0 on node NODE56 exited on signal 11 (Segmentation fault)
Hello,
I trained MNIST and CIFAR-10 successfully on a single GPU using chainerkfac. But when I use chainerkfac to train MNIST and CIFAR-10 on multiple GPUs, I get this error:
mpirun noticed that process rank 0 with PID 0 on node NODE56 exited on signal 11 (Segmentation fault).
The command I used is as follows: mpirun -np 4 python train.py --distributed
Hi @windwm, Thank you for trying our K-FAC implementation.
In my environment, I can train mnist with two GPUs using chainerkfac.
$ pip freeze
chainer==7.0.0b1
chainerkfac==0.1
cupy-cuda101==6.1.0
fastrlock==0.4
filelock==3.0.12
mpi4py==3.0.2
numpy==1.17.0rc2
protobuf==3.7.1
six==1.12.0
typing==3.6.6
typing-extensions==3.6.6
First of all, can you train MNIST with multiple GPUs without K-FAC, like this?
$ cd chainer/examples/chainermn/mnist
$ mpirun -np 2 python train_mnist.py --communicator pure_nccl --gpu
Thank you for your answer. Following your advice, I tried these commands:
cd chainerkfac/examples/mnist
mpirun -np 2 python train.py --communicator pure_nccl --gpu
Then I got the error:
usage: train.py [-h] [--batch_size BATCH_SIZE] [--num_epochs NUM_EPOCHS]
[--snapshot_interval SNAPSHOT_INTERVAL] [--no_cuda]
[--out OUT] [--resume RESUME] [--optimizer OPTIMIZER]
[--arch {mlp,cnn}] [--plot] [--distributed]
train.py: error: unrecognized arguments: --communicator pure_nccl --gpu
My environment:
chainer==7.0.0b1
chainerkfac==0.1
cupy-cuda101==6.1.0
fastrlock==0.4
filelock==3.0.12
mpi4py==3.0.2
numpy==1.17.0rc1
protobuf==3.7.1
six==1.12.0
typing==3.6.6
typing-extensions==3.6.6
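The two `pip freeze` listings in this thread can be compared mechanically instead of by eye. The following is a minimal sketch using only the pins as posted; `parse_freeze` and `diff_envs` are hypothetical helper names, not part of chainerkfac or pip. Note that `pip freeze` only covers Python packages, so it says nothing about the MPI, CUDA, or NCCL installations.

```python
# Pins copied verbatim from the two `pip freeze` listings in this thread.
maintainer_env = """
chainer==7.0.0b1
chainerkfac==0.1
cupy-cuda101==6.1.0
fastrlock==0.4
filelock==3.0.12
mpi4py==3.0.2
numpy==1.17.0rc2
protobuf==3.7.1
six==1.12.0
typing==3.6.6
typing-extensions==3.6.6
"""

reporter_env = """
chainer==7.0.0b1
chainerkfac==0.1
cupy-cuda101==6.1.0
fastrlock==0.4
filelock==3.0.12
mpi4py==3.0.2
numpy==1.17.0rc1
protobuf==3.7.1
six==1.12.0
typing==3.6.6
typing-extensions==3.6.6
"""


def parse_freeze(text):
    """Map package name -> pinned version for each `name==version` line."""
    pins = {}
    for line in text.strip().splitlines():
        name, sep, version = line.strip().partition("==")
        if sep:  # skip lines that are not simple `==` pins
            pins[name] = version
    return pins


def diff_envs(a, b):
    """Return {package: (version_in_a, version_in_b)} for differing pins."""
    pins_a, pins_b = parse_freeze(a), parse_freeze(b)
    return {
        name: (pins_a.get(name), pins_b.get(name))
        for name in sorted(set(pins_a) | set(pins_b))
        if pins_a.get(name) != pins_b.get(name)
    }


differences = diff_envs(maintainer_env, reporter_env)
for name, (mine, yours) in differences.items():
    print(f"{name}: {mine} vs {yours}")
```

For the listings above, the only differing pin is numpy (1.17.0rc2 vs 1.17.0rc1).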
Please try without chainerkfac.
I mean you can try the multi-GPU MNIST training example provided by Chainer.
Yes, I trained MNIST successfully using 2 GPUs without K-FAC. But when I try to train MNIST, CIFAR-10, and ImageNet with chainerkfac, I still get this error:
mpirun noticed that process rank 0 with PID 0 on node NODE56 exited on signal 11 (Segmentation fault).
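The mpirun message only reports that rank 0 died on signal 11; it does not show which Python call was running at the time. One possible way to narrow this down, offered as a general debugging sketch and not something the chainerkfac examples are known to do, is to enable Python's standard `faulthandler` module near the top of train.py so that each rank prints its Python traceback to stderr when it crashes.

```python
import faulthandler

# Assumption: this is placed at the very top of train.py, before any
# Chainer, CuPy, or MPI imports, so that early crashes are covered too.
# On a fatal signal such as SIGSEGV (signal 11), the interpreter dumps
# the Python traceback of every thread to stderr before exiting.
faulthandler.enable(all_threads=True)

print("faulthandler enabled:", faulthandler.is_enabled())
```

The same handler can be turned on without editing the script by passing the interpreter option `-X faulthandler`, for example `mpirun -np 4 python -X faulthandler train.py --distributed`.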