structural-transformer
Error when running with multiple GPUs
Hi,
I often run into the following error when starting multi-GPU training.
Traceback (most recent call last):
File "train.py", line 116, in <module>
main(opt)
File "train.py", line 44, in main
p.join()
File "/home/xfbai/anaconda3/envs/torch1.0/lib/python3.6/multiprocessing/process.py", line 124, in join
res = self._popen.wait(timeout)
File "/home/xfbai/anaconda3/envs/torch1.0/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/home/xfbai/anaconda3/envs/torch1.0/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
pid, sts = os.waitpid(self.pid, flag)
TypeError: signal_handler() takes 1 positional argument but 3 were given
The parameters I used are:
CUDA_VISIBLE_DEVICES=0,1 python3 train.py \
-data $data_prefix \
-save_model $model_dir \
-world_size 2 \
-gpu_ranks 0 1 \
-save_checkpoint_steps 5000 \
-valid_steps 5000 \
-report_every 20 \
-keep_checkpoint 50 \
-seed 3435 \
-train_steps 300000 \
-warmup_steps 16000 \
--share_decoder_embeddings \
-share_embeddings \
--position_encoding \
--optim adam \
-adam_beta1 0.9 \
-adam_beta2 0.98 \
-decay_method noam \
-learning_rate 0.5 \
-max_grad_norm 0.0 \
-batch_size 4096 \
-batch_type tokens \
-normalization tokens \
-dropout 0.3 \
-label_smoothing 0.1 \
-max_generator_batches 100 \
-param_init 0.0 \
-param_init_glorot \
-valid_batch_size 8
I got this error on Ubuntu 16.04, Python 3.6, PyTorch 1.0.1. Can someone help me understand what's causing it? I would really appreciate your help, thank you!
Aha, I see. This is a bug in our code when running on multiple GPUs. There is a "def signal_handler()" function in "train.py" that you need to change to "def signal_handler(self, signalnum, stackframe)". We normally use a single GPU for training.
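For anyone hitting the same thing, the reason the signature matters is that signal.signal() always invokes the registered handler with two arguments, the signal number and the current stack frame, so a bound method has to accept them in addition to self; a handler defined with only self is exactly what produces "takes 1 positional argument but 3 were given". Below is a minimal sketch of that pattern. The ErrorHandler class, its attributes, and the error_queue are my assumptions for illustration, not the exact code in train.py.

import os
import signal

class ErrorHandler(object):
    """Relays a crash in a worker process back to the parent (hypothetical sketch)."""

    def __init__(self, error_queue):
        self.error_queue = error_queue
        self.children_pids = []
        # A worker sends SIGUSR1 to the parent when it hits an exception.
        signal.signal(signal.SIGUSR1, self.signal_handler)

    def add_child(self, pid):
        self.children_pids.append(pid)

    # signal.signal() calls the handler as handler(signalnum, stackframe),
    # so a bound method needs both parameters besides self.
    # A definition like "def signal_handler(self):" is what triggers
    # "signal_handler() takes 1 positional argument but 3 were given".
    def signal_handler(self, signalnum, stackframe):
        for pid in self.children_pids:
            os.kill(pid, signal.SIGINT)  # stop the remaining workers
        raise Exception(self.error_queue.get())

The handler only forwards the failure; the actual traceback is whatever the worker put on the queue, which is why single-GPU runs (where no handler is registered per child) never exposed the bad signature.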