structural-transformer
Error when running with multiple GPUs
Hi,
I often run into the following error when starting multi-GPU training.
Traceback (most recent call last):
File "train.py", line 116, in <module>
main(opt)
File "train.py", line 44, in main
p.join()
File "/home/xfbai/anaconda3/envs/torch1.0/lib/python3.6/multiprocessing/process.py", line 124, in join
res = self._popen.wait(timeout)
File "/home/xfbai/anaconda3/envs/torch1.0/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/home/xfbai/anaconda3/envs/torch1.0/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
pid, sts = os.waitpid(self.pid, flag)
TypeError: signal_handler() takes 1 positional argument but 3 were given
The parameters I used are:
CUDA_VISIBLE_DEVICES=0,1 python3 train.py \
-data $data_prefix \
-save_model $model_dir \
-world_size 2 \
-gpu_ranks 0 1 \
-save_checkpoint_steps 5000 \
-valid_steps 5000 \
-report_every 20 \
-keep_checkpoint 50 \
-seed 3435 \
-train_steps 300000 \
-warmup_steps 16000 \
--share_decoder_embeddings \
-share_embeddings \
--position_encoding \
--optim adam \
-adam_beta1 0.9 \
-adam_beta2 0.98 \
-decay_method noam \
-learning_rate 0.5 \
-max_grad_norm 0.0 \
-batch_size 4096 \
-batch_type tokens \
-normalization tokens \
-dropout 0.3 \
-label_smoothing 0.1 \
-max_generator_batches 100 \
-param_init 0.0 \
-param_init_glorot \
-valid_batch_size 8
I got this error on Ubuntu 16.04, Python 3.6, PyTorch 1.0.1. Can someone help me understand what's causing it? I would really appreciate your help, thank you!
Aha, I see. This is a bug in our code when running on multiple GPUs. There is a "def signal_handler()" function in "train.py" that you need to change to "def signal_handler(self, signalnum, stackframe)". We normally use a single GPU for training.
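For anyone hitting the same thing, the reason the signature matters is that signal.signal() always invokes the registered handler with two arguments, the signal number and the current stack frame, so a bound method has to accept them in addition to self; a handler defined with only self is exactly what produces "takes 1 positional argument but 3 were given". Below is a minimal sketch of that pattern. The ErrorHandler class, its attributes, and the error_queue are my assumptions for illustration, not the exact code in train.py.

import os
import signal

class ErrorHandler(object):
    """Relays a crash in a worker process back to the parent (hypothetical sketch)."""

    def __init__(self, error_queue):
        self.error_queue = error_queue
        self.children_pids = []
        # A worker sends SIGUSR1 to the parent when it hits an exception.
        signal.signal(signal.SIGUSR1, self.signal_handler)

    def add_child(self, pid):
        self.children_pids.append(pid)

    # signal.signal() calls the handler as handler(signalnum, stackframe),
    # so a bound method needs both parameters besides self.
    # A definition like "def signal_handler(self):" is what triggers
    # "signal_handler() takes 1 positional argument but 3 were given".
    def signal_handler(self, signalnum, stackframe):
        for pid in self.children_pids:
            os.kill(pid, signal.SIGINT)  # stop the remaining workers
        raise Exception(self.error_queue.get())

The handler only forwards the failure; the actual traceback is whatever the worker put on the queue, which is why single-GPU runs (where no handler is registered per child) never exposed the bad signature.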