tensorboardX
EOFError at the end of multiprocessing
Describe the bug Receiving an EOFError when multiprocessing, at the very end of training.
Minimal runnable code to reproduce the behavior Launching fairseq training with a simple Transformer model on multiple GPUs. (I am aware this is not minimal at all; I hope it is enough for you to understand the issue.)
Expected behavior Complete training without errors.
Environment
protobuf 3.12.2
torch 1.5.1
torchvision 0.6.0a0+35d732a
Python environment
conda create --name fairenv python=3.8
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
cd anaconda3/envs/fairenv/lib/python3.8/site-packages/
conda activate fairenv
git clone https://github.com/pytorch/fairseq.git
cd fairseq
pip install --editable .
conda install -c conda-forge tensorboardx
Log
2020-07-21 12:02:11 | INFO | fairseq.distributed_utils | distributed init (rank 0): tcp://localhost:18821
2020-07-21 12:02:11 | INFO | fairseq.distributed_utils | distributed init (rank 1): tcp://localhost:18821
2020-07-21 12:02:11 | INFO | fairseq.distributed_utils | initialized host decore1 as rank 1
2020-07-21 12:02:11 | INFO | fairseq.distributed_utils | initialized host decore1 as rank 0
2020-07-21 12:02:13 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.999)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='transformer_test', attention_dropout=0.0, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='cross_entropy', cross_self_attention=False, curriculum=0, data='data/data-bin/dummy.tokenized', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=2, decoder_embed_dim=100, decoder_embed_path=None, decoder_ffn_embed_dim=100, decoder_input_dim=100, decoder_layerdrop=0, decoder_layers=2, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=100, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:18821', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=2, distributed_wrapper='DDP', dropout=0.1, empty_cache_freq=0, encoder_attention_heads=2, encoder_embed_dim=100, encoder_embed_path=None, encoder_ffn_embed_dim=100, encoder_layerdrop=0, encoder_layers=2, encoder_layers_to_keep=None, encoder_learned_pos=False, encoder_normalize_before=False, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=None, eval_tokenized_bleu=False, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, left_pad_source='True', left_pad_target='False', load_alignments=False, localsgd_frequency=3, log_format='json', log_interval=100, lr=[0.0001], lr_scheduler='inverse_sqrt', max_epoch=2, max_sentences=None, max_sentences_valid=None, max_source_positions=1000, max_target_positions=1000, max_tokens=1000, max_tokens_valid=1000, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, model_parallel_size=1, no_cross_attention=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_scale_embedding=False, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=2, num_batch_buckets=0, num_workers=1, optimizer='adam', optimizer_overrides='{}', patience=-1, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints/dummy', save_interval=1, save_interval_updates=0, seed=0, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, source_lang=None, stop_time_hours=0, target_lang=None, task='translation', tensorboard_logdir='checkpoints/dummy/logs', threshold_loss_scale=None, tie_adaptive_weights=False, tokenizer=None, tpu=False, train_subset='train', truncate_source=False, update_freq=[1], upsample_primary=1, 
use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
2020-07-21 12:02:13 | INFO | fairseq.tasks.translation | [fr] dictionary: 4632 types
2020-07-21 12:02:13 | INFO | fairseq.tasks.translation | [en] dictionary: 4632 types
2020-07-21 12:02:13 | INFO | fairseq.data.data_utils | loaded 887 examples from: data/data-bin/dummy.tokenized/valid.fr-en.fr
2020-07-21 12:02:13 | INFO | fairseq.data.data_utils | loaded 887 examples from: data/data-bin/dummy.tokenized/valid.fr-en.en
2020-07-21 12:02:13 | INFO | fairseq.tasks.translation | data/data-bin/dummy.tokenized valid fr-en 887 examples
2020-07-21 12:02:14 | INFO | fairseq_cli.train | TransformerModel(
(encoder): TransformerEncoder(
(dropout_module): FairseqDropout()
(embed_tokens): Embedding(4632, 100, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=100, out_features=100, bias=True)
(v_proj): Linear(in_features=100, out_features=100, bias=True)
(q_proj): Linear(in_features=100, out_features=100, bias=True)
(out_proj): Linear(in_features=100, out_features=100, bias=True)
)
(self_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=100, out_features=100, bias=True)
(fc2): Linear(in_features=100, out_features=100, bias=True)
(final_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
)
(1): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=100, out_features=100, bias=True)
(v_proj): Linear(in_features=100, out_features=100, bias=True)
(q_proj): Linear(in_features=100, out_features=100, bias=True)
(out_proj): Linear(in_features=100, out_features=100, bias=True)
)
(self_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
(dropout_module): FairseqDropout()
(activation_dropout_module): FairseqDropout()
(fc1): Linear(in_features=100, out_features=100, bias=True)
(fc2): Linear(in_features=100, out_features=100, bias=True)
(final_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
)
)
)
(decoder): TransformerDecoder(
(dropout_module): FairseqDropout()
(embed_tokens): Embedding(4632, 100, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=100, out_features=100, bias=True)
(v_proj): Linear(in_features=100, out_features=100, bias=True)
(q_proj): Linear(in_features=100, out_features=100, bias=True)
(out_proj): Linear(in_features=100, out_features=100, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=100, out_features=100, bias=True)
(v_proj): Linear(in_features=100, out_features=100, bias=True)
(q_proj): Linear(in_features=100, out_features=100, bias=True)
(out_proj): Linear(in_features=100, out_features=100, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=100, out_features=100, bias=True)
(fc2): Linear(in_features=100, out_features=100, bias=True)
(final_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
)
(1): TransformerDecoderLayer(
(dropout_module): FairseqDropout()
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=100, out_features=100, bias=True)
(v_proj): Linear(in_features=100, out_features=100, bias=True)
(q_proj): Linear(in_features=100, out_features=100, bias=True)
(out_proj): Linear(in_features=100, out_features=100, bias=True)
)
(activation_dropout_module): FairseqDropout()
(self_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=100, out_features=100, bias=True)
(v_proj): Linear(in_features=100, out_features=100, bias=True)
(q_proj): Linear(in_features=100, out_features=100, bias=True)
(out_proj): Linear(in_features=100, out_features=100, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=100, out_features=100, bias=True)
(fc2): Linear(in_features=100, out_features=100, bias=True)
(final_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
)
)
(output_projection): Linear(in_features=100, out_features=4632, bias=False)
)
)
2020-07-21 12:02:14 | INFO | fairseq_cli.train | model transformer_test, criterion CrossEntropyCriterion
2020-07-21 12:02:14 | INFO | fairseq_cli.train | num. model params: 788400 (num. trained: 788400)
2020-07-21 12:02:14 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2020-07-21 12:02:14 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2020-07-21 12:02:14 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2020-07-21 12:02:14 | INFO | fairseq.utils | rank 0: capabilities = 6.1 ; total memory = 11.910 GB ; name = TITAN Xp
2020-07-21 12:02:14 | INFO | fairseq.utils | rank 1: capabilities = 6.1 ; total memory = 11.910 GB ; name = TITAN Xp
2020-07-21 12:02:14 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2020-07-21 12:02:14 | INFO | fairseq_cli.train | training on 2 devices (GPUs/TPUs)
2020-07-21 12:02:14 | INFO | fairseq_cli.train | max tokens per GPU = 1000 and max sentences per GPU = None
2020-07-21 12:02:14 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/dummy/checkpoint_last.pt
2020-07-21 12:02:14 | INFO | fairseq.trainer | loading train data for epoch 1
2020-07-21 12:02:14 | INFO | fairseq.data.data_utils | loaded 954 examples from: data/data-bin/dummy.tokenized/train.fr-en.fr
2020-07-21 12:02:14 | INFO | fairseq.data.data_utils | loaded 954 examples from: data/data-bin/dummy.tokenized/train.fr-en.en
2020-07-21 12:02:14 | INFO | fairseq.tasks.translation | data/data-bin/dummy.tokenized train fr-en 954 examples
2020-07-21 12:02:14 | INFO | fairseq_cli.train | begin training epoch 1
/data1/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/site-packages/fairseq/fairseq/utils.py:303: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
warnings.warn(
/data1/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/site-packages/fairseq/fairseq/utils.py:303: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
warnings.warn(
2020-07-21 12:02:16 | INFO | fairseq_cli.train | begin validation on "valid" subset
2020-07-21 12:02:18 | INFO | valid | {"epoch": 1, "valid_loss": "12.856", "valid_ppl": "7414.09", "valid_wps": "75523.5", "valid_wpb": "1120.8", "valid_bsz": "35.5", "valid_num_updates": "16"}
2020-07-21 12:02:18 | INFO | fairseq_cli.train | begin save checkpoint
2020-07-21 12:02:18 | INFO | fairseq.checkpoint_utils | saved checkpoint checkpoints/dummy/checkpoint1.pt (epoch 1 @ 16 updates, score 12.856) (writing took 0.673235297203064 seconds)
2020-07-21 12:02:18 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below)
2020-07-21 12:02:18 | INFO | train | {"epoch": 1, "train_loss": "12.934", "train_ppl": "7823.58", "train_wps": "5812.3", "train_ups": "4.72", "train_wpb": "1246.1", "train_bsz": "59.6", "train_num_updates": "16", "train_lr": "4.996e-07", "train_gnorm": "1.985", "train_train_wall": "1", "train_wall": "5"}
2020-07-21 12:02:18 | INFO | fairseq_cli.train | begin training epoch 1
2020-07-21 12:02:20 | INFO | fairseq_cli.train | begin validation on "valid" subset
2020-07-21 12:02:22 | INFO | valid | {"epoch": 2, "valid_loss": "12.85", "valid_ppl": "7381.96", "valid_wps": "59457.9", "valid_wpb": "1120.8", "valid_bsz": "35.5", "valid_num_updates": "32", "valid_best_loss": "12.85"}
2020-07-21 12:02:22 | INFO | fairseq_cli.train | begin save checkpoint
2020-07-21 12:02:23 | INFO | fairseq.checkpoint_utils | saved checkpoint checkpoints/dummy/checkpoint2.pt (epoch 2 @ 32 updates, score 12.85) (writing took 0.6933992877602577 seconds)
2020-07-21 12:02:23 | INFO | fairseq_cli.train | end of epoch 2 (average epoch stats below)
2020-07-21 12:02:23 | INFO | train | {"epoch": 2, "train_loss": "12.931", "train_ppl": "7807.27", "train_wps": "4502", "train_ups": "3.61", "train_wpb": "1246.1", "train_bsz": "59.6", "train_num_updates": "32", "train_lr": "8.992e-07", "train_gnorm": "1.992", "train_train_wall": "1", "train_wall": "9"}
2020-07-21 12:02:23 | INFO | fairseq_cli.train | done training in 9.2 seconds
Exception in thread Thread-3:
Exception in thread Thread-4:
Traceback (most recent call last):
Traceback (most recent call last):
File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/threading.py", line 932, in _bootstrap_inner
File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/site-packages/tensorboardX/event_file_writer.py", line 202, in run
self.run()
File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/site-packages/tensorboardX/event_file_writer.py", line 202, in run
data = self._queue.get(True, queue_wait_duration)
File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/queues.py", line 111, in get
data = self._queue.get(True, queue_wait_duration)
File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/queues.py", line 111, in get
res = self._recv_bytes()
File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
res = self._recv_bytes()
File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
buf = self._recv(4)
File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
raise EOFError
EOFError
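For context on the traceback above: the two failing threads are tensorboardX's EventFileWriter consumer threads, which sit in self._queue.get(...) waiting for summary data. That queue is a multiprocessing queue, and EOFError is raised when the pipe underneath it is closed (here, seemingly at process teardown) while the thread is still blocked on it. Below is a minimal sketch of that pattern, not the actual fairseq code path: each spawned worker opens a SummaryWriter and returns without closing it.

# Minimal sketch (NOT the fairseq code path): each spawned worker opens a
# SummaryWriter and returns without calling close(), so the writer's background
# thread can still be blocked in queue.get() when the process shuts down.
import torch.multiprocessing as mp
from tensorboardX import SummaryWriter

def worker(rank, logdir):
    writer = SummaryWriter(logdir)  # one writer per spawned process
    for step in range(100):
        writer.add_scalar("loss", 1.0 / (step + 1), step)
    # no writer.close() here -- the process exits with the writer still open

if __name__ == "__main__":
    mp.spawn(worker, args=("checkpoints/dummy/logs",), nprocs=2, join=True)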
I get the same error when using multiprocess TPU training (also with a fairseq Transformer model).
I faced the same error; did you solve it?
Any solution to this? I didn't have this issue previously.
No solutions so far, unfortunately.
Didn't you forget to close the writer, as suggested here?
If not, maybe you can try adding a suffix specific to each writer when creating it? Something like:
writer = SummaryWriter(log_dir, filename_suffix=f'_{run_id}')
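Putting both suggestions together, here is a rough sketch of what each worker could do (rank, like run_id above, is just a placeholder for whatever per-process identifier you have; this is not fairseq's actual logging code): give every process its own event-file suffix and close the writer explicitly before the process exits, so the background writing thread shuts down cleanly.

# Sketch of the suggested fix; names like rank and train_worker are placeholders.
from tensorboardX import SummaryWriter

def train_worker(rank, logdir):
    # one writer per process, with a per-process filename suffix
    writer = SummaryWriter(logdir, filename_suffix=f"_{rank}")
    try:
        for step in range(100):
            writer.add_scalar("loss", 1.0 / (step + 1), step)
    finally:
        writer.close()  # shuts down the event-writing thread before the process exits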
Does that only happen in a multi-GPU environment?
Yes, I did not have the error on a single GPU.
I downgraded the tensorboardX package to 2.1 and it worked in a torch 1.7.1 and CUDA 11 environment.