
EOFError at the end of multiprocessing

Open · lorelupo opened this issue on Jul 21, 2020 · 8 comments

Describe the bug: Receiving an EOFError during multiprocessing, at the very end of training.

Minimal runnable code to reproduce the behavior: Launching a fairseq training run with a simple transformer model on multiple GPUs. (I am aware this is not minimal at all; I hope it is enough to understand the issue.)
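
For concreteness, the Namespace dump in the log below corresponds to a command roughly like the following (a sketch reconstructed from the logged arguments; transformer_test is a custom architecture, not a stock fairseq one):

fairseq-train data/data-bin/dummy.tokenized \
    --arch transformer_test --share-all-embeddings \
    --optimizer adam --lr 0.0001 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --max-tokens 1000 --max-epoch 2 \
    --save-dir checkpoints/dummy --tensorboard-logdir checkpoints/dummy/logs \
    --log-format json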

Expected behavior: Training completes without errors.

Environment

protobuf      3.12.2
torch         1.5.1
torchvision   0.6.0a0+35d732a

Python environment

conda create --name fairenv python=3.8
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
cd anaconda3/envs/fairenv/lib/python3.8/site-packages/
conda activate fairenv
git clone https://github.com/pytorch/fairseq.git
cd fairseq
pip install --editable .
conda install -c conda-forge tensorboardx 

Log

2020-07-21 12:02:11 | INFO | fairseq.distributed_utils | distributed init (rank 0): tcp://localhost:18821
2020-07-21 12:02:11 | INFO | fairseq.distributed_utils | distributed init (rank 1): tcp://localhost:18821
2020-07-21 12:02:11 | INFO | fairseq.distributed_utils | initialized host decore1 as rank 1
2020-07-21 12:02:11 | INFO | fairseq.distributed_utils | initialized host decore1 as rank 0
2020-07-21 12:02:13 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.999)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='transformer_test', attention_dropout=0.0, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='cross_entropy', cross_self_attention=False, curriculum=0, data='data/data-bin/dummy.tokenized', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=2, decoder_embed_dim=100, decoder_embed_path=None, decoder_ffn_embed_dim=100, decoder_input_dim=100, decoder_layerdrop=0, decoder_layers=2, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=100, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:18821', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=2, distributed_wrapper='DDP', dropout=0.1, empty_cache_freq=0, encoder_attention_heads=2, encoder_embed_dim=100, encoder_embed_path=None, encoder_ffn_embed_dim=100, encoder_layerdrop=0, encoder_layers=2, encoder_layers_to_keep=None, encoder_learned_pos=False, encoder_normalize_before=False, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=None, eval_tokenized_bleu=False, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, left_pad_source='True', left_pad_target='False', load_alignments=False, localsgd_frequency=3, log_format='json', log_interval=100, lr=[0.0001], lr_scheduler='inverse_sqrt', max_epoch=2, max_sentences=None, max_sentences_valid=None, max_source_positions=1000, max_target_positions=1000, max_tokens=1000, max_tokens_valid=1000, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, model_parallel_size=1, no_cross_attention=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_scale_embedding=False, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=2, num_batch_buckets=0, num_workers=1, optimizer='adam', optimizer_overrides='{}', patience=-1, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints/dummy', save_interval=1, save_interval_updates=0, seed=0, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, source_lang=None, stop_time_hours=0, target_lang=None, task='translation', tensorboard_logdir='checkpoints/dummy/logs', threshold_loss_scale=None, tie_adaptive_weights=False, tokenizer=None, tpu=False, train_subset='train', truncate_source=False, update_freq=[1], upsample_primary=1, 
use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
2020-07-21 12:02:13 | INFO | fairseq.tasks.translation | [fr] dictionary: 4632 types
2020-07-21 12:02:13 | INFO | fairseq.tasks.translation | [en] dictionary: 4632 types
2020-07-21 12:02:13 | INFO | fairseq.data.data_utils | loaded 887 examples from: data/data-bin/dummy.tokenized/valid.fr-en.fr
2020-07-21 12:02:13 | INFO | fairseq.data.data_utils | loaded 887 examples from: data/data-bin/dummy.tokenized/valid.fr-en.en
2020-07-21 12:02:13 | INFO | fairseq.tasks.translation | data/data-bin/dummy.tokenized valid fr-en 887 examples
2020-07-21 12:02:14 | INFO | fairseq_cli.train | TransformerModel(
  (encoder): TransformerEncoder(
    (dropout_module): FairseqDropout()
    (embed_tokens): Embedding(4632, 100, padding_idx=1)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=100, out_features=100, bias=True)
          (v_proj): Linear(in_features=100, out_features=100, bias=True)
          (q_proj): Linear(in_features=100, out_features=100, bias=True)
          (out_proj): Linear(in_features=100, out_features=100, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
        (dropout_module): FairseqDropout()
        (activation_dropout_module): FairseqDropout()
        (fc1): Linear(in_features=100, out_features=100, bias=True)
        (fc2): Linear(in_features=100, out_features=100, bias=True)
        (final_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=100, out_features=100, bias=True)
          (v_proj): Linear(in_features=100, out_features=100, bias=True)
          (q_proj): Linear(in_features=100, out_features=100, bias=True)
          (out_proj): Linear(in_features=100, out_features=100, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
        (dropout_module): FairseqDropout()
        (activation_dropout_module): FairseqDropout()
        (fc1): Linear(in_features=100, out_features=100, bias=True)
        (fc2): Linear(in_features=100, out_features=100, bias=True)
        (final_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
  (decoder): TransformerDecoder(
    (dropout_module): FairseqDropout()
    (embed_tokens): Embedding(4632, 100, padding_idx=1)
    (embed_positions): SinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0): TransformerDecoderLayer(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=100, out_features=100, bias=True)
          (v_proj): Linear(in_features=100, out_features=100, bias=True)
          (q_proj): Linear(in_features=100, out_features=100, bias=True)
          (out_proj): Linear(in_features=100, out_features=100, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=100, out_features=100, bias=True)
          (v_proj): Linear(in_features=100, out_features=100, bias=True)
          (q_proj): Linear(in_features=100, out_features=100, bias=True)
          (out_proj): Linear(in_features=100, out_features=100, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=100, out_features=100, bias=True)
        (fc2): Linear(in_features=100, out_features=100, bias=True)
        (final_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerDecoderLayer(
        (dropout_module): FairseqDropout()
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=100, out_features=100, bias=True)
          (v_proj): Linear(in_features=100, out_features=100, bias=True)
          (q_proj): Linear(in_features=100, out_features=100, bias=True)
          (out_proj): Linear(in_features=100, out_features=100, bias=True)
        )
        (activation_dropout_module): FairseqDropout()
        (self_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
        (encoder_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=100, out_features=100, bias=True)
          (v_proj): Linear(in_features=100, out_features=100, bias=True)
          (q_proj): Linear(in_features=100, out_features=100, bias=True)
          (out_proj): Linear(in_features=100, out_features=100, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=100, out_features=100, bias=True)
        (fc2): Linear(in_features=100, out_features=100, bias=True)
        (final_layer_norm): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
      )
    )
    (output_projection): Linear(in_features=100, out_features=4632, bias=False)
  )
)
2020-07-21 12:02:14 | INFO | fairseq_cli.train | model transformer_test, criterion CrossEntropyCriterion
2020-07-21 12:02:14 | INFO | fairseq_cli.train | num. model params: 788400 (num. trained: 788400)
2020-07-21 12:02:14 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2020-07-21 12:02:14 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2020-07-21 12:02:14 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2020-07-21 12:02:14 | INFO | fairseq.utils | rank   0: capabilities =  6.1  ; total memory = 11.910 GB ; name = TITAN Xp                                
2020-07-21 12:02:14 | INFO | fairseq.utils | rank   1: capabilities =  6.1  ; total memory = 11.910 GB ; name = TITAN Xp                                
2020-07-21 12:02:14 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2020-07-21 12:02:14 | INFO | fairseq_cli.train | training on 2 devices (GPUs/TPUs)
2020-07-21 12:02:14 | INFO | fairseq_cli.train | max tokens per GPU = 1000 and max sentences per GPU = None
2020-07-21 12:02:14 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/dummy/checkpoint_last.pt
2020-07-21 12:02:14 | INFO | fairseq.trainer | loading train data for epoch 1
2020-07-21 12:02:14 | INFO | fairseq.data.data_utils | loaded 954 examples from: data/data-bin/dummy.tokenized/train.fr-en.fr
2020-07-21 12:02:14 | INFO | fairseq.data.data_utils | loaded 954 examples from: data/data-bin/dummy.tokenized/train.fr-en.en
2020-07-21 12:02:14 | INFO | fairseq.tasks.translation | data/data-bin/dummy.tokenized train fr-en 954 examples
2020-07-21 12:02:14 | INFO | fairseq_cli.train | begin training epoch 1
/data1/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/site-packages/fairseq/fairseq/utils.py:303: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
  warnings.warn(
/data1/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/site-packages/fairseq/fairseq/utils.py:303: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
  warnings.warn(
2020-07-21 12:02:16 | INFO | fairseq_cli.train | begin validation on "valid" subset
2020-07-21 12:02:18 | INFO | valid | {"epoch": 1, "valid_loss": "12.856", "valid_ppl": "7414.09", "valid_wps": "75523.5", "valid_wpb": "1120.8", "valid_bsz": "35.5", "valid_num_updates": "16"}
2020-07-21 12:02:18 | INFO | fairseq_cli.train | begin save checkpoint
2020-07-21 12:02:18 | INFO | fairseq.checkpoint_utils | saved checkpoint checkpoints/dummy/checkpoint1.pt (epoch 1 @ 16 updates, score 12.856) (writing took 0.673235297203064 seconds)
2020-07-21 12:02:18 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below)
2020-07-21 12:02:18 | INFO | train | {"epoch": 1, "train_loss": "12.934", "train_ppl": "7823.58", "train_wps": "5812.3", "train_ups": "4.72", "train_wpb": "1246.1", "train_bsz": "59.6", "train_num_updates": "16", "train_lr": "4.996e-07", "train_gnorm": "1.985", "train_train_wall": "1", "train_wall": "5"}
2020-07-21 12:02:18 | INFO | fairseq_cli.train | begin training epoch 1
2020-07-21 12:02:20 | INFO | fairseq_cli.train | begin validation on "valid" subset
2020-07-21 12:02:22 | INFO | valid | {"epoch": 2, "valid_loss": "12.85", "valid_ppl": "7381.96", "valid_wps": "59457.9", "valid_wpb": "1120.8", "valid_bsz": "35.5", "valid_num_updates": "32", "valid_best_loss": "12.85"}
2020-07-21 12:02:22 | INFO | fairseq_cli.train | begin save checkpoint
2020-07-21 12:02:23 | INFO | fairseq.checkpoint_utils | saved checkpoint checkpoints/dummy/checkpoint2.pt (epoch 2 @ 32 updates, score 12.85) (writing took 0.6933992877602577 seconds)
2020-07-21 12:02:23 | INFO | fairseq_cli.train | end of epoch 2 (average epoch stats below)
2020-07-21 12:02:23 | INFO | train | {"epoch": 2, "train_loss": "12.931", "train_ppl": "7807.27", "train_wps": "4502", "train_ups": "3.61", "train_wpb": "1246.1", "train_bsz": "59.6", "train_num_updates": "32", "train_lr": "8.992e-07", "train_gnorm": "1.992", "train_train_wall": "1", "train_wall": "9"}
2020-07-21 12:02:23 | INFO | fairseq_cli.train | done training in 9.2 seconds
Exception in thread Thread-3:
Exception in thread Thread-4:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/threading.py", line 932, in _bootstrap_inner
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/site-packages/tensorboardX/event_file_writer.py", line 202, in run
    self.run()
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/site-packages/tensorboardX/event_file_writer.py", line 202, in run
    data = self._queue.get(True, queue_wait_duration)
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/queues.py", line 111, in get
    data = self._queue.get(True, queue_wait_duration)
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/queues.py", line 111, in get
    res = self._recv_bytes()
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
    res = self._recv_bytes()
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    buf = self._recv(4)
  File "/home/getalp/lupol/anaconda3/envs/fairenv/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
    raise EOFError
EOFError

lorelupo · Jul 21 '20
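
The traceback points at tensorboardX's EventFileWriter background thread, which blocks in queue.get() on a multiprocessing queue; when the feeding end of the queue's underlying pipe is torn down at interpreter shutdown while the thread is still blocked, the pending recv raises EOFError. A minimal sketch of that failure mode using plain multiprocessing primitives (no tensorboardX involved):

import threading
from multiprocessing import Pipe

reader, writer = Pipe(duplex=False)

def consume():
    try:
        # Same call chain as in event_file_writer.py:
        # queue.get() -> _recv_bytes() -> _recv(4)
        reader.recv_bytes()
    except EOFError:
        print("EOFError: writing end closed while the reader was blocked")

t = threading.Thread(target=consume)
t.start()
writer.close()  # tearing down the feeding end wakes the blocked reader with EOFError
t.join()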

I get the same error with multi-process TPU training (also using a fairseq transformer model).

zcain117 · Dec 02 '20

I faced the same error. Did you solve it?

kshpv · Dec 06 '20

Any solution to this? I didn't have this issue previously.

hichiaty · Jan 07 '21

No solutions so far, unfortunately.

lorelupo · Jan 07 '21

Didn't you forget to close the writer, as suggested here? If not, maybe you can try adding a suffix specific to each writer when creating it? Something like: writer = SummaryWriter(log_dir, filename_suffix=f'_{run_id}')

LudoHackathon · Jan 11 '21
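
For reference, the suggestion above would look roughly like this (a minimal sketch assuming tensorboardX's public API; rank and the logged values are hypothetical placeholders):

from tensorboardX import SummaryWriter

rank = 0  # hypothetical per-process identifier, e.g. the distributed rank
writer = SummaryWriter('checkpoints/dummy/logs', filename_suffix=f'_{rank}')
try:
    writer.add_scalar('train/loss', 12.9, 1)  # placeholder scalar
finally:
    # close() flushes pending events and joins the background thread
    # before interpreter shutdown, which is what avoids the EOFError.
    writer.close()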

Does that only happen in a multi-GPU environment?

lanpa · Mar 14 '21

Yes, I did not have the error on a single GPU.

lorelupo · Mar 18 '21

I downgraded the tensorboardX package to 2.1 and it worked in a torch 1.7.1 / CUDA 11 environment.

jiminsun · May 13 '21
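
For anyone trying this workaround, the downgrade is a one-liner (shown with pip; conda users can pin the version analogously):

pip install tensorboardX==2.1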