
[Transformer/Translation] OSError: [Errno 28] No space left on device

Open · hulihan-start opened this issue on Aug 8, 2023 · 2 comments

Related to Model/Framework(s) PyTorch/Translation/Transformer

Describe the bug
I followed the instructions in README.md but ran into this issue. Output of nvidia-smi at the time of the failure:

Tue Aug 8 23:35:31 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:1F:00.0 Off |                  N/A |
| 30%   29C    P8              30W / 350W |      6MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:21:00.0 Off |                  N/A |
| 30%   30C    P8              33W / 350W |      6MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        On  | 00000000:22:00.0 Off |                  N/A |
| 30%   31C    P8              24W / 350W |      6MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        On  | 00000000:24:00.0 Off |                  N/A |
| 30%   29C    P8              27W / 350W |      6MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

Namespace(adam_betas=[0.9, 0.997], adam_eps=1e-09, amp=True, arch='transformer_wmt_en_de_big_t2t', attention_dropout=0.1, beam=4, bpe_codes=None, buffer_size=64, clip_norm=0.0, cpu=False, criterion='label_smoothed_cross_entropy', data='/data/', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=True, distributed_rank=0, distributed_world_size=1, do_sanity_check=False, dropout=0.1, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=True, file=None, fp16=False, fuse_dropout_add=False, fuse_layer_norm=True, fuse_relu_dropout=False, gen_subset='test', label_smoothing=0.1, left_pad_source=True, left_pad_target=False, lenpen=1, local_rank=0, log_interval=500, lr=[0.000846], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=30, max_len_a=0, max_len_b=200, max_positions=(1024, 1024), max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=256, max_update=0, min_len=1, min_lr=0.0, momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_epoch_checkpoints=False, no_save=True, no_token_positional_embeddings=False, num_shards=1, online_eval=True, optimizer='adam', pad_sequence=1, path=None, prefix_size=0, print_alignment=False, quiet=False, raw_text=False, relu_dropout=0.1, remove_bpe='@@ ', replace_unk=None, restore_file='checkpoint_last.pt', sampling=False, sampling_temperature=1, sampling_topk=-1, save_dir='/results', save_interval=1, save_predictions=False, seed=1, shard_id=0, share_all_embeddings=True, share_decoder_input_output_embed=False, source_lang=None, stat_file='/results/run_log.json', target_bleu=0.0, target_lang=None, test_cased_bleu=False, train_subset='train', unkpen=0, unnormalized=False, update_freq=[1], valid_subset='valid', validate_interval=1, warmup_init_lr=0.0, warmup_updates=4000, weight_decay=0.0)

Traceback (most recent call last):
  File "/workspace/translation/train.py", line 433, in <module>
    main(ARGS)
  File "/workspace/translation/train.py", line 47, in main
    setup_logger(args)
  File "/workspace/translation/fairseq/log_helper.py", line 176, in setup_logger
    TensorBoardBackend(verbosity=1, log_dir=args.save_dir)])
  File "/workspace/translation/fairseq/log_helper.py", line 131, in __init__
    self.summary_writer = SummaryWriter(log_dir=os.path.join(log_dir, 'TB_summary'),
  File "/opt/conda/lib/python3.8/site-packages/tensorboardX/writer.py", line 300, in __init__
    self._get_file_writer()
  File "/opt/conda/lib/python3.8/site-packages/tensorboardX/writer.py", line 348, in _get_file_writer
    self.file_writer = FileWriter(logdir=self.logdir,
  File "/opt/conda/lib/python3.8/site-packages/tensorboardX/writer.py", line 104, in __init__
    self.event_writer = EventFileWriter(
  File "/opt/conda/lib/python3.8/site-packages/tensorboardX/event_file_writer.py", line 105, in __init__
    self._event_queue = multiprocessing.Queue(max_queue_size)
  File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 103, in Queue
    return Queue(maxsize, ctx=self.get_context())
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 48, in __init__
    self._sem = ctx.BoundedSemaphore(maxsize)
  File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 88, in BoundedSemaphore
    return BoundedSemaphore(value, ctx=self.get_context())
  File "/opt/conda/lib/python3.8/multiprocessing/synchronize.py", line 145, in __init__
    SemLock.__init__(self, SEMAPHORE, value, value, ctx=ctx)
  File "/opt/conda/lib/python3.8/multiprocessing/synchronize.py", line 57, in __init__
    sl = self._semlock = _multiprocessing.SemLock(
OSError: [Errno 28] No space left on device

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 429) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 765, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/workspace/translation/train.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time       : 2023-08-08_23:35:37
  host       : b0c9f9852e1d
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 429)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
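A note on where this fails: the OSError is raised while multiprocessing creates the semaphore for tensorboardX's event queue, during logger setup and before any training output is written to /results. On Linux those semaphores are backed by the tmpfs mounted at /dev/shm, and Docker typically gives a container only 64 MB there unless it is started with --shm-size or --ipc=host, so Errno 28 can appear at this point even when the data disk has plenty of room. Below is a minimal sketch that retries the exact call that failed; nothing in it is part of the repo, and it assumes it is run inside the same container as the failing job.

```python
# Re-run the constructor that raised Errno 28 in the traceback above.
# multiprocessing.Queue() allocates POSIX semaphores backed by /dev/shm, so if this
# fails too, the shortage is in shared memory rather than on the /data or /results mounts.
import multiprocessing

try:
    queue = multiprocessing.Queue(10)  # tensorboardX's EventFileWriter makes the same call
    print("Queue created fine; the ENOSPC likely comes from the log/result filesystem instead.")
except OSError as err:
    print(f"Queue creation failed: {err} -> check /dev/shm (container --shm-size)")
```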

To Reproduce
Steps to reproduce the behavior:

1. Build the environment as described in README.md. Note: in the Dockerfile, the numpy version should be changed to 1.22.1 (i.e. run pip install numpy==1.22.1).
2. Run bash scripts/run_training.sh.

Expected behavior
Training should start and run; as it is, I cannot test training at all.

Environment
Please provide at least:

  • Container version: FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:22.06-py3
  • GPUs in the system: 4x RTX 3090-24GB
  • CUDA driver version: 530.30.02

hulihan-start · Aug 08 '23 23:08

To cover the bases, can you check how much disk space is free on your device?
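For reference, a quick way to check that from inside the container, covering both the regular mounts and /dev/shm. This is only a sketch: /data and /results are the mounts this example uses, so adjust the list if yours differ.

```python
# Print free space for the filesystems the failing run touches. /dev/shm is included
# because the traceback dies while creating a multiprocessing semaphore, which lives there.
import shutil

for path in ("/", "/data", "/results", "/dev/shm"):
    try:
        usage = shutil.disk_usage(path)
    except FileNotFoundError:
        print(f"{path:10s} not present in this container")
        continue
    print(f"{path:10s} free {usage.free / 2**20:9.1f} MiB of {usage.total / 2**20:9.1f} MiB")
```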

IzzyPutterman · Aug 10 '23 21:08