[Transformer/Translation] OSError: [Errno 28] No space left on device
Related to Model/Framework(s): PyTorch/Translation/Transformer
Describe the bug
I followed the instructions in README.md but ran into this issue.
Tue Aug 8 23:35:31 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:1F:00.0 Off | N/A |
| 30% 29C P8 30W / 350W| 6MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:21:00.0 Off | N/A |
| 30% 30C P8 33W / 350W| 6MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce RTX 3090 On | 00000000:22:00.0 Off | N/A |
| 30% 31C P8 24W / 350W| 6MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce RTX 3090 On | 00000000:24:00.0 Off | N/A |
| 30% 29C P8 27W / 350W| 6MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
Namespace(adam_betas=[0.9, 0.997], adam_eps=1e-09, amp=True, arch='transformer_wmt_en_de_big_t2t', attention_dropout=0.1, beam=4, bpe_codes=None, buffer_size=64, clip_norm=0.0, cpu=False, criterion='label_smoothed_cross_entropy', data='/data/', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=True, distributed_rank=0, distributed_world_size=1, do_sanity_check=False, dropout=0.1, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=True, file=None, fp16=False, fuse_dropout_add=False, fuse_layer_norm=True, fuse_relu_dropout=False, gen_subset='test', label_smoothing=0.1, left_pad_source=True, left_pad_target=False, lenpen=1, local_rank=0, log_interval=500, lr=[0.000846], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=30, max_len_a=0, max_len_b=200, max_positions=(1024, 1024), max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=256, max_update=0, min_len=1, min_lr=0.0, momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_epoch_checkpoints=False, no_save=True, no_token_positional_embeddings=False, num_shards=1, online_eval=True, optimizer='adam', pad_sequence=1, path=None, prefix_size=0, print_alignment=False, quiet=False, raw_text=False, relu_dropout=0.1, remove_bpe='@@ ', replace_unk=None, restore_file='checkpoint_last.pt', sampling=False, sampling_temperature=1, sampling_topk=-1, save_dir='/results', save_interval=1, save_predictions=False, seed=1, shard_id=0, share_all_embeddings=True, share_decoder_input_output_embed=False, source_lang=None, stat_file='/results/run_log.json', target_bleu=0.0, target_lang=None, test_cased_bleu=False, train_subset='train', unkpen=0, unnormalized=False, update_freq=[1], valid_subset='valid', validate_interval=1, warmup_init_lr=0.0, warmup_updates=4000, weight_decay=0.0)
Traceback (most recent call last):
File "/workspace/translation/train.py", line 433, in
main(ARGS)
File "/workspace/translation/train.py", line 47, in main
setup_logger(args)
File "/workspace/translation/fairseq/log_helper.py", line 176, in setup_logger
TensorBoardBackend(verbosity=1, log_dir=args.save_dir)])
File "/workspace/translation/fairseq/log_helper.py", line 131, in init
self.summary_writer = SummaryWriter(log_dir=os.path.join(log_dir, 'TB_summary'),
File "/opt/conda/lib/python3.8/site-packages/tensorboardX/writer.py", line 300, in init
self._get_file_writer()
File "/opt/conda/lib/python3.8/site-packages/tensorboardX/writer.py", line 348, in _get_file_writer
self.file_writer = FileWriter(logdir=self.logdir,
File "/opt/conda/lib/python3.8/site-packages/tensorboardX/writer.py", line 104, in init
self.event_writer = EventFileWriter(
File "/opt/conda/lib/python3.8/site-packages/tensorboardX/event_file_writer.py", line 105, in init
self._event_queue = multiprocessing.Queue(max_queue_size)
File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 103, in Queue
return Queue(maxsize, ctx=self.get_context())
File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 48, in init
self._sem = ctx.BoundedSemaphore(maxsize)
File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 88, in BoundedSemaphore
return BoundedSemaphore(value, ctx=self.get_context())
File "/opt/conda/lib/python3.8/multiprocessing/synchronize.py", line 145, in init
SemLock.__init__(self, SEMAPHORE, value, value, ctx=ctx)
File "/opt/conda/lib/python3.8/multiprocessing/synchronize.py", line 57, in init
sl = self._semlock = _multiprocessing.SemLock(
OSError: [Errno 28] No space left on device
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 429) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 765, in
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/workspace/translation/train.py FAILED
Failures: <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2023-08-08_23:35:37
  host      : b0c9f9852e1d
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 429)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
To Reproduce
Steps to reproduce the behavior:
Note: in the Dockerfile, you should change the numpy version to 1.22.1 (i.e. run pip install numpy==1.22.1).
I built the environment by following README.md and then ran bash scripts/run_training.sh; a rough sketch of the full sequence is shown below.
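For reference, the sequence boils down to the following. The docker run line is only a placeholder for my local setup (flags and host paths are not taken from the README), while the /data and /results mount points match the paths visible in the Namespace dump above:

```bash
# Workaround applied while building the environment (see note above)
pip install numpy==1.22.1

# Launch the container built from nvcr.io/nvidia/pytorch:22.06-py3
# (placeholder flags/paths -- adapt to your machine):
# docker run --gpus all -it --rm -v <host-data-dir>:/data -v <host-results-dir>:/results <image> bash

# Inside the container, start training as described in README.md
bash scripts/run_training.sh
```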
Expected behavior
Training should start and run normally; at the moment I cannot run training at all.
Environment
Please provide at least:
- Container version: FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:22.06-py3
- GPUs in the system: 4x RTX 3090-24GB
- CUDA driver version: 530.30.02
To cover the bases, can you check how much disk space is free on your device?
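Something like the commands below, run inside the container, should narrow that down. Note that the traceback fails while creating a multiprocessing.Queue, i.e. a POSIX semaphore, which on Linux is typically backed by the tmpfs at /dev/shm; Docker caps /dev/shm at 64 MB by default unless --shm-size or --ipc=host is passed, so it is worth checking alongside the filesystem behind /results. The paths come from the log above; adjust as needed:

```bash
# Free space and inodes on the filesystems this job touches
df -h / /results /data /dev/shm
df -i / /results /dev/shm

# What currently occupies the shared-memory tmpfs
ls -lh /dev/shm | head
du -sh /dev/shm

# Minimal check of the exact call that fails in the traceback above
python -c "import multiprocessing as mp; mp.Queue(100); print('Queue created OK')"
```

If /dev/shm is the one that is full, relaunching the container with a larger shared-memory segment (e.g. docker run --shm-size=1g, or --ipc=host) is a common fix; if it is the volume behind /results, freeing space there should be enough.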