sagemaker-debugger FileNotFoundError when using SageMaker Debugger with PyTorch Distributed Training on SageMaker

I am using a custom Docker image to run distributed training with PyTorch on SageMaker. The training script is taken from https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Segmentation/MaskRCNN. The DLC image uses pytorch-training:1.6.0-gpu-py3 as the base image.

The error traceback follows:

[1,9]<stdout>:
[1,13]<stdout>:Traceback (most recent call last):
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/shutil.py", line 550, in move
[1,13]<stdout>:    os.rename(src, real_dst)
[1,13]<stdout>:FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents.tmp' -> '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents'
[1,13]<stdout>:
[1,13]<stdout>:During handling of the above exception, another exception occurred:
[1,13]<stdout>:
[1,13]<stdout>:Traceback (most recent call last):
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,13]<stdout>:    "__main__", mod_spec)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,13]<stdout>:    exec(code, run_globals)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,13]<stdout>:    main()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,13]<stdout>:    run_command_line(args)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,13]<stdout>:    run_path(sys.argv[0], run_name='__main__')
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,13]<stdout>:    pkg_name=pkg_name, script_name=fname)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,13]<stdout>:    mod_name, mod_spec, pkg_name, script_name)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,13]<stdout>:    exec(code, run_globals)
[1,13]<stdout>:  File "train_net.py", line 306, in <module>
[1,13]<stdout>:    main()
[1,13]<stdout>:  File "train_net.py", line 298, in main
[1,13]<stdout>:    model = train(cfg, args)
[1,13]<stdout>:  File "train_net.py", line 165, in train
[1,13]<stdout>:    per_iter_end_callback_fn=per_iter_callback_fn,
[1,13]<stdout>:  File "/root/DeepLearningExamples/PyTorch/Segmentation/MaskRCNN/pytorch/maskrcnn_benchmark/engine/trainer.py", line 78, in do_train
[1,13]<stdout>:    loss_dict = model(images, targets)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 724, in _call_impl
[1,13]<stdout>:    result = hook(self, input)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/pytorch/hook.py", line 123, in forward_pre_hook
[1,13]<stdout>:    self._close_writers()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/hook.py", line 433, in _close_writers
[1,13]<stdout>:    self.writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/writer.py", line 201, in close
[1,13]<stdout>:    self._writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfevent/event_file_writer.py", line 125, in close
[1,13]<stdout>:    self._ev_writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfevent/events_writer.py", line 63, in close
[1,13]<stdout>:    self.tfrecord_writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfrecord/record_writer.py", line 81, in close
[1,13]<stdout>:    self._writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/access_layer/file.py", line 53, in close
[1,13]<stdout>:    shutil.move(self.temp_path, self.path)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/shutil.py", line 564, in move
[1,13]<stdout>:    copy_function(src, real_dst)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/shutil.py", line 263, in copy2
[1,13]<stdout>:    copyfile(src, dst, follow_symlinks=follow_symlinks)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/shutil.py", line 120, in copyfile
[1,13]<stdout>:    with open(src, 'rb') as fsrc:
[1,13]<stdout>:FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents.tmp'
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 13 in communicator MPI COMMUNICATOR 5 DUP FROM 0
with errorcode 1.
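
One possible reading of the traceback, not confirmed anywhere in this thread: rank 13 is closing an event file named for worker_0, so more than one process may be using the same `.tmp` path, and whichever one closes second finds the temporary file already moved. Below is a minimal standard-library sketch of that kind of collision. `TmpThenMoveWriter` and `reproduce_collision` are hypothetical names for illustration, not smdebug APIs; only the pattern of writing to `<path>.tmp` and calling `shutil.move` on close is taken from the traceback.

```python
import os
import shutil
import tempfile


class TmpThenMoveWriter:
    """Hypothetical stand-in: writes to `<path>.tmp`, moves it to `<path>` on close."""

    def __init__(self, path):
        self.path = path
        self.temp_path = path + ".tmp"
        # Create (or truncate) the temporary file without keeping a handle open.
        open(self.temp_path, "wb").close()

    def write(self, data):
        # Append to the temporary file; it must still exist at this point.
        with open(self.temp_path, "ab") as fh:
            fh.write(data)

    def close(self):
        # Same final step the traceback shows in smdebug's file.py close().
        shutil.move(self.temp_path, self.path)


def reproduce_collision(base_dir):
    """Two writers share one path; return the error raised by the second close."""
    path = os.path.join(base_dir, "000000000000_worker_0.tfevents")
    first = TmpThenMoveWriter(path)
    second = TmpThenMoveWriter(path)  # same temp_path as `first`
    first.write(b"a")
    second.write(b"b")
    first.close()  # moves the shared .tmp file away
    try:
        second.close()  # the .tmp file no longer exists
    except FileNotFoundError as err:
        return err
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(repr(reproduce_collision(d)))
```

If this is what happens in the training job, the second `close()` fails in the same two stages as the log: `os.rename` raises first, then the copy fallback inside `shutil.move` raises again when it tries to open the missing source file.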

@Vikas-kum

Oct 29 '20 16:10 piyushghai

The fix for 1.6 is checked in; it avoids registering the hook for non-training activities. It is currently under review.

Nov 02 '20 18:11 leleamol

@leleamol Can you point to the fix PR?

Dec 08 '20 19:12 Vikas-kum