sam-hq
Data loader throwing FileNotFoundError after a few epochs of training
I've used the training command, but every time, after a random number of epochs, I get a FileNotFoundError from the data loader. Does anyone know the solution?
error:
epoch: 14 learning rate: 1e-05
[ 0/333] eta: 0:14:51 training_loss: 0.1127 (0.1127) loss_mask: 0.0446 (0.0446) loss_dice: 0.0681 (0.0681) time: 2.6786 data: 0.3379 max mem: 10103
Traceback (most recent call last):
File "/content/drive/MyDrive/sam-hq/train/train.py", line 651, in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2600) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2023-08-01_11:51:44
  host      : 6198cb800e23
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2600)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
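The `traceback : To enable traceback` hint in that output refers to wrapping the training entrypoint with the `record` decorator from `torch.distributed.elastic`, so the failing child process reports the full Python traceback (including the actual FileNotFoundError message and the offending path) instead of just an exit code. A minimal sketch, assuming your entrypoint is a `main()` function in train.py:

```python
# Sketch: surface the child process's full traceback under torchrun.
from torch.distributed.elastic.multiprocessing.errors import record

@record  # on failure, the child's traceback is propagated to the launcher output
def main():
    # ... existing training loop from train.py ...
    pass

if __name__ == "__main__":
    main()
```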
This looks like a data path issue. You can check whether the image at /content/drive/MyDrive/Iris-and-Needle-Segmentation-3/train/images/SID0615_jpg.rf.8dd4aeb70ce910df9c8716e3af21b2cd.jpg is still there. Or has your Google Drive been disconnected?
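To rule that out quickly, a small check like the one below (just a sketch; the paths are the ones mentioned in this thread and should be adapted to your dataset) confirms that the Drive mount is alive and the file the loader asked for is still readable:

```python
import os

# Sketch: sanity-check the dataset location named in the error before resuming.
drive_mount = "/content/drive"
img_path = ("/content/drive/MyDrive/Iris-and-Needle-Segmentation-3/"
            "train/images/SID0615_jpg.rf.8dd4aeb70ce910df9c8716e3af21b2cd.jpg")
img_dir = os.path.dirname(img_path)

print("drive mounted  :", os.path.ismount(drive_mount))
print("image exists   :", os.path.isfile(img_path))

# Also flag zero-byte files, which a flaky Drive sync can leave behind and
# which the data loader will fail to read mid-epoch.
empty = [f for f in os.listdir(img_dir)
         if os.path.getsize(os.path.join(img_dir, f)) == 0]
print("zero-byte files:", empty)
```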