meta-prompts
Faced some errors in validation part.
Traceback (most recent call last):
File "train.py", line 432, in <module>
main()
File "train.py", line 173, in main
results_dict, loss_val = validate(val_loader, model, criterion_d,
File "train.py", line 424, in validate
result_metrics[key] = ddp_logger.meters[key].global_avg
File "/home/spai/code/SD/meta-prompts/depth/utils.py", line 68, in global_avg
return self.total / self.count
ZeroDivisionError: float division by zero
Traceback (most recent call last):
File "train.py", line 432, in <module>
main()
File "train.py", line 173, in main
results_dict, loss_val = validate(val_loader, model, criterion_d,
File "train.py", line 424, in validate
result_metrics[key] = ddp_logger.meters[key].global_avg
File "/home/spai/code/SD/meta-prompts/depth/utils.py", line 68, in global_avg
return self.total / self.count
ZeroDivisionError: float division by zero
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1056935) of binary: /home/spai/anaconda3/envs/metap/bin/python3
Traceback (most recent call last):
File "/home/spai/anaconda3/envs/metap/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/spai/anaconda3/envs/metap/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/spai/anaconda3/envs/metap/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/home/spai/anaconda3/envs/metap/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/spai/anaconda3/envs/metap/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/spai/anaconda3/envs/metap/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/spai/anaconda3/envs/metap/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/spai/anaconda3/envs/metap/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-03-06_20:47:35
host : spai-WS-E900-G4-WS980T
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1056936)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-06_20:47:35
host : spai-WS-E900-G4-WS980T
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1056935)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Can you tell me how to solve it?
We have checked that following the instructions in the README for installing dependencies and preparing the dataset does not result in such an error. When modifying the code, make sure that essential parameter update steps are not removed. For example, removing the line ddp_logger.update(**computed_result) in train.py will reproduce your error.
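To illustrate why a missing update call ends in this traceback, here is a minimal sketch. The Meter and Logger classes and the summarize helper are hypothetical stand-ins written for this example, not the repository's utils.py; the only detail taken from the traceback is that global_avg returns self.total / self.count.

```python
from collections import defaultdict


class Meter:
    """Running average; a simplified stand-in, not the repository's class."""

    def __init__(self):
        self.total = 0.0
        self.count = 0.0

    def update(self, value, n=1):
        self.total += float(value) * n
        self.count += n

    @property
    def global_avg(self):
        # Same expression as the line shown in the traceback.
        return self.total / self.count


class Logger:
    """Hypothetical logger that keeps one Meter per metric name."""

    def __init__(self):
        self.meters = defaultdict(Meter)

    def update(self, **kwargs):
        for key, value in kwargs.items():
            self.meters[key].update(value)


def summarize(logger, keys, do_update, batches):
    """Mimic a validation loop; do_update=False models the removed line."""
    for computed_result in batches:
        if do_update:
            logger.update(**computed_result)
    return {key: logger.meters[key].global_avg for key in keys}


batches = [{"rmse": 0.5}, {"rmse": 0.25}]
print(summarize(Logger(), ["rmse"], True, batches))

try:
    summarize(Logger(), ["rmse"], False, batches)
except ZeroDivisionError as exc:
    # count never moved off zero, so the average cannot be computed.
    print("without update:", exc)
```

If the update call is present and the error still appears, the same reasoning suggests checking whether the validation loop runs at all, since a loop with no iterations would also leave count at zero.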
Thank you for your suggestion, it worked for me!
I have another question regarding the application of this method. Can it be used for image-to-image translation? If so, which parts should I change? Currently I am using the depth estimation pipeline with the depth upper bound removed, but I am not sure whether there is anything else I should do.
Thank you for your advice. Best wishes!
I ran into a similar error. Could you tell me how you solved it? Thank you!