modulus-sym
🐛[BUG]: SequentialSolver breaking when executed in parallel
Version
1.2.0
On which installation method(s) does this occur?
Pip
Describe the issue
When the taylor_green.py example is launched in parallel with the SLURM command srun, the solver crashes. I'm running the example on 4 NVIDIA V100 16 GB GPUs on a single node.
Minimum reproducible example
srun --ntasks-per-node 4 python3 taylor_green.py
Relevant log output
rm: cannot remove './outputs': No such file or directory
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run
$ conda init <SHELL_NAME>
Currently supported shells are:
- bash
- fish
- tcsh
- xonsh
- zsh
- powershell
See 'conda init --help' for more information and options.
IMPORTANT: You may need to close and restart your shell after running 'conda init'.
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [jwc09n000i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [jwc09n000i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [jwc09n000i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [jwc09n000i.juwels]:7010 (errno: 97 - Address family not supported by protocol).
/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
[The warning above is printed once by each of the four ranks.]
Error executing job with overrides: []
Traceback (most recent call last):
  File "/p/project/rugshas/villalobos1/PINN/modulus-sym/examples/taylor_green/taylor_green.py", line 166, in <module>
    run()
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/modulus/sym/hydra/utils.py", line 104, in func_decorated
    _run_hydra(
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/p/project/rugshas/villalobos1/PINN/modulus-sym/examples/taylor_green/taylor_green.py", line 162, in run
    slv.solve()
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/modulus/sym/solver/sequential.py", line 138, in solve
    self._train_loop(sigterm_handler)
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/modulus/sym/trainer.py", line 535, in _train_loop
    loss, losses = self._cuda_graph_training_step(step)
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/modulus/sym/trainer.py", line 716, in _cuda_graph_training_step
    self.loss_static, self.losses_static = self.compute_gradients(
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/modulus/sym/trainer.py", line 68, in adam_compute_gradients
    losses_minibatch = self.compute_losses(step)
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/modulus/sym/solver/solver.py", line 66, in compute_losses
    return self.domain.compute_losses(step)
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/modulus/sym/domain/domain.py", line 147, in compute_losses
    constraint.forward()
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/modulus/sym/domain/constraint/continuous.py", line 130, in forward
    self._output_vars = self.model(self._input_vars)
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1515, in forward
    inputs, kwargs = self._pre_forward(*inputs, **kwargs)
  File "/p/project/rugshas/villalobos1/miniconda3/envs/modulus/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1409, in _pre_forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
[Ranks 1, 2, and 3 print the same traceback and RuntimeError, each likewise reporting that parameter indices 0-19 did not receive gradients. The four tracebacks were interleaved in the raw output and have been de-interleaved here; only rank 0's copy is shown.]
srun: error: jwc09n000: tasks 0-3: Exited with exit code 1
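For anyone who hits the same failure: the RuntimeError above is PyTorch's generic DDP unused-parameter error, and its message points at `find_unused_parameters=True`. Below is a minimal, self-contained sketch of that flag in plain PyTorch, included only to make the suggested mitigation concrete. In Modulus Sym the `DistributedDataParallel` wrapping happens inside the trainer, so the helper name, model, and process-group setup here are illustrative assumptions, not Modulus code.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# The error message also suggests this debug variable; it must be set
# before the process group is initialized to take effect.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"


def wrap_model(model: torch.nn.Module) -> DDP:
    """Hypothetical helper: Modulus Sym does this wrapping internally."""
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    return DDP(
        model.cuda(),
        device_ids=[local_rank],
        # Tolerate parameters that receive no gradient in an iteration,
        # which is exactly the condition the RuntimeError complains about.
        find_unused_parameters=True,
    )
```

I haven't verified whether Modulus 1.2.0 exposes this flag (or an equivalent config option) for its internal DDP wrapper, so the snippet is only a reference for what the error message is asking for.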
Environment details
I'm running on the JUWELS supercomputer at FZJ, with Modulus 1.2.0 installed via pip.
Other/Misc.
No response