skill-chaining
MPI fails when trainer has `--wandb True`
Good day,
Given how important `wandb` is in ablation studies, it would be quite helpful to get it running without crashing the script. I understand from #1 that this does not seem to affect your side; however, it is also not an issue with MPI and `wandb` alone.
Running a test script like the following with `mpirun -n 1` is fine:
import json
import wandb

wandb_entity = "my-entity"
wandb_project = "my-project"
exclude = ["device"]

with open('~/skill-chaining/log/table_lack_0825.gail.p0.123/params.json', "r") as fp:
    cdict = json.load(fp)

wandb.init(
    resume='table_lack_0825.gail.p0.123',
    project=wandb_project,
    config={k: v for k, v in cdict.items() if k not in exclude},
    dir='~/skill-chaining/log/table_lack_0825.gail.p0.123',
    entity=wandb_entity,
    notes='',
    mode="online",
)
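For anyone re-running the test script: Python's `open()` does not expand `~`, so the paths above only work once the home directory is written out in full. Below is a minimal sketch of the same setup as a helper that resolves the path and filters the config. The name `build_wandb_kwargs` is made up for illustration and is not part of the repository; it only builds the arguments and never calls `wandb` itself.

```python
import json
import os
import tempfile


def build_wandb_kwargs(log_dir, entity, project, exclude=("device",)):
    """Build keyword arguments for wandb.init() from <log_dir>/params.json.

    Hypothetical helper for illustration; not part of skill-chaining.
    """
    # open() does not expand "~", so resolve it explicitly first.
    log_dir = os.path.expanduser(log_dir)
    with open(os.path.join(log_dir, "params.json"), "r") as fp:
        cdict = json.load(fp)
    return {
        "project": project,
        "entity": entity,
        "dir": log_dir,
        # Drop keys that should not be logged as config (e.g. "device").
        "config": {k: v for k, v in cdict.items() if k not in exclude},
        "mode": "online",
    }


if __name__ == "__main__":
    # Demo with a throwaway directory instead of a real run folder.
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "params.json"), "w") as fp:
            json.dump({"algo": "gail", "device": "cuda:0"}, fp)
        kwargs = build_wandb_kwargs(tmp, "my-entity", "my-project")
        print(kwargs["config"])  # the "device" key is filtered out
```

The returned dictionary can then be passed on as `wandb.init(**kwargs)`.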
Using MPI with `run.py` and `wandb` enabled, however, crashes the script. It is not a resource issue, nor is it an error inherent to the MPI + `wandb` pair:
$ mpirun -n 1 python -m run --algo gail --furniture_name table_lack_0825 --demo_path demos/table_lack/Sawyer_table_lack_0825_0 --num_connects 1 --run_prefix p0 --gpu 0 --wandb True --max_global_step 100000000 --wandb_entity my-entity --wandb_project my-project
pybullet build time: Apr 21 2022 20:41:06
[DEBUG] Wandb Init Before
~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:228: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
interpolation: int = Image.BILINEAR,
~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:295: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
interpolation: int = Image.NEAREST,
~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:328: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
interpolation: int = Image.BICUBIC,
wandb: Currently logged in as: my-team (use `wandb login --relogin` to force relogin)
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
getting local rank failed
--> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_init failed
--> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "No permission" (-17) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[digi2:2953274] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
Problem at: ~/skill-chaining/method/robot_learning/main.py 133 _make_log_files
Traceback (most recent call last):
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 995, in init
run = wi.init()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 648, in init
backend.cleanup()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/backend/backend.py", line 246, in cleanup
self.interface.join()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 475, in join
super().join()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface.py", line 653, in join
_ = self._communicate_shutdown()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 472, in _communicate_shutdown
_ = self._communicate(record)
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 226, in _communicate
return self._communicate_async(rec, local=local).get(timeout=timeout)
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 231, in _communicate_async
raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown
wandb: ERROR Abnormal program exit
Traceback (most recent call last):
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 995, in init
run = wi.init()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 648, in init
backend.cleanup()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/backend/backend.py", line 246, in cleanup
self.interface.join()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 475, in join
super().join()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface.py", line 653, in join
_ = self._communicate_shutdown()
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 472, in _communicate_shutdown
_ = self._communicate(record)
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 226, in _communicate
return self._communicate_async(rec, local=local).get(timeout=timeout)
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/interface/interface_shared.py", line 231, in _communicate_async
raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "~/anaconda3/envs/IKEA_1/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "~/anaconda3/envs/IKEA_1/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "~/skill-chaining/run.py", line 44, in <module>
SkillChainingRun(parser).run()
File "~/skill-chaining/run.py", line 10, in __init__
super().__init__(parser)
File "~/skill-chaining/method/robot_learning/main.py", line 44, in __init__
self._make_log_files()
File "~/skill-chaining/method/robot_learning/main.py", line 133, in _make_log_files
mode="online" if config.wandb else "disabled",
File "~/anaconda3/envs/IKEA_1/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 1033, in init
raise Exception("problem") from error_seen
Exception: problem
Any idea what could be the problem?
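One thing that might help narrow this down: `mpirun` exports launcher variables such as `OMPI_COMM_WORLD_RANK` to the process it starts, and any child process inherits them by default. Whether the `wandb` backend process ends up initializing MPI a second time is only a guess on my part and is not confirmed by the log above. The sketch below just lists those variables so the environment of `run.py` can be compared with that of the test script. The function name `mpi_launcher_vars` and the `OMPI_`/`PMIX_` prefix list are my own assumptions for illustration.

```python
import os

# Prefixes assumed to mark variables set by the Open MPI launcher.
MPI_PREFIXES = ("OMPI_", "PMIX_")


def mpi_launcher_vars(environ=None):
    """Return the environment variables whose names start with an MPI prefix."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in environ.items() if k.startswith(MPI_PREFIXES)}


if __name__ == "__main__":
    # Print whatever the launcher exported to this process, if anything.
    for name, value in sorted(mpi_launcher_vars().items()):
        print(f"{name}={value}")
```

Running this once under `mpirun -n 1` and once without it would show which variables a spawned subprocess would see.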