ViTPose
Pretrained model test does not start up on Jetson TX2 (FileNotFoundError: file "/home/_base_/default_runtime.py" does not exist)
Hello, I'm trying to test this model on Jetson TX2.
I have successfully installed all of the prerequisites, but the suggested commands for testing the pretrained model do not work.
Input command
bash tools/dist_test.sh /home/nvidia/ViTPose/configs/my_configs/ViTPose_base_simple_coco_256x192.py /home/nvidia/vitpose-weights/vitpose-b-simple.pth 1
(the username is nvidia).
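For reference, the config-loading step can be checked on its own, without the distributed launcher. A minimal sketch of what I mean (assuming mmcv is importable inside the venv; the path is the one from my setup):

# Check only the config parsing step; no GPU or distributed setup involved.
from mmcv import Config

cfg = Config.fromfile(
    '/home/nvidia/ViTPose/configs/my_configs/ViTPose_base_simple_coco_256x192.py')
print(cfg.pretty_text)  # dumps the merged config if loading succeeds

I would expect this to fail with the same FileNotFoundError shown in the output below, since the traceback points at Config.fromfile.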
Full output
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : tools/test.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_omhgsygk/none_jylhxkow
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_omhgsygk/none_jylhxkow/attempt_0/0/error.json
apex is not installed
apex is not installed
apex is not installed
Traceback (most recent call last):
File "tools/test.py", line 184, in <module>
main()
File "tools/test.py", line 90, in main
cfg = Config.fromfile(args.config)
File "/home/nvidia/mmcv/mmcv/utils/config.py", line 316, in fromfile
use_predefined_variables)
File "/home/nvidia/mmcv/mmcv/utils/config.py", line 234, in _file2dict
_cfg_dict, _cfg_text = Config._file2dict(osp.join(cfg_dir, f))
File "/home/nvidia/mmcv/mmcv/utils/config.py", line 180, in _file2dict
check_file_exist(filename)
File "/home/nvidia/mmcv/mmcv/utils/path.py", line 23, in check_file_exist
raise FileNotFoundError(msg_tmpl.format(filename))
FileNotFoundError: file "/home/_base_/default_runtime.py" does not exist
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12280) of binary: /home/nvidia/ViTPose/test_env/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_omhgsygk/none_jylhxkow/attempt_1/0/error.json
apex is not installed
apex is not installed
apex is not installed
Traceback (most recent call last):
File "tools/test.py", line 184, in <module>
main()
File "tools/test.py", line 90, in main
cfg = Config.fromfile(args.config)
File "/home/nvidia/mmcv/mmcv/utils/config.py", line 316, in fromfile
use_predefined_variables)
File "/home/nvidia/mmcv/mmcv/utils/config.py", line 234, in _file2dict
_cfg_dict, _cfg_text = Config._file2dict(osp.join(cfg_dir, f))
File "/home/nvidia/mmcv/mmcv/utils/config.py", line 180, in _file2dict
check_file_exist(filename)
File "/home/nvidia/mmcv/mmcv/utils/path.py", line 23, in check_file_exist
raise FileNotFoundError(msg_tmpl.format(filename))
FileNotFoundError: file "/home/_base_/default_runtime.py" does not exist
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12296) of binary: /home/nvidia/ViTPose/test_env/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=2
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_omhgsygk/none_jylhxkow/attempt_2/0/error.json
apex is not installed
apex is not installed
apex is not installed
Traceback (most recent call last):
File "tools/test.py", line 184, in <module>
main()
File "tools/test.py", line 90, in main
cfg = Config.fromfile(args.config)
File "/home/nvidia/mmcv/mmcv/utils/config.py", line 316, in fromfile
use_predefined_variables)
File "/home/nvidia/mmcv/mmcv/utils/config.py", line 234, in _file2dict
_cfg_dict, _cfg_text = Config._file2dict(osp.join(cfg_dir, f))
File "/home/nvidia/mmcv/mmcv/utils/config.py", line 180, in _file2dict
check_file_exist(filename)
File "/home/nvidia/mmcv/mmcv/utils/path.py", line 23, in check_file_exist
raise FileNotFoundError(msg_tmpl.format(filename))
FileNotFoundError: file "/home/_base_/default_runtime.py" does not exist
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12313) of binary: /home/nvidia/ViTPose/test_env/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=3
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_omhgsygk/none_jylhxkow/attempt_3/0/error.json
apex is not installed
apex is not installed
apex is not installed
Traceback (most recent call last):
File "tools/test.py", line 184, in <module>
main()
File "tools/test.py", line 90, in main
cfg = Config.fromfile(args.config)
File "/home/nvidia/mmcv/mmcv/utils/config.py", line 316, in fromfile
use_predefined_variables)
File "/home/nvidia/mmcv/mmcv/utils/config.py", line 234, in _file2dict
_cfg_dict, _cfg_text = Config._file2dict(osp.join(cfg_dir, f))
File "/home/nvidia/mmcv/mmcv/utils/config.py", line 180, in _file2dict
check_file_exist(filename)
File "/home/nvidia/mmcv/mmcv/utils/path.py", line 23, in check_file_exist
raise FileNotFoundError(msg_tmpl.format(filename))
FileNotFoundError: file "/home/_base_/default_runtime.py" does not exist
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12342) of binary: /home/nvidia/ViTPose/test_env/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0020799636840820312 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "12342", "role": "default", "hostname": "ubuntu", "state": "FAILED", "total_run_time": 60, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "ubuntu", "state": "SUCCEEDED", "total_run_time": 60, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 3}}
/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning:
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 12342 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record
@record
def trainer_main(args):
# do train
**********************************************************************
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 173, in <module>
main()
File "/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 169, in main
run(args)
File "/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/run.py", line 624, in run
)(*cmd_args)
File "/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
***************************************
tools/test.py FAILED
=======================================
Root Cause:
[0]:
time: 2023-03-31_11:18:11
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 12342)
error_file: <N/A>
msg: "Process failed with exitcode 1"
=======================================
Other Failures:
<NO_OTHER_FAILURES>
***************************************
Other info
It's probably worth mentioning that I'm using a pip venv for my tests.
One weird thing I have noticed is that the program is looking for the config under /home/_base_/, a directory which obviously doesn't exist. _base_ is the default value of BASE_KEY in mmcv's config.py, but I have no idea what this means (it must be related to the inner workings of mmcv, I guess).
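That said, the traceback does show mmcv building the path with osp.join(cfg_dir, f), i.e. each _base_ entry is resolved relative to the directory the config file lives in. A hypothetical reconstruction of how /home/_base_/default_runtime.py could come out of that (the _base_ entry below is an assumption on my part, copied from the layout of the stock ViTPose configs):

# Hypothetical reconstruction, not mmcv's actual code: the traceback shows
# Config._file2dict(osp.join(cfg_dir, f)) being called for each _base_ entry.
import os.path as osp

cfg_dir = '/home/nvidia/ViTPose/configs/my_configs'    # directory of my config
base_entry = '../../../../_base_/default_runtime.py'   # assumed relative _base_ entry

print(osp.normpath(osp.join(cfg_dir, base_entry)))
# -> /home/_base_/default_runtime.py, the exact path from the error

If that is what is happening, the relative _base_ paths in my copied config would need to be adjusted for its new location under configs/my_configs (or the config kept in its original directory), but I haven't verified this.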
System setup
- Python 3.6.9
- Torch 1.9.0
- CUDA 10.2
- Linux L4T 32.7.1 (based on Ubuntu 18.04)
- Jetpack 4.6.1
Any ideas?
Same. Have you solved the problem?