
Pretrained model test does not start up on Jetson TX2 (FileNotFoundError: file "/home/_base_/default_runtime.py" does not exist)

Open · Stack-it-up opened this issue 1 year ago · 1 comment

Hello, I'm trying to test this model on Jetson TX2.

I have successfully installed all of the prerequisites, but the suggested commands for testing the pretrained model do not work.

Input command

bash tools/dist_test.sh /home/nvidia/ViTPose/configs/my_configs/ViTPose_base_simple_coco_256x192.py /home/nvidia/vitpose-weights/vitpose-b-simple.pth 1

(the username is nvidia).
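For what it's worth, the step that fails (see the full output below) is the config load inside tools/test.py, so I believe the error can be reproduced without the distributed launcher by loading the config directly with mmcv. A minimal sketch, using the same config path as above:

# Sketch: exercise only the config-loading step from tools/test.py,
# bypassing torch.distributed.launch (same config path as in the command above).
from mmcv import Config

cfg = Config.fromfile(
    "/home/nvidia/ViTPose/configs/my_configs/ViTPose_base_simple_coco_256x192.py"
)
print(cfg)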

Full output

The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
 Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : tools/test.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 1
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_omhgsygk/none_jylhxkow
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[1]
  global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_omhgsygk/none_jylhxkow/attempt_0/0/error.json
apex is not installed
apex is not installed
apex is not installed
Traceback (most recent call last):
  File "tools/test.py", line 184, in <module>
    main()
  File "tools/test.py", line 90, in main
    cfg = Config.fromfile(args.config)
  File "/home/nvidia/mmcv/mmcv/utils/config.py", line 316, in fromfile
    use_predefined_variables)
  File "/home/nvidia/mmcv/mmcv/utils/config.py", line 234, in _file2dict
    _cfg_dict, _cfg_text = Config._file2dict(osp.join(cfg_dir, f))
  File "/home/nvidia/mmcv/mmcv/utils/config.py", line 180, in _file2dict
    check_file_exist(filename)
  File "/home/nvidia/mmcv/mmcv/utils/path.py", line 23, in check_file_exist
    raise FileNotFoundError(msg_tmpl.format(filename))
FileNotFoundError: file "/home/_base_/default_runtime.py" does not exist
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12280) of binary: /home/nvidia/ViTPose/test_env/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[1]
  global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_omhgsygk/none_jylhxkow/attempt_1/0/error.json
apex is not installed
apex is not installed
apex is not installed
Traceback (most recent call last):
  File "tools/test.py", line 184, in <module>
    main()
  File "tools/test.py", line 90, in main
    cfg = Config.fromfile(args.config)
  File "/home/nvidia/mmcv/mmcv/utils/config.py", line 316, in fromfile
    use_predefined_variables)
  File "/home/nvidia/mmcv/mmcv/utils/config.py", line 234, in _file2dict
    _cfg_dict, _cfg_text = Config._file2dict(osp.join(cfg_dir, f))
  File "/home/nvidia/mmcv/mmcv/utils/config.py", line 180, in _file2dict
    check_file_exist(filename)
  File "/home/nvidia/mmcv/mmcv/utils/path.py", line 23, in check_file_exist
    raise FileNotFoundError(msg_tmpl.format(filename))
FileNotFoundError: file "/home/_base_/default_runtime.py" does not exist
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12296) of binary: /home/nvidia/ViTPose/test_env/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=2
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[1]
  global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_omhgsygk/none_jylhxkow/attempt_2/0/error.json
apex is not installed
apex is not installed
apex is not installed
Traceback (most recent call last):
  File "tools/test.py", line 184, in <module>
    main()
  File "tools/test.py", line 90, in main
    cfg = Config.fromfile(args.config)
  File "/home/nvidia/mmcv/mmcv/utils/config.py", line 316, in fromfile
    use_predefined_variables)
  File "/home/nvidia/mmcv/mmcv/utils/config.py", line 234, in _file2dict
    _cfg_dict, _cfg_text = Config._file2dict(osp.join(cfg_dir, f))
  File "/home/nvidia/mmcv/mmcv/utils/config.py", line 180, in _file2dict
    check_file_exist(filename)
  File "/home/nvidia/mmcv/mmcv/utils/path.py", line 23, in check_file_exist
    raise FileNotFoundError(msg_tmpl.format(filename))
FileNotFoundError: file "/home/_base_/default_runtime.py" does not exist
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12313) of binary: /home/nvidia/ViTPose/test_env/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=3
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[1]
  global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_omhgsygk/none_jylhxkow/attempt_3/0/error.json
apex is not installed
apex is not installed
apex is not installed
Traceback (most recent call last):
  File "tools/test.py", line 184, in <module>
    main()
  File "tools/test.py", line 90, in main
    cfg = Config.fromfile(args.config)
  File "/home/nvidia/mmcv/mmcv/utils/config.py", line 316, in fromfile
    use_predefined_variables)
  File "/home/nvidia/mmcv/mmcv/utils/config.py", line 234, in _file2dict
    _cfg_dict, _cfg_text = Config._file2dict(osp.join(cfg_dir, f))
  File "/home/nvidia/mmcv/mmcv/utils/config.py", line 180, in _file2dict
    check_file_exist(filename)
  File "/home/nvidia/mmcv/mmcv/utils/path.py", line 23, in check_file_exist
    raise FileNotFoundError(msg_tmpl.format(filename))
FileNotFoundError: file "/home/_base_/default_runtime.py" does not exist
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12342) of binary: /home/nvidia/ViTPose/test_env/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0020799636840820312 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "12342", "role": "default", "hostname": "ubuntu", "state": "FAILED", "total_run_time": 60, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "ubuntu", "state": "SUCCEEDED", "total_run_time": 60, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 3}}
/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning: 

**********************************************************************
               CHILD PROCESS FAILED WITH NO ERROR_FILE                
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 12342 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train
**********************************************************************
  warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 173, in <module>
    main()
  File "/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 169, in main
    run(args)
  File "/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/run.py", line 624, in run
    )(*cmd_args)
  File "/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/nvidia/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
***************************************
          tools/test.py FAILED         
=======================================
Root Cause:
[0]:
  time: 2023-03-31_11:18:11
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 12342)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
  <NO_OTHER_FAILURES>
***************************************

Other info

It's probably worth mentioning that I'm using a pip venv for my tests.

One weird thing I have noticed is that the program is looking for a _base_ file directly under /home, which obviously doesn't exist. _base_ is the default value of BASE_KEY in mmcv's config.py, but I have no idea why it resolves to that location (it must be related to the inner workings of mmcv, I guess).
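If I read the traceback correctly, mmcv joins each _base_ entry with the directory of the config file itself (osp.join(cfg_dir, f) in config.py). Here is a rough sketch of what that would do in my case, assuming the config I copied into my_configs/ still contains the relative _base_ path used by the stock ViTPose COCO configs (an assumption on my part):

# Hypothetical illustration of the _base_ path resolution, based on the traceback above.
import os.path as osp

cfg_dir = "/home/nvidia/ViTPose/configs/my_configs"

# Assumption: the copied config still has the original relative _base_ entry,
# which was written for a config sitting four directories below configs/.
base_entry = "../../../../_base_/default_runtime.py"

# mmcv joins the entry with the config's own directory, so from my_configs/
# the four ".." components climb out of the repository all the way up to /home:
resolved = osp.abspath(osp.join(cfg_dir, base_entry))
print(resolved)  # /home/_base_/default_runtime.py -- exactly the missing file

So my guess is that the relative _base_ paths need to be adjusted whenever a config is moved to a different folder depth, but I would appreciate confirmation.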

System setup

  • Python 3.6.9
  • PyTorch 1.9.0
  • CUDA 10.2
  • L4T 32.7.1 (based on Ubuntu 18.04)
  • JetPack 4.6.1

Any ideas?

Stack-it-up · Mar 31 '23 09:03

Same. Have you solved the problem?

Anning-Li · Jul 31 '23 03:07