TimeoutError when attempting to run the pre-trained RoboTHOR model checkpoint

Open dtch1997 opened this issue 2 years ago • 11 comments

Problem

Unable to run the pre-trained RoboTHOR PointNav model checkpoint in evaluation mode: the run fails with a TimeoutError before any evaluation starts.

Steps to reproduce

I followed all the instructions at https://allenact.org/tutorials/running-inference-on-a-pretrained-model/ and then ran:

PYTHONPATH=. python allenact/main.py \
training_a_pointnav_model \
-o pretrained_model_ckpts/robothor-pointnav-rgb-resnet/ \
-b projects/tutorials \
-c pretrained_model_ckpts/robothor-pointnav-rgb-resnet/checkpoints/PointNavRobothorRGBPPO/2020-08-31_12-13-30/exp_PointNavRobothorRGBPPO__stage_00__steps_000039031200.pt \
--eval

Got the error:

[02/10 14:58:23 ERROR:] Traceback (most recent call last):
  File "/home/daniel/Documents/github/minigrid_experiments/third-party/allenact/allenact/algorithms/onpolicy_sync/engine.py", line 1992, in process_checkpoints
    eval_package = self.run_eval(
  File "/home/daniel/Documents/github/minigrid_experiments/third-party/allenact/allenact/algorithms/onpolicy_sync/engine.py", line 1782, in run_eval
    num_paused = self.initialize_storage_and_viz(
  File "/home/daniel/Documents/github/minigrid_experiments/third-party/allenact/allenact/algorithms/onpolicy_sync/engine.py", line 455, in initialize_storage_and_viz
    observations = self.vector_tasks.get_observations()
  File "/home/daniel/Documents/github/minigrid_experiments/third-party/allenact/allenact/algorithms/onpolicy_sync/engine.py", line 309, in vector_tasks
    self._vector_tasks = VectorSampledTasks(
  File "/home/daniel/Documents/github/minigrid_experiments/third-party/allenact/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 234, in __init__
    observation_spaces = [
  File "/home/daniel/Documents/github/minigrid_experiments/third-party/allenact/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 237, in <listcomp>
    for space in read_fn(timeout_to_use=5 * self.read_timeout if self.read_timeout is not None else None)  # type: ignore
  File "/home/daniel/Documents/github/minigrid_experiments/third-party/allenact/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 272, in read_with_timeout
    raise TimeoutError(
TimeoutError: Did not receive output from `VectorSampledTask` worker for 300 seconds.
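
The traceback shows the main process giving up while waiting for a worker to report its observation space. The 300 seconds matches `5 * self.read_timeout` if `read_timeout` is 60; that value is inferred from the traceback, not confirmed. The following is a minimal sketch of the general read-with-timeout pattern, written with Python's standard `multiprocessing.Pipe`. It is not AllenAct's implementation: the `read_with_timeout` helper below is hypothetical and only borrows the name seen in the traceback.

```python
from multiprocessing import Pipe
from typing import Any, Optional


def read_with_timeout(conn, timeout: Optional[float]) -> Any:
    """Return the next message on `conn`, or raise TimeoutError if none arrives in time.

    Hypothetical helper for illustration; not the AllenAct function of the same name.
    """
    # poll(None) blocks indefinitely; poll(t) returns False after t seconds of silence.
    if not conn.poll(timeout):
        raise TimeoutError(
            f"Did not receive output from worker for {timeout} seconds."
        )
    return conn.recv()


# Both ends live in one process here to keep the sketch self-contained.
parent_conn, worker_conn = Pipe()

# A healthy worker replies, so the read succeeds.
worker_conn.send("observation_space")
first = read_with_timeout(parent_conn, timeout=1.0)

# A worker that is stuck (or has crashed) never replies, so the read times out.
try:
    read_with_timeout(parent_conn, timeout=0.1)
    timed_out = False
except TimeoutError:
    timed_out = True
```

In this pattern the TimeoutError is only a symptom: it says the worker never answered, not why. The underlying failure is therefore more likely to be found in the worker process (for example, in whatever it does before it can first reply) than in the code that raised the exception.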

Expected behavior

Inference runs to completion and metrics are saved to TensorBoard.

Desktop

  • OS: Ubuntu 22.04
  • AllenAct Version: commit 24907f16cd6aace1abb2fef90c8e8667859c38b8

Additional context

Running Python 3.8 in an Anaconda environment.

dtch1997, Feb 10 '23 15:02