habitat-lab
Why does VQA model evaluation fall into an infinite loop?
Habitat-Lab and Habitat-Sim versions
Habitat-Lab: v0.2.1 (stable)
Habitat-Sim: v0.2.1
❓ Questions and Help
Why does `habitat_baselines.utils.common.poll_checkpoint_folder` return `None` once it has passed all the checkpoints? When `None` is returned, `BaseTrainer.eval()` falls into an infinite loop. Is this the expected behavior?
Or maybe `habitat_baselines/config/eqa/il_vqa.yaml` should contain `EVAL_CKPT_PATH_DIR: "data/eqa/vqa/checkpoints/epoch_50.ckpt"` instead of `EVAL_CKPT_PATH_DIR: "data/eqa/vqa/checkpoints/"`?
UPD 01/28/2022: hmm, it seems I've understood the idea: we want to evaluate checkpoints as they are created! But maybe it's worth pointing this out in the README, because I was at a loss when I first came across the infinite loop. I would close the issue, but the question below is still open. :(
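For readers who hit the same loop, the polling pattern described here can be sketched as follows. This is a simplified illustration, not the habitat_baselines source: the names `next_checkpoint`, `evaluate_new_checkpoints`, `max_idle_polls` and `poll_interval` are hypothetical, and the idle limit exists only so that the sketch terminates. A loop without such a limit keeps waiting for new checkpoints forever, which looks like an infinite loop once training has finished.

```python
import os
import time
from typing import List, Optional


def next_checkpoint(folder: str, prev_index: int) -> Optional[str]:
    """Return the file that follows position prev_index (oldest first), or None.

    Hypothetical helper: None means "nothing new yet", not "evaluation done".
    """
    paths = [os.path.join(folder, name) for name in os.listdir(folder)]
    # Sort by modification time; the path breaks ties deterministically.
    files = sorted(
        (p for p in paths if os.path.isfile(p)),
        key=lambda p: (os.path.getmtime(p), p),
    )
    index = prev_index + 1
    return files[index] if index < len(files) else None


def evaluate_new_checkpoints(
    folder: str, max_idle_polls: int = 3, poll_interval: float = 0.0
) -> List[str]:
    """Process checkpoints as they appear; stop after max_idle_polls empty polls.

    The idle limit is an assumption added for this sketch only.
    """
    evaluated: List[str] = []
    prev_index = -1
    idle_polls = 0
    while idle_polls < max_idle_polls:
        checkpoint = next_checkpoint(folder, prev_index)
        if checkpoint is None:
            idle_polls += 1
            time.sleep(poll_interval)
            continue
        idle_polls = 0
        prev_index += 1
        evaluated.append(checkpoint)  # a real trainer would run evaluation here
    return evaluated
```

Calling `evaluate_new_checkpoints("data/eqa/vqa/checkpoints/")` would return the checkpoint paths in the order they were written and then stop after three empty polls.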
Also, I get an error when evaluating the nav module with `python -u habitat_baselines/run.py --exp-config habitat_baselines/config/eqa/il_pacman_nav.yaml --run-type eval`:
File "/home/svyatoslav/anaconda3/envs/habitat/lib/python3.6/site-packages/torch-1.10.0-py3.6-linux-x86_64.egg/torch/utils/tensorboard/summary.py", line 490, in make_video
clip.write_gif(filename, verbose=False, progress_bar=False)
TypeError: write_gif() got an unexpected keyword argument 'verbose'
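The `TypeError` above says that the installed `write_gif` does not accept a `verbose` keyword. A quick way to confirm this kind of mismatch is to inspect the callable's signature. The sketch below is generic: `accepts_keyword` is a hypothetical helper, and the two `write_gif` stand-ins are made up for illustration and are not moviepy's real signatures.

```python
import inspect
from typing import Callable


def accepts_keyword(func: Callable, name: str) -> bool:
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins, NOT the real moviepy signatures.
def legacy_write_gif(filename, verbose=True, progress_bar=True):
    return filename


def newer_write_gif(filename, logger="bar"):
    return filename


print(accepts_keyword(legacy_write_gif, "verbose"))  # True
print(accepts_keyword(newer_write_gif, "verbose"))   # False
```

With moviepy installed, the same helper can be pointed at the library's own `write_gif` method to see whether the caller's keyword arguments match the installed version.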
It seems to be caused by `moviepy 2.0.0.dev2 pypi_0 pypi`. After I installed `moviepy=1.0.1`, the error seems to have disappeared. But now I'm getting another error:
Traceback (most recent call last):
File "habitat_baselines/run.py", line 85, in <module>
main()
File "habitat_baselines/run.py", line 40, in main
run_exp(**vars(args))
File "habitat_baselines/run.py", line 81, in run_exp
execute_exp(config, run_type)
File "habitat_baselines/run.py", line 66, in execute_exp
trainer.eval()
File "/home/svyatoslav/Internship/EQA/habitat-lab/habitat_baselines/common/base_trainer.py", line 129, in eval
checkpoint_index=prev_ckpt_ind,
File "/home/svyatoslav/Internship/EQA/habitat-lab/habitat_baselines/il/trainers/pacman_trainer.py", line 423, in _eval_checkpoint
config.IL.NAV.max_controller_actions,
File "/home/svyatoslav/Internship/EQA/habitat-lab/habitat_baselines/il/data/nav_data.py", line 237, in get_hierarchical_features_till_spawn
raw_img_feats[target_pos_idx].copy()
IndexError: index 94 is out of bounds for axis 0 with size 93
I0124 10:37:28.954592 10896 Simulator.cpp:54] Deconstructing Simulator
How can I fix it?
CC: @mukulkhanna
@TopCoder2K 0. The EQA IL part is contributed and maintained by @mukulkhanna. It may be slightly out of date with respect to recent changes and library releases, as we have no Matterport3D scenes on the Continuous Integration machines to test this part of the code.
> UPD 01/28/2022: hmm, it seems I've understood the idea: we want to evaluate checkpoints as they are created! But maybe it's worth pointing this out in the README, because I was at a loss when I first came across the infinite loop.
- Correct, that is the most common use case: the validation curve has to be produced during training, without pausing the training itself. Would you mind sending a PR with the clarification you would like to see? Thank you!
- Freezing `moviepy=1.0.1` in the dependencies sounds like a good idea, and it would be great to send it as a PR.
- For some reason, `len(raw_img_feats)` is less than `len(actions)`, or `backtrack_steps < 0`. You may need to add extra logging to understand what is causing the error.
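The extra logging suggested for the `IndexError` could start from a small check like the one below. It is only a sketch: `diagnose_feature_index` is a hypothetical helper, and it assumes the failing index is computed as `len(actions) - backtrack_steps`, which this thread does not confirm.

```python
from typing import List


def diagnose_feature_index(
    num_feats: int, num_actions: int, backtrack_steps: int
) -> List[str]:
    """Report which of the suspected conditions would make the lookup fail."""
    problems: List[str] = []
    if num_feats < num_actions:
        problems.append(
            f"len(raw_img_feats)={num_feats} < len(actions)={num_actions}"
        )
    if backtrack_steps < 0:
        problems.append(f"backtrack_steps={backtrack_steps} is negative")
    # ASSUMPTION: the target index is len(actions) - backtrack_steps.
    target_idx = num_actions - backtrack_steps
    if not 0 <= target_idx < num_feats:
        problems.append(
            f"target index {target_idx} is outside [0, {num_feats})"
        )
    return problems


# Values chosen to reproduce "index 94 is out of bounds ... with size 93".
for line in diagnose_feature_index(num_feats=93, num_actions=94, backtrack_steps=0):
    print(line)
```

Logging these three quantities right before the failing line in `nav_data.py` should show which condition actually holds for the episode that crashes.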
@mathfac Thank you for the detailed response!
- Hmm, the README.md looks really good, so maybe it's enough for this to stay here in the issues. Or maybe it's worth adding a footnote to the 'Eval' section. What do you think?
- Then I also want to ask whether you have encountered the problem of missing `Cython`, `pkgconfig` and `h5py`. I had to install them before `pip install -r requirements.txt` would finish without errors. Do they also have to be added? And do I need to commit to a special branch (such as `il_fixes`), or should I just commit to main and open a PR?
- I'll try to find the reason, as I need this functionality in my research :) If you have any other ideas, please share them!
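To check in advance which of these build prerequisites are absent from an environment, something like the following may help. It is a minimal sketch: `missing_modules` is a hypothetical helper, and the import names `Cython`, `pkgconfig` and `h5py` are assumed to match the package names mentioned above.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for the packages mentioned in this thread.
print(missing_modules(["Cython", "pkgconfig", "h5py"]))
```

An empty list means all three can be imported; any names that are printed would need to be installed before running `pip install -r requirements.txt`.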