Pretrained models cannot replicate the experiment results in the paper under run_type=val
❓ Questions and Help
When I tried to follow the experiments and replicate the results, I found that the released pretrained models cannot reproduce the results reported in the paper under run_type=val.
CUDA_VISIBLE_DEVICES=0,1 mmf_run dataset=textvqa \
model=m4c \
config=projects/m4c/configs/textvqa/joint_with_stvqa.yaml \
env.save_dir=./save/m4c \
run_type=val \
checkpoint.resume_zoo=m4c.textvqa.with_stvqa \
env.data_dir=<data_dir_path>
When it finishes, the result is:
| INFO | mmf.train : val/total_loss: 7.2399, val/textvqa/m4c_decoding_bce_with_mask: 7.2399, val/textvqa/textvqa_accuracy: 0.3489
The difference between 0.3489 and 0.4055 (the result in the paper) is substantial.
However, when I fine-tune the pretrained model:
CUDA_VISIBLE_DEVICES=0,1 mmf_run dataset=textvqa \
model=m4c \
config=projects/m4c/configs/textvqa/joint_with_stvqa.yaml \
env.save_dir=./save/m4c \
run_type=train_val \
checkpoint.resume_zoo=m4c.textvqa.with_stvqa \
env.data_dir=<data_dir_path>
When it finishes, the result is:
| INFO | mmf.train : progress: 21000/24000, val/total_loss: 7.5186, val/textvqa/m4c_decoding_bce_with_mask: 7.5186, val/textvqa/textvqa_accuracy: 0.3992
The accuracy 0.3992 and 0.4055 (the result in the paper) are close.
Does this mean the released pretrained model is not the best model?
Hi @xzChiang, this seems weird. I just tried running this evaluation locally on a two-GPU machine and got 0.4065 accuracy:
mmf_run config=projects/m4c/configs/textvqa/joint_with_stvqa.yaml \
datasets=textvqa \
model=m4c \
run_type=val \
checkpoint.resume_zoo=m4c.textvqa.with_stvqa
Output:
val/textvqa/textvqa_accuracy: 0.4065
One thing that might make a difference is that we have updated the TextVQA features (with a slightly different feature extraction mechanism) in https://github.com/facebookresearch/mmf/pull/375 (merged around 07/02/2020). While MMF should automatically check for feature files, if you have downloaded TextVQA feature data before 07/02/2020, could you try deleting and re-downloading the TextVQA feature files? (If you delete them or move them to another location, the MMF script should automatically download them again during training or testing.)
I get the same problem. How can this be solved? Are there any configurations or settings that should be specially noticed or modified?
@akira-l Did you download the TextVQA features and M4C zoo models before #375 (merged around 07/02/2020)? If so, you can try manually deleting the old versions of the M4C zoo models and the TextVQA dataset, or moving them to another location:
# the default data location of MMF (unless you have specified it otherwise)
export MMF_DATA=~/.cache/torch/mmf/data
# remove downloaded TextVQA dataset (you can also move it to another location)
mv ${MMF_DATA}/datasets/textvqa/ ${MMF_DATA}/datasets/textvqa_obsolete/
# remove downloaded M4C zoo models (you can also move it to another location)
rm -r ${MMF_DATA}/models/m4c.*/
unset MMF_DATA
@ronghanghu, thanks for your answer. But my MMF version is recent (after 07/02/2020), so the features should not be the issue.
Additionally, when the run_type=val experiment finished, there were these warnings in the log:
| WARNING | py.warnings : .../mmf/mmf/utils/build.py:229: UserWarning: No type for scheduler specified even though lr_scheduler is True, setting default to 'Pythia'
"No type for scheduler specified even though lr_scheduler is True, "
| WARNING | py.warnings : .../mmf/mmf/utils/build.py:235: UserWarning: scheduler attributes has no params defined, defaulting to {}.
warnings.warn("scheduler attributes has no params defined, defaulting to {}.")
| INFO | mmf.train : Loading checkpoint
| INFO | mmf.train : Key data_parallel is not present in registry, returning default value of None
| INFO | mmf.train : Key distributed is not present in registry, returning default value of None
| INFO | mmf.train : Key data_parallel is not present in registry, returning default value of None
| INFO | mmf.train : Key distributed is not present in registry, returning default value of None
| WARNING | py.warnings : .../mmf/mmf/utils/checkpoint.py:230: UserWarning: 'optimizer' key is not present in the checkpoint asked to be loaded. Skipping.
"'optimizer' key is not present in the "
When the run_type=train_val experiment finished, there were similar warnings in the log:
| WARNING | py.warnings : .../mmf/mmf/utils/distributed.py:273: UserWarning: No type for scheduler specified even though lr_scheduler is True, setting default to 'Pythia'
builtin_warn(*args, **kwargs)
| WARNING | py.warnings : .../mmf/mmf/utils/distributed.py:273: UserWarning: scheduler attributes has no params defined, defaulting to {}.
builtin_warn(*args, **kwargs)
| INFO | mmf.train : Loading checkpoint
| INFO | mmf.train : Key data_parallel is not present in registry, returning default value of None
| INFO | mmf.train : Key distributed is not present in registry, returning default value of None
| INFO | mmf.train : Key data_parallel is not present in registry, returning default value of None
| INFO | mmf.train : Key distributed is not present in registry, returning default value of None
| WARNING | py.warnings : .../mmf/mmf/utils/distributed.py:273: UserWarning: 'optimizer' key is not present in the checkpoint asked to be loaded. Skipping.
builtin_warn(*args, **kwargs)
Will those warnings affect the experiment results? And how can those warnings be resolved? Thanks again.
@xzChiang @akira-l The warnings above should be fine and should not affect training or evaluation. (They are raised because the M4C model checkpoint is saved from an earlier Pythia/MMF version, and are loaded in the current version with backward compatibility.)
Meanwhile, could you verify the MD5 checksums of the feature files (via md5sum), where ${MMF_DATA} should be ~/.cache/torch/mmf/data by default:
MD5 checksum                     Feature path
================================ =========================================
14a17d6e377e9e4ea5df6c1e262b3024 ${MMF_DATA}/datasets/textvqa/defaults/features/open_images/detectron.lmdb/data.mdb
3c10b1518ee6d30f166fb0fb1256c722 ${MMF_DATA}/datasets/textvqa/ocr_en/features/ocr_en_frcn_features.lmdb/data.mdb
and also the model files:
MD5 checksum                     Model path
================================ =========================================
877a7d29125a6ed3ff57db4f40e6bd23 ${MMF_DATA}/models/m4c.textvqa.with_stvqa/m4c.pth (for zoo model 'm4c.textvqa.with_stvqa')
877a7d29125a6ed3ff57db4f40e6bd23 ${MMF_DATA}/models/m4c.textvqa.defaults/m4c.pth (if you use 'm4c.textvqa.defaults' zoo key, which should be equivalent)
to make sure that we have the same feature and model files?
If your feature and model checksums match the values above, but the evaluation accuracy is still wrong, could you let me know your environment from the output of python -m torch.utils.collect_env?
I checked the MD5 checksums. There is no mismatch in the data or the checkpoints.
@akira-l thanks for the update. Unfortunately, I still could not reproduce this error on my end. Could you share your PyTorch environment (the output of python -m torch.utils.collect_env)? Since your features are correct and up-to-date, I suspect it could be due to some changes in PyTorch or other dependencies.
I am facing the same issue. Were you able to replicate the original val accuracy of 0.4055, @akira-l @xzChiang?
@ronghanghu This is the output of python -m torch.utils.collect_env for my environment:
Collecting environment information...
PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: NVIDIA RTX A6000
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] pytorch-lightning==1.6.0.dev0
[pip3] torch==1.8.0
[pip3] torchaudio==0.8.0a0+a751e1d
[pip3] torchmetrics==0.6.2
[pip3] torchtext==0.5.0
[pip3] torchvision==0.9.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.1 h6406543_8 conda-forge
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py37h8f50634_2 conda-forge
[conda] mkl_fft 1.2.0 py37h161383b_1 conda-forge
[conda] mkl_random 1.2.0 py37h9fdb41a_1 conda-forge
[conda] numpy 1.19.2 py37h54aff64_0
[conda] numpy-base 1.19.2 py37hfa32c7d_0
[conda] pytorch 1.8.0 py3.7_cuda11.1_cudnn8.0.5_0 pytorch
[conda] pytorch-lightning 1.6.0.dev0 pypi_0 pypi
[conda] torchaudio 0.8.0 py37 pytorch
[conda] torchmetrics 0.6.2 pypi_0 pypi
[conda] torchtext 0.5.0 pypi_0 pypi
[conda] torchvision 0.9.0 py37_cu111 pytorch
Is there anything wrong here? I am getting val textvqa_accuracy: 0.3484
Does anyone have an update on this? @akira-l @ronghanghu @xzChiang