
Pretrained models cannot replicate the experiment results in the paper under run_type val

SoundingSilence opened this issue 3 years ago · 10 comments

❓ Questions and Help

When I tried to follow the experiment and replicate its results, I found that the released pretrained models cannot reproduce the results reported in the paper under run_type val.

CUDA_VISIBLE_DEVICES=0,1 mmf_run dataset=textvqa \
    model=m4c \
    config=projects/m4c/configs/textvqa/joint_with_stvqa.yaml \
    env.save_dir=./save/m4c \
    run_type=val \
    checkpoint.resume_zoo=m4c.textvqa.with_stvqa \
    env.data_dir=<data_dir_path>

When it finishes, the result is:

| INFO | mmf.train : val/total_loss: 7.2399, val/textvqa/m4c_decoding_bce_with_mask: 7.2399, val/textvqa/textvqa_accuracy: 0.3489

The gap between 0.3489 and 0.4055 (the result in the paper) is significant.

However, when I 'finetune' the pretrained model:

CUDA_VISIBLE_DEVICES=0,1 mmf_run dataset=textvqa \
    model=m4c \
    config=projects/m4c/configs/textvqa/joint_with_stvqa.yaml \
    env.save_dir=./save/m4c \
    run_type=train_val \
    checkpoint.resume_zoo=m4c.textvqa.with_stvqa \
    env.data_dir=<data_dir_path>

the result when it finishes is:

| INFO | mmf.train : progress: 21000/24000, val/total_loss: 7.5186, val/textvqa/m4c_decoding_bce_with_mask: 7.5186, val/textvqa/textvqa_accuracy: 0.3992

The accuracy 0.3992 is close to 0.4055 (the result in the paper). Does this mean the released pretrained model is not the best model?

SoundingSilence avatar Sep 03 '20 05:09 SoundingSilence

Hi @xzChiang, this seems weird. I just tried running this evaluation locally on a two-GPU machine and got 0.4065 accuracy:

mmf_run config=projects/m4c/configs/textvqa/joint_with_stvqa.yaml \
    datasets=textvqa \
    model=m4c \
    run_type=val \
    checkpoint.resume_zoo=m4c.textvqa.with_stvqa

Output:

val/textvqa/textvqa_accuracy: 0.4065

One thing that might make a difference is that we have updated the TextVQA features (with a slightly different feature extraction mechanism) in https://github.com/facebookresearch/mmf/pull/375 (merged around 07/02/2020). MMF will not re-download feature files that are already present, so if you downloaded the TextVQA feature data before 07/02/2020, could you try deleting and re-downloading it? (If you delete the files or move them to another location, the MMF script should automatically download them again during training or testing.)
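A quick way to tell whether your local copy predates that update is to check the file timestamps. A minimal sketch, assuming the default MMF data directory (adjust the path if you set env.data_dir):

# assuming the default MMF data directory; adjust if you set env.data_dir
ls -l --time-style=long-iso \
    ~/.cache/torch/mmf/data/datasets/textvqa/defaults/features/open_images/detectron.lmdb/data.mdb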

ronghanghu avatar Sep 03 '20 17:09 ronghanghu

I get the same problem. How can I solve it? Are there any configs or settings that need special attention or modification?

akira-l avatar Sep 04 '20 05:09 akira-l

@akira-l Did you download the TextVQA features and M4C zoo models before #375 (merged around 07/02/2020)? If so, you can try manually deleting the old versions of the M4C zoo models and the TextVQA dataset, or moving them to another location:

# the default data location of MMF (unless you have specified it otherwise)
export MMF_DATA=~/.cache/torch/mmf/data

# move the downloaded TextVQA dataset out of the way (or delete it instead)
mv ${MMF_DATA}/datasets/textvqa/ ${MMF_DATA}/datasets/textvqa_obsolete/
# remove the downloaded M4C zoo models (you can also move them to another location)
rm -r ${MMF_DATA}/models/m4c.*/

unset MMF_DATA
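
After that, re-running the evaluation should download fresh copies automatically, e.g. with the same command as in my earlier comment:

mmf_run config=projects/m4c/configs/textvqa/joint_with_stvqa.yaml \
    datasets=textvqa \
    model=m4c \
    run_type=val \
    checkpoint.resume_zoo=m4c.textvqa.with_stvqa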

ronghanghu avatar Sep 04 '20 05:09 ronghanghu

@ronghanghu, thanks for your answer. But my MMF version is recent (after 07/02/2020), so stale features shouldn't be the issue. Additionally, when the run_type=val experiment finished, there were warnings in the log:

| WARNING | py.warnings : .../mmf/mmf/utils/build.py:229: UserWarning: No type for scheduler specified even though lr_scheduler is True, setting default to 'Pythia'
  "No type for scheduler specified even though lr_scheduler is True, "

| WARNING | py.warnings : .../mmf/mmf/utils/build.py:235: UserWarning: scheduler attributes has no params defined, defaulting to {}.
  warnings.warn("scheduler attributes has no params defined, defaulting to {}.")
 | INFO | mmf.train : Loading checkpoint
 | INFO | mmf.train : Key data_parallel is not present in registry, returning default value of None
 | INFO | mmf.train : Key distributed is not present in registry, returning default value of None
 | INFO | mmf.train : Key data_parallel is not present in registry, returning default value of None
 | INFO | mmf.train : Key distributed is not present in registry, returning default value of None
| WARNING | py.warnings : .../mmf/mmf/utils/checkpoint.py:230: UserWarning: 'optimizer' key is not present in the checkpoint asked to be loaded. Skipping.
  "'optimizer' key is not present in the "

When the run_type=train_val experiment finished, there were similar warnings in the log:

 | WARNING | py.warnings : .../mmf/mmf/utils/distributed.py:273: UserWarning: No type for scheduler specified even though lr_scheduler is True, setting default to 'Pythia'
  builtin_warn(*args, **kwargs)

 | WARNING | py.warnings : .../mmf/mmf/utils/distributed.py:273: UserWarning: scheduler attributes has no params defined, defaulting to {}.
  builtin_warn(*args, **kwargs)
 | INFO | mmf.train : Loading checkpoint
 | INFO | mmf.train : Key data_parallel is not present in registry, returning default value of None
 | INFO | mmf.train : Key distributed is not present in registry, returning default value of None
 | INFO | mmf.train : Key data_parallel is not present in registry, returning default value of None
 | INFO | mmf.train : Key distributed is not present in registry, returning default value of None
 | WARNING | py.warnings : .../mmf/mmf/utils/distributed.py:273: UserWarning: 'optimizer' key is not present in the checkpoint asked to be loaded. Skipping.
  builtin_warn(*args, **kwargs)

Will those warnings affect the experiment results? And how can I resolve them? Thanks again.

SoundingSilence avatar Sep 04 '20 10:09 SoundingSilence

@xzChiang @akira-l The warnings above should be fine and should not affect training or evaluation. (They are raised because the M4C model checkpoint was saved from an earlier Pythia/MMF version and is loaded in the current version with backward compatibility.)

Meanwhile, could you verify the MD5 checksums of the feature files (via md5sum), where ${MMF_DATA} should be ~/.cache/torch/mmf/data by default?

MD5 checksum                      feature path
================================  =========================================
14a17d6e377e9e4ea5df6c1e262b3024  ${MMF_DATA}/datasets/textvqa/defaults/features/open_images/detectron.lmdb/data.mdb
3c10b1518ee6d30f166fb0fb1256c722  ${MMF_DATA}/datasets/textvqa/ocr_en/features/ocr_en_frcn_features.lmdb/data.mdb

and also the model files

MD5 checksum                      model path
================================  =========================================
877a7d29125a6ed3ff57db4f40e6bd23  ${MMF_DATA}/models/m4c.textvqa.with_stvqa/m4c.pth (for zoo model 'm4c.textvqa.with_stvqa')
877a7d29125a6ed3ff57db4f40e6bd23  ${MMF_DATA}/models/m4c.textvqa.defaults/m4c.pth (if you use 'm4c.textvqa.defaults' zoo key, which should be equivalent)

to make sure that we have the same feature and model files?
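
For example, something like the following should print the checksums to compare against the tables above (a minimal sketch, assuming the default data directory):

export MMF_DATA=~/.cache/torch/mmf/data  # adjust if you use a custom data_dir
md5sum ${MMF_DATA}/datasets/textvqa/defaults/features/open_images/detectron.lmdb/data.mdb
md5sum ${MMF_DATA}/datasets/textvqa/ocr_en/features/ocr_en_frcn_features.lmdb/data.mdb
md5sum ${MMF_DATA}/models/m4c.textvqa.with_stvqa/m4c.pth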


If your feature and model checksums match the values above, but the evaluation accuracy is still wrong, could you let me know your environment from the output of python -m torch.utils.collect_env?

ronghanghu avatar Sep 04 '20 15:09 ronghanghu

I checked the MD5 checksums. There is no mismatch in the data or checkpoints.

akira-l avatar Sep 06 '20 05:09 akira-l

@akira-l thanks for the update. Unfortunately, I still could not reproduce this error on my end. Could you share your PyTorch environment (the output of python -m torch.utils.collect_env)? Since your features are correct and up-to-date, I suspect it could be due to changes in PyTorch or other dependencies.

ronghanghu avatar Sep 06 '20 15:09 ronghanghu

I am facing the same issue. Were you able to replicate the original val accuracy of 0.4055, @akira-l @xzChiang?

shwetkm avatar May 09 '22 07:05 shwetkm

@ronghanghu This is the output of python -m torch.utils.collect_env for my environment:

Collecting environment information...
PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: NVIDIA RTX A6000
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] pytorch-lightning==1.6.0.dev0
[pip3] torch==1.8.0
[pip3] torchaudio==0.8.0a0+a751e1d
[pip3] torchmetrics==0.6.2
[pip3] torchtext==0.5.0
[pip3] torchvision==0.9.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.1.1               h6406543_8    conda-forge
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2020.2                      256  
[conda] mkl-service               2.3.0            py37h8f50634_2    conda-forge
[conda] mkl_fft                   1.2.0            py37h161383b_1    conda-forge
[conda] mkl_random                1.2.0            py37h9fdb41a_1    conda-forge
[conda] numpy                     1.19.2           py37h54aff64_0  
[conda] numpy-base                1.19.2           py37hfa32c7d_0  
[conda] pytorch                   1.8.0           py3.7_cuda11.1_cudnn8.0.5_0    pytorch
[conda] pytorch-lightning         1.6.0.dev0               pypi_0    pypi
[conda] torchaudio                0.8.0                      py37    pytorch
[conda] torchmetrics              0.6.2                    pypi_0    pypi
[conda] torchtext                 0.5.0                    pypi_0    pypi
[conda] torchvision               0.9.0                py37_cu111    pytorch

Is there anything wrong here? I am getting val/textvqa/textvqa_accuracy: 0.3484.

shwetkm avatar May 09 '22 19:05 shwetkm

Does anyone have an update on this? @akira-l @ronghanghu @xzChiang

shwetkm avatar Jun 25 '22 11:06 shwetkm