Non-reproducible MSRVTT results - I get R@1 accuracy less than 1%
I am trying to verify/reproduce your paper's validation results without training the model myself, expecting 42.6% R@1 accuracy on MSR-VTT.
But when I follow the instructions from TRAIN_AND_VALIDATE.md (I only ran eval.sh, no training), I get results no better than random guessing: with 1000 candidates, a random ranking gives an expected R@1 of 0.1% and a median rank around 500, which is exactly what I see. See my out.log here:
```
Eval Epoch: 0, eval Video-Text Retrieval under MSRVTT test data
2024-04-21,14:07:56 | INFO | MSRVTT sim matrix size: 1000, 1000
2024-04-21,15:02:43 | INFO | Length-T: 1000, Length-V:1000
2024-04-21,15:02:47 | INFO | MSRVTT Text-to-Video:
2024-04-21,15:02:53 | INFO | >>> R@1: 0.0 - R@5: 0.6 - R@10: 0.8 - Median R: 516.0 - Mean R: 518.7
2024-04-21,15:03:00 | INFO | MSRVTT Video-to-Text:
2024-04-21,15:03:03 | INFO | >>> V2T$R@1: 0.1 - V2T$R@5: 0.6 - V2T$R@10: 0.8 - V2T$Median R: 491.0 - V2T$Mean R: 498.2
```
What I need:
Please tell me how I can select your final model for the eval script so that I can reproduce the results you published.
What I suspect is wrong:
Well, I guess the issue is that I am evaluating an untrained model here instead of your trained version. Maybe I misunderstood the instructions and the pretrained weights I downloaded are not the same as the fully trained model described in the paper.
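For what it's worth, here is the sanity check I plan to run next (a minimal sketch, assuming `video_language.pt` is a plain `torch.save` file; the file name is just the one from my script below):

```python
# Minimal sanity check (assumption: the checkpoint is a plain torch.save file).
# A real training checkpoint should carry training metadata (epoch, optimizer
# state, a state_dict); a bare pretrained cache typically will not.
import torch

ckpt = torch.load("video_language.pt", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # e.g. expect keys like 'state_dict', 'epoch', 'optimizer'
```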
I have also tried to get your final model by running my eval_msrvtt.sh script with the TRANSFORMERS_OFFLINE=0 environment variable and an empty cache_dir, in the hope of downloading the fully trained version. Strangely enough, this leads to slightly different results in my out.log:
```
2024-04-19,13:59:28 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/tokenizer_config.json to /raid/1moritz/models/languagebind/cache_dir/tmpctkzbg3u
2024-04-19,13:59:29 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/vocab.json to /raid/1moritz/models/languagebind/cache_dir/tmp6_ww7ayw
2024-04-19,13:59:29 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/merges.txt to /raid/1moritz/models/languagebind/cache_dir/tmp3g7ehptb
2024-04-19,13:59:30 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/tokenizer.json to /raid/1moritz/models/languagebind/cache_dir/tmp4h042saq
2024-04-19,13:59:31 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/special_tokens_map.json to /raid/1moritz/models/languagebind/cache_dir/tmp0exqanes
2024-04-19,13:59:31 | INFO | {'vl_ret': [{'msrvtt': <torch.utils.data.dataloader.DataLoader object at 0x7f9015f066b0>}]})
2024-04-19,13:59:31 | INFO | Eval Epoch: 0, eval Video-Text Retrieval under MSRVTT test data
2024-04-19,14:06:35 | INFO | MSRVTT sim matrix size: 1000, 1000
2024-04-19,14:06:35 | INFO | Length-T: 1000, Length-V:1000
2024-04-19,14:06:35 | INFO | MSRVTT Text-to-Video:
2024-04-19,14:06:35 | INFO | >>> R@1: 0.0 - R@5: 0.4 - R@10: 0.7 - Median R: 511.0 - Mean R: 505.5
2024-04-19,14:06:35 | INFO | MSRVTT Video-to-Text:
2024-04-19,14:06:35 | INFO | >>> V2T$R@1: 0.2 - V2T$R@5: 0.6 - V2T$R@10: 0.9 - V2T$Median R: 500.0 - V2T$Mean R: 504.9
```
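As far as I can tell, this only fetches the CLIP tokenizer files, not a trained LanguageBind checkpoint. If the released model lives on the Hugging Face Hub, I would have expected to be able to fetch it with something like the sketch below (I am guessing the repo id, please correct me):

```python
# Sketch only: the repo id is my assumption, not confirmed by the docs.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="LanguageBind/LanguageBind_Video_FT",  # assumed repo id
    cache_dir="/raid/1moritz/models/languagebind/cache_dir",
)
print(local_dir)
```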
How to reproduce:
I follow TRAIN_AND_VALIDATE.md.
- Download the cache of pretrained weights from your Google Drive and specify CACHE_DIR.
- Download MSRVTT from the source mentioned in TRAIN_AND_VALIDATE.md.
- Change the `data_root` here.
- Make minimal changes to `eval.sh` and save it as `eval_msrvtt.sh`. Then execute the script.
This is my `eval_msrvtt.sh`:

```bash
CACHE_DIR="/raid/1moritz/models/languagebind/cache_dir"
RESUME="video_language.pt"  # NOTE: defined here but never passed below; --resume "latest" is used instead
ANNOTATION="path/to/data"
# this script is for 640 total batch_size (n(16) GPUs * batch_size(10) * accum_freq(4))
cd /srv/home/1moritz/Repositories/LanguageBind
# TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_addr $CHIEF_IP \
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nproc_per_node 1 \
    -m main \
    --train-data ${ANNOTATION} \
    --train-num-samples 3020000 \
    --clip-type "vl" --add-time-attn \
    --lock-text --lock-image --text-type "polish_mplug" \
    --init-temp 0.07 --learn-temp \
    --model "ViT-L-14" --cache-dir ${CACHE_DIR} \
    --convert_to_lora --lora_r 16 \
    --lr 1e-4 --coef-lr 1 \
    --beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
    --num-frames 8 --force-patch-dropout 0.3 \
    --epochs 16 --batch-size 10 --accum-freq 4 --warmup 2000 \
    --precision "amp" --workers 10 --video-decode-backend "imgs" \
    --save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume "latest" \
    --do_eval \
    --val_vl_ret_data "msrvtt"
```
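As far as I understand open_clip-style training code, `--resume "latest"` only looks for checkpoints produced by a previous training run, so without any training there is nothing to load. If the downloaded weights are meant to be loaded here, my unverified guess is that the flag should point at them explicitly, something like:

```bash
# Unverified guess: load the downloaded checkpoint explicitly instead of "latest".
# Whether video_language.pt actually sits inside CACHE_DIR is also my assumption.
--resume "${CACHE_DIR}/${RESUME}" \
```

But I don't know whether `video_language.pt` from the Google Drive cache is actually the fully trained model from the paper, hence this issue.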
Hi @lennartmoritz, I'm currently using this model for my project and I'm having the same issue with `eval_msrvtt.sh`.
I wrote my own script for model evaluation. Unfortunately, the FT models do not show the expected results, but the large models are fine (LanguageBind_Video, LanguageBind_Audio).
You may try running my script; it gave me around 41.50 R@1, 65.80 R@5, 75.50 R@10.
```python
from collections import defaultdict

import numpy as np
import pandas as pd
import torch
from more_itertools import chunked
from tqdm.auto import tqdm

from languagebind import LanguageBind, to_device, transform_dict, LanguageBindImageTokenizer


def compute_metrics(x):
    # Standard CLIP4Clip-style retrieval metrics from a query-by-candidate
    # similarity matrix whose diagonal holds the ground-truth pairs.
    sx = np.sort(-x, axis=1)        # each query's similarities, best first
    d = np.diag(-x)[:, np.newaxis]  # each query's similarity to its ground-truth match
    ind = np.where(sx - d == 0)[1]  # rank (0-based) of the ground-truth match per query
    metrics = {}
    metrics['R1'] = float(np.sum(ind == 0)) * 100 / len(ind)
    metrics['R5'] = float(np.sum(ind < 5)) * 100 / len(ind)
    metrics['R10'] = float(np.sum(ind < 10)) * 100 / len(ind)
    metrics['MR'] = np.median(ind) + 1
    metrics['MedianR'] = metrics['MR']
    metrics['MeanR'] = np.mean(ind) + 1
    return metrics


def main():
    device = torch.device('cuda:0')
    clip_type = {
        'video': 'LanguageBind_Video',  # or 'LanguageBind_Video_FT'
        'audio': 'LanguageBind_Audio',  # or 'LanguageBind_Audio_FT'
        # 'image': 'LanguageBind_Image',
        # 'thermal': 'LanguageBind_Thermal',
        # 'depth': 'LanguageBind_Depth',
    }
    model = LanguageBind(clip_type=clip_type, cache_dir='./cache_dir').to(device)
    model.eval()
    tokenizer = LanguageBindImageTokenizer.from_pretrained(
        'lb203/LanguageBind_Image', cache_dir='./cache_dir/tokenizer_cache_dir')
    modality_transform = {c: transform_dict[c](model.modality_config[c]) for c in clip_type.keys()}

    df = pd.read_csv('../data/MSRVTT/MSRVTT_JSFUSION_test.csv')
    language_data = df['sentence'].values.tolist()
    video_data = df['video_id'].apply(lambda x: f'../data/MSRVTT/videos/all/{x}.mp4').values.tolist()

    def embed(x: list[list], dtypes: list[str]) -> dict[str, np.ndarray]:
        inputs = {}
        for data, dtype in zip(x, dtypes):
            if dtype == 'language':
                inputs['language'] = to_device(
                    tokenizer(data, max_length=77, padding='max_length',
                              truncation=True, return_tensors='pt'), device)
            elif dtype in ['image', 'video', 'audio', 'depth', 'thermal']:
                inputs[dtype] = to_device(modality_transform[dtype](data), device)
            else:
                raise ValueError(f'Unknown dtype: {dtype}')
        with torch.no_grad():
            embeddings = model(inputs)
        return {k: v.detach().cpu().numpy() for k, v in embeddings.items()}

    batch_size = 16
    # Accumulators start as empty (0, 768) arrays so np.concatenate works on the first batch.
    results = defaultdict(lambda: np.empty((0, 768)))
    for batch in tqdm(list(zip(
        chunked(language_data, batch_size),
        chunked(video_data, batch_size)
    ))):
        embeddings = embed(batch, dtypes=['language', 'video'])
        results['language'] = np.concatenate([results['language'], embeddings['language']])
        results['video'] = np.concatenate([results['video'], embeddings['video']])

    video = results['video']
    language = results['language']
    np.save('experiments/MSR-VTT_test_video_embeddings.npy', video)
    np.save('experiments/MSR-VTT_test_language_embeddings.npy', language)

    sim_matrix = video @ language.T
    print('VT', compute_metrics(sim_matrix))
    print('TV', compute_metrics(sim_matrix.T))


if __name__ == '__main__':
    main()
```
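To sanity-check the metric code itself (a toy example, not part of the eval): on an identity similarity matrix every ground-truth pair ranks first, so all recalls come out at 100:

```python
# Toy check: a perfect diagonal similarity matrix should give R@1 == 100.
import numpy as np
print(compute_metrics(np.eye(5)))
# {'R1': 100.0, 'R5': 100.0, 'R10': 100.0, 'MR': 1.0, 'MedianR': 1.0, 'MeanR': 1.0}
```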
Hey @e1four15f, thank you for your code example. In the meantime, I wrote a similar script based on the inference example from the repo, but I've noticed that it is considerably slower than the eval script was. I suspect this has to do with the batch sizes used. Have you found a good way to select a batch size for inference with your script?