Runtime error in metric 'scene' when decoding with BertLMHeadModel
When I tried to evaluate a bunch of generated videos on the metric 'scene', I encountered the following problem:
```
  File "/xxx/anaconda3/envs/vbench/lib/python3.10/site-packages/vbench/third_party/tag2Text/tag2text.py", line 192, in generate
    outputs = self.text_decoder.generate(input_ids=input_ids,
  File "/xxx/anaconda3/envs/vbench/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
    raise AttributeError(
AttributeError: 'BertLMHeadModel' object has no attribute 'generate'
```
And the cause seems clear to me:
- In the function `compute_scene` of `scene.py`, we define the model through the function `tag2text_caption` in `third_party/tag2Text/tag2text.py`, which is linked to the module `Tag2Text_Caption`.
- In `Tag2Text_Caption`, we set `self.text_decoder = BertLMHeadModel(config=decoder_config)`, and call `self.text_decoder.generate` in the function `generate` regardless of whether `sample=True`.
- `BertLMHeadModel` in `third_party/tag2Text/med.py` does not actually define a function `generate`, and its ancestors `BertPreTrainedModel` and `PreTrainedModel` do not define `generate` either.
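A minimal, transformers-free sketch of why this `AttributeError` can appear: in recent `transformers` releases, `generate` is provided by `GenerationMixin` and is no longer inherited automatically by every `PreTrainedModel` subclass, so a custom decoder class written against an older layout loses the method. The class names below are illustrative stand-ins, not the real `transformers` classes.

```python
# Illustrative stand-ins only: `generate` lives on a mixin, and a subclass
# has the method only if the mixin is somewhere in its ancestry.
class GenerationMixin:
    def generate(self, input_ids):
        return f"decoded({input_ids})"

class OldPreTrainedModel(GenerationMixin):  # older layout: mixin baked in
    pass

class NewPreTrainedModel:  # newer layout: mixin must be added explicitly
    pass

class OldDecoder(OldPreTrainedModel):
    pass

class NewDecoder(NewPreTrainedModel):
    pass

print(hasattr(OldDecoder(), "generate"))  # True
print(hasattr(NewDecoder(), "generate"))  # False -> AttributeError when called
```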
Could anyone help me solve this problem?
I'm encountering the same problem.
@WenkunHe @Lihui-Gu Hi, may I know what version of transformers you are using?
@yinanhe I met the same problem as well, and I am using `transformers==4.33.2`.
@yingShen-ys Hello, after testing, transformers version 4.33.2 performs inference normally. You can refer to this issue: https://github.com/xinyu1205/recognize-anything/issues/218.
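Since the fix is version-sensitive, a quick sanity check against the version reported working in this thread (4.33.2) may help. `version_tuple` and `is_known_good` below are hypothetical helpers; in real code, `packaging.version.parse` would be the more robust choice.

```python
# Hypothetical helper: compare an installed transformers version string
# against the one this thread reports as working (4.33.2).
def version_tuple(v):
    # "4.33.2" -> (4, 33, 2); assumes plain numeric components
    return tuple(int(part) for part in v.split("."))

def is_known_good(installed, known_good="4.33.2"):
    return version_tuple(installed) == version_tuple(known_good)

print(is_known_good("4.33.2"))  # True
print(is_known_good("4.40.0"))  # False
```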
@yinanhe
Thank you for the help! I also encountered the following issue, where many parameters appear not to be properly initialized. This also happened when evaluating the metric 'scene'. Is this expected behavior? I installed vbench via `pip install vbench`.
```
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel:
['bert.encoder.layer.3.attention.output.LayerNorm.bias', .... ,'bert.encoder.layer.3.attention.self.query.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized:
['bert.encoder.layer.0.crossattention.self.value.weight',..., 'bert.encoder.layer.1.crossattention.self.value.bias', 'bert.encoder.layer.0.crossattention.output.dense.weight', 'bert.encoder.layer.1.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.0.crossattention.output.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 30524. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
load checkpoint from /home/.cache/vbench/caption_model/tag2text_swin_14m.pth
```
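For what it's worth, these warnings are usually benign here: Tag2Text adds cross-attention layers that the `bert-base-uncased` checkpoint never contained, so they start randomly initialized and are then overwritten when `tag2text_swin_14m.pth` is loaded afterwards. A dependency-free sketch of how such "newly initialized" keys arise (the key names below are made up for illustration):

```python
# Keys present in the model but absent from the checkpoint are the ones
# reported as "newly initialized" (illustrative key names only).
model_keys = {
    "encoder.layer.0.attention.self.query.weight",       # in bert-base-uncased
    "encoder.layer.0.crossattention.self.query.weight",  # added by Tag2Text
}
checkpoint_keys = {"encoder.layer.0.attention.self.query.weight"}

newly_initialized = sorted(model_keys - checkpoint_keys)
print(newly_initialized)  # ['encoder.layer.0.crossattention.self.query.weight']
```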
probably also related to #151