
Text tokenizer difference between forward and extract_features

Open · s7ev3n opened this issue on Aug 17, 2023 · 3 comments

Hi,

I noticed that in blip2_qformer.py, the forward function truncates text_tokens to a max_length of 32, while the extract_features function (which, to my understanding, is used at inference time) does not truncate them, so the text tokens there can be much longer than during training (i.e., in forward).

May I ask why there is this difference? In particular, I do not understand why the text tokens are restricted to 32 during training.
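For reference, the two tokenizer calls look roughly like this (a simplified sketch; the exact argument names in blip2_qformer.py may differ slightly between LAVIS versions):

```python
# forward() (training): captions are padded/truncated to max_txt_len (32 by default)
text_tokens = self.tokenizer(
    text,
    padding="max_length",
    truncation=True,
    max_length=self.max_txt_len,  # 32
    return_tensors="pt",
).to(image.device)

# extract_features() (inference): no truncation, padded to the longest caption in the batch
text_tokens = self.tokenizer(
    caption,
    return_tensors="pt",
    padding=True,
).to(self.device)
```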

Looking forward to the answer :) Thanks

s7ev3n · Aug 17 '23

+1 to this. When I use the BLIP model's extract_features function, I get differently sized sequence dimensions for the returned text embeddings across batches (sometimes (B, 19, 768), sometimes (B, 21, 768)). I call it as: features_multimodal_txt = self.model.extract_features(sample_copy, mode="text").text_embeds

Shouldn't it all be padded to the same max length?
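Until then, one workaround is to zero-pad the returned embeddings to a fixed length before concatenating across batches (a sketch; pad_text_embeds is a hypothetical helper, not part of LAVIS):

```python
import torch.nn.functional as F

def pad_text_embeds(text_embeds, target_len):
    """Zero-pad (B, L, D) text embeddings along the sequence dim up to target_len."""
    pad = target_len - text_embeds.shape[1]
    return F.pad(text_embeds, (0, 0, 0, pad)) if pad > 0 else text_embeds

# example usage: make shapes consistent before concatenating across batches
feats = model.extract_features(sample_copy, mode="text").text_embeds  # (B, L, 768), L varies
feats = pad_text_embeds(feats, target_len=32)                         # (B, 32, 768)
```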

gunshi · Nov 09 '23

Hello all,

I am facing the same problem. Did you manage to find any workaround?

Thanks a lot ;)

billpsomas · Jan 26 '24

Grab the first token returned; it corresponds to the [CLS] token. This is standard practice for BERT-style transformer encoders. See this notebook, where they grab the first token: https://github.com/salesforce/LAVIS/blob/main/examples/blip2_feature_extraction.ipynb
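Concretely, something like this (a sketch following that notebook; attribute names come from the returned feature object, and the projected dimension is 256 for BLIP-2):

```python
features_text = model.extract_features(sample, mode="text")

# Take the first ([CLS]) token as the sentence-level representation,
# so the result no longer depends on caption length.
text_feat = features_text.text_embeds[:, 0, :]            # (B, 768)
text_feat_proj = features_text.text_embeds_proj[:, 0, :]  # (B, 256), used for ITC similarity
```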

philkuz · Mar 27 '24