LAVIS
BLIP2 Text Localization
Hey all,
First of all, thanks for the cool project and the shared checkpoints. I was wondering whether there is any way to extract attention maps with respect to all query tokens from the Q-Former module. In theory it should still have a cross-attention module similar to the one BLIP had in its text encoder's base model, but I can't find a way to access this information with standard forward hooks.
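To illustrate what I'm after, here is a minimal, self-contained sketch of the forward-hook approach on a hypothetical stand-in cross-attention block (the class, module names, and shapes — 32 query tokens, 257 image patch features — are my assumptions, not LAVIS code; in practice the hook would go on the actual cross-attention submodule inside the Q-Former):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for one Q-Former cross-attention block:
# learned query tokens attend over frozen image features.
class CrossAttentionBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries, image_feats):
        out, weights = self.attn(
            queries, image_feats, image_feats,
            need_weights=True, average_attn_weights=True,
        )
        return out, weights

captured = {}

def grab_attention(module, inputs, output):
    # nn.MultiheadAttention returns (attn_output, attn_weights)
    captured["attn"] = output[1].detach()

block = CrossAttentionBlock()
handle = block.attn.register_forward_hook(grab_attention)

queries = torch.randn(1, 32, 768)       # 32 learned query tokens
image_feats = torch.randn(1, 257, 768)  # e.g. ViT patch embeddings + CLS
_ = block(queries, image_feats)
handle.remove()

print(captured["attn"].shape)  # (1, 32, 257): per-query attention over patches
```

Each row of the captured map is the attention distribution of one query token over the image features, which is what I'd hope to visualize for localization. Is there an equivalent hook point (or an `output_attentions`-style flag) exposed for the real Q-Former?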
All help is appreciated! Thanks a lot, David