LAVIS
How to interpret NLVR model outputs and input labels
As I understand from the BLIP paper, NLVR takes a pair of images and a sentence about them, and predicts whether the sentence correctly describes the image pair.
I have used the following code to generate output by comparing two images against their text input, with a total of 3 comparisons in the minibatch.
- During predict I have to pass label in the samples dict. Are the label values only 0 and 1 for False and True, or something else?
- Each image pair in the minibatch outputs two values in predictions. How should these prediction values be interpreted?
Code
import torch
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model, vis_processors, text_processors = load_model_and_preprocess(
    "blip_nlvr", "nlvr", device=device, is_eval=True
)
samples = {
    "image0": torch.randn((3, 3, 384, 384), device=device),
    "image1": torch.randn((3, 3, 384, 384), device=device),
    "text_input": [
        "there is a car with yellow color",
        "there are cars in one of the images",
        "there are bikes in both images",
    ],
    "label": torch.tensor([0, 1, 1], device=device),
}
with torch.no_grad():
    output = model.predict(samples)
Output
{'predictions': tensor([[ 0.6208, -0.7106],
[ 0.6987, -0.7888],
[ 1.3222, -1.4706]], device='cuda:0'),
'targets': tensor([0, 1, 1], device='cuda:0')}
Hi @quickgrid , thanks for your interest.
You are right - a 0/1 tensor is expected: 0 for False, 1 for True. The output is a binary logit per image pair. If you apply softmax to the logits, you get the probabilities for the False / True prediction.
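As a minimal sketch of the interpretation described above, assuming the two columns of the predictions tensor correspond to False and True in that order (matching the 0/1 labels), the logits from the Output section can be converted to probabilities like this:

```python
import torch
import torch.nn.functional as F

# Logits copied from output["predictions"] above (one row per image pair).
logits = torch.tensor([[0.6208, -0.7106],
                       [0.6987, -0.7888],
                       [1.3222, -1.4706]])

# Softmax over the last dim gives [P(False), P(True)] for each pair.
probs = F.softmax(logits, dim=-1)

# The predicted label is the argmax: 0 = False, 1 = True.
preds = probs.argmax(dim=-1)
```

Since the first logit is larger in every row here, all three pairs are predicted as False, which is unsurprising given the images were random noise.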
Please see our NLVR dataset module for more details.
Thanks.
Thank you @dxli94 for the quick response and for clearing up my confusion.
I have looked at the linked PyTorch dataset code and an actual dataset sample. Now the input labels and the outputs as predicted-label binary logits make sense.
One more question: how can I get the attention map of the NLVR model over one or both images, like in this text localization example?