
How to interpret NLVR model outputs and input labels

quickgrid opened this issue · 2 comments

As I understand from the BLIP paper, NLVR takes a pair of images and a sentence, and predicts whether the sentence correctly describes the image pair.

I have used the following code to generate output by comparing pairs of images against their text inputs, with a total of 3 comparisons in the minibatch.

  • During predict I have to pass labels in the samples dict. Are the label values only 0 and 1 for False and True, or something else?
  • Each image pair in the minibatch produces two values in predictions. How should these prediction values be interpreted?

Code

import torch
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model, vis_processors, text_processors = load_model_and_preprocess(
    "blip_nlvr", "nlvr", device=device, is_eval=True
)

# Minibatch of 3 comparisons: image0[i] is paired with image1[i] and the
# i-th sentence. Random tensors stand in for preprocessed 384x384 images.
samples = {
    "image0": torch.randn((3, 3, 384, 384), device=device),
    "image1": torch.randn((3, 3, 384, 384), device=device),
    "text_input": [
        "there is a car with yellow color",
        "there are cars in one of the images",
        "there are bikes in both images"
    ],
    "label": torch.tensor([0, 1, 1], device=device),
}

with torch.no_grad():
    output = model.predict(samples)

Output

{'predictions': tensor([[ 0.6208, -0.7106],
         [ 0.6987, -0.7888],
         [ 1.3222, -1.4706]], device='cuda:0'),
 'targets': tensor([0, 1, 1], device='cuda:0')}

quickgrid · Oct 17 '22 16:10

Hi @quickgrid , thanks for your interest.

You are right, a 0/1 tensor is expected: 0 for False, 1 for True. The output is a pair of binary logits per example. Applying softmax over the logits gives the probabilities for the False/True prediction.
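
For example, a minimal sketch of turning the logits above into probabilities and predicted labels (the [False, True] column ordering is my assumption from the label convention):

import torch
import torch.nn.functional as F

# Logits copied from the example output above; each row is one image pair,
# columns are assumed to score [False, True].
logits = torch.tensor([[ 0.6208, -0.7106],
                       [ 0.6987, -0.7888],
                       [ 1.3222, -1.4706]])

probs = F.softmax(logits, dim=1)  # per-pair probabilities for [False, True]
preds = probs.argmax(dim=1)       # predicted labels: 0 = False, 1 = True

For the random-tensor inputs above this predicts False for all three sentences, which is what you would expect for noise images.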

Please see our NLVR dataset module for more details.

Thanks.

dxli94 · Oct 18 '22 01:10

Thank you @dxli94 for the quick response and for clearing up my confusion.

I have looked at the linked PyTorch dataset code and an actual dataset sample. The input labels, and the outputs as binary logits over the predicted label, now make sense.

One more question: how can I get the attention map of the NLVR model over one or both images, like in this text localization example?
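
In case it helps clarify what I mean, here is a rough generic sketch of how I imagine capturing attention weights with PyTorch forward hooks; the "crossattention" name filter and the hook output format are just guesses about the model internals, not the actual LAVIS API:

# Hypothetical sketch: record outputs of cross-attention modules.
# Inspect model.named_modules() first to find the real module names.
attn_maps = {}

def make_hook(name):
    def hook(module, inputs, outputs):
        attn_maps[name] = outputs  # exact format depends on the module
    return hook

handles = [
    module.register_forward_hook(make_hook(name))
    for name, module in model.named_modules()
    if "crossattention" in name.lower()
]

with torch.no_grad():
    model.predict(samples)

for handle in handles:
    handle.remove()

Since register_forward_hook and named_modules are standard PyTorch, something like this should work once the correct module names are known, but I am not sure if LAVIS provides a built-in way like the GradCAM notebook.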

quickgrid · Oct 19 '22 05:10