BLIP
Weird caption for a picture of a flower
I got a weird caption for a picture of a flower and don't know why :( Hoping for some advice.
Model: model_base_capfilt_large.pth sha256: 8f5187458d4d47bb87876faf3038d5947eff17475edf52cf47b62e84da0b235f
Some of the core code:
import time

import torch
from PIL import Image
from torchvision import transforms
from torchvision.transforms import InterpolationMode

from models.blip import blip_decoder  # from the BLIP repo (models/blip.py)

device = torch.device("cpu")
image_size = 224
image_path = "xxx"  # say we read the image by path

model = blip_decoder("checkpoints/model_base_capfilt_large.pth", image_size=image_size, vit="base")
model.eval()
model = model.to(device)

raw_image = Image.open(image_path).convert("RGB")
transform = transforms.Compose([
    transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
])
image = transform(raw_image).unsqueeze(0).to(device)

with torch.no_grad():
    # beam search
    t0 = time.time()
    caption = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)[0]
    cost = time.time() - t0

print(caption)
Output: dai dai dai dai dai dai dai dai dai dai dai dai dai dai dai dai
Here's the input image
Update: with the same code and model, I tested two other pictures of flowers, similar to the picture above.
Pic1: a bunch of dai dai dai dai dai dai dai dai dai dai dai dai dai
Pic2: a bunch of yellow flowers
According to Pic1's caption, the model just thinks "dai" is a kind of flower, and that these flowers are also called "daisy" :)
Thanks for posting this interesting behavior from the model, this is new to me :)
So, any advice for improvement?
Maybe tuning the params in caption = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)[0] would help.
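For example, a quick experiment over those knobs might look like the sketch below. The values are just starting points to try, and repetition_penalty is assumed to be a keyword of generate (the upstream models/blip.py exposes it for the beam-search path):

with torch.no_grad():
    # lower min_length so beam search isn't forced to pad out the caption,
    # and penalize repeated tokens (assumed values, just something to try)
    caption = model.generate(
        image,
        sample=False,
        num_beams=5,
        max_length=20,
        min_length=3,
        repetition_penalty=1.5,
    )[0]
print(caption)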
You may want to try the image captioning model finetuned on COCO.
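A minimal sketch of swapping in that checkpoint. The URL and the 384 image size are what I believe the BLIP captioning demo uses; please double-check them against the repo's README:

# sketch: load the COCO-finetuned captioning checkpoint instead of the
# pretrained one; URL and image size assumed from the BLIP captioning demo
image_size = 384
model_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_caption_capfilt_large.pth"
model = blip_decoder(pretrained=model_url, image_size=image_size, vit="base")
model.eval()
model = model.to(device)

Note that the image transform would need to be rebuilt with the new image_size before calling generate again.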
Nucleus sampling also doesn't show this behaviour. The way I see it, beam search tries to fill the min length but gets stuck repeating the same token when the picture is simple and there is not much else to say.
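For reference, the nucleus-sampling path is the sample=True branch of the same generate method (top_p=0.9 is the value the demo uses; captions will vary from run to run):

with torch.no_grad():
    # nucleus sampling instead of beam search
    caption = model.generate(image, sample=True, top_p=0.9, max_length=20, min_length=5)[0]
print(caption)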