KOSMOS-2 Entities giving null
System Info
Google Colab (T4 GPU)
!pip install -q git+https://github.com/huggingface/transformers.git accelerate bitsandbytes
Who can help?
@amy
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
from transformers import AutoProcessor, AutoModelForVision2Seq
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224", load_in_4bit=True, device_map={"":0})
import requests
from PIL import Image
prompt = "An image of"
url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.png"
image = Image.open(requests.get(url, stream=True).raw)
image
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")
# autoregressively generate completion
generated_ids = model.generate(**inputs, max_new_tokens=128)
# convert generated token IDs back to strings
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
# By default, the generated text is cleaned up and the entities are extracted.
processed_text, entities = processor.post_process_generation(generated_text)
print(processed_text)
print(entities)
This gives:
An image of a snowman warming up by a fire.
[]
Expected behavior
It should return entities, as shown in https://github.com/NielsRogge/Transformers-Tutorials/blob/master/KOSMOS-2/Inference_with_KOSMOS_2_for_multimodal_grounding.ipynb
Hi @andysingal. To use Kosmos-2 for image grounding, you have to add the special <grounding> token before the prompt, as is done in the paper. You can also use the <phrase> token to get bounding boxes for specific phrases, in the format prev_prompt_text <phrase>the_object</phrase>:
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224", device_map="cpu")
prompt_grounded = "<grounding>An image of"
prompt_refer = "An image of <phrase>a snowman</phrase>"
url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.png"
image = Image.open(requests.get(url, stream=True).raw)
image
inputs = processor(text=prompt_refer, images=image, return_tensors="pt").to(model.device)
# autoregressively generate completion
generated_ids = model.generate(**inputs, max_new_tokens=100)
input_len = inputs['input_ids'].shape[-1]
# convert generated token IDs back to strings
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
# By default, the generated text is cleaned up and the entities are extracted.
processed_text, entities = processor.post_process_generation(generated_text)
print(processed_text)
print(entities)
>>> An image of a snowman warming himself by a campfire
>>> [('a snowman', (12, 21), [(0.390625, 0.046875, 0.984375, 0.828125)]), ('a campfire', (41, 51), [(0.109375, 0.640625, 0.546875, 0.984375)])]
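If you want the grounded captioning behaviour from the original report (entities extracted from a plain caption prompt), the same pipeline works with the <grounding> prompt. Below is a minimal sketch that reuses the processor, model, image, and prompt_grounded defined above; the box-drawing part is only an illustration of how the normalized coordinates could be rendered (similar to the linked tutorial, not part of the transformers API), and the exact caption and boxes may vary.
# Sketch: reuse `processor`, `model`, `image`, and `prompt_grounded` from above
inputs = processor(text=prompt_grounded, images=image, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
caption, entities = processor.post_process_generation(generated_text)
# entities: [(phrase, (start, end), [(x1, y1, x2, y2), ...]), ...] with coordinates normalized to [0, 1]
# Optionally draw the boxes on the image for inspection:
from PIL import ImageDraw
annotated = image.convert("RGB")
draw = ImageDraw.Draw(annotated)
width, height = annotated.size
for phrase, _, boxes in entities:
    for x1, y1, x2, y2 in boxes:
        # scale normalized coordinates back to pixel coordinates
        draw.rectangle((x1 * width, y1 * height, x2 * width, y2 * height), outline="red", width=3)
        draw.text((x1 * width, y1 * height), phrase, fill="red")
annotated.save("snowman_grounded.png")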
Thank you very much