KOSMOS-2 Entities giving null
System Info
Google Colab (T4 GPU)
!pip install -q git+https://github.com/huggingface/transformers.git accelerate bitsandbytes
Who can help?
@amy
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
from transformers import AutoProcessor, AutoModelForVision2Seq
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224", load_in_4bit=True, device_map={"":0})
import requests
from PIL import Image
prompt = "An image of"
url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.png"
image = Image.open(requests.get(url, stream=True).raw)
image
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")
# autoregressively generate completion
generated_ids = model.generate(**inputs, max_new_tokens=128)
# convert generated token IDs back to strings
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
# By default, the generated text is cleaned up and the entities are extracted.
processed_text, entities = processor.post_process_generation(generated_text)
print(processed_text)
print(entities)
This gives:
An image of a snowman warming up by a fire.
[]
Expected behavior
It should return entities, as shown in https://github.com/NielsRogge/Transformers-Tutorials/blob/master/KOSMOS-2/Inference_with_KOSMOS_2_for_multimodal_grounding.ipynb
Hi @andysingal. To use Kosmos-2 for image grounding, you have to add the special <grounding> token before the prompt, as is done in the paper. You can also use the <phrase> token to get bounding boxes for specific phrases, in the format prev_prompt_text <phrase>the_object</phrase>:
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224", device_map="cpu")
prompt_grounded = "<grounding>An image of"
prompt_refer = "An image of <phrase>a snowman</phrase>"
url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.png"
image = Image.open(requests.get(url, stream=True).raw)
image
inputs = processor(text=prompt_refer, images=image, return_tensors="pt").to(model.device)
# autoregressively generate completion
generated_ids = model.generate(**inputs, max_new_tokens=100)
input_len = inputs['input_ids'].shape[-1]
# convert generated token IDs back to strings
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
# By default, the generated text is cleaned up and the entities are extracted.
processed_text, entities = processor.post_process_generation(generated_text)
print(processed_text)
print(entities)
>>> An image of a snowman warming himself by a campfire
>>> [('a snowman', (12, 21), [(0.390625, 0.046875, 0.984375, 0.828125)]), ('a campfire', (41, 51), [(0.109375, 0.640625, 0.546875, 0.984375)])]
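If you want the grounded captioning behaviour from the original report (entities extracted from a plain caption prompt), the same pipeline works with the <grounding> prompt. Below is a minimal sketch that reuses the processor, model, image, and prompt_grounded defined above; the box-drawing part is only an illustration of how the normalized coordinates could be rendered (similar to the linked tutorial, not part of the transformers API), and the exact caption and boxes may vary.
# Sketch: reuse `processor`, `model`, `image`, and `prompt_grounded` from above
inputs = processor(text=prompt_grounded, images=image, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
caption, entities = processor.post_process_generation(generated_text)
# entities: [(phrase, (start, end), [(x1, y1, x2, y2), ...]), ...] with coordinates normalized to [0, 1]
# Optionally draw the boxes on the image for inspection:
from PIL import ImageDraw
annotated = image.convert("RGB")
draw = ImageDraw.Draw(annotated)
width, height = annotated.size
for phrase, _, boxes in entities:
    for x1, y1, x2, y2 in boxes:
        # scale normalized coordinates back to pixel coordinates
        draw.rectangle((x1 * width, y1 * height, x2 * width, y2 * height), outline="red", width=3)
        draw.text((x1 * width, y1 * height), phrase, fill="red")
annotated.save("snowman_grounded.png")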
Thank you very much