Missing parameter settings in BLIP 2
System Info
- `transformers` version: 4.27.0.dev0
- Platform: Linux-5.19.0-31-generic-x86_64-with-glibc2.36
- Python version: 3.10.6
- Huggingface_hub version: 0.12.0
- PyTorch version (GPU?): 2.0.0.dev20230209+cu118 (True)
- Tensorflow version (GPU?): 2.11.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
The following code works, but it cannot take parameters such as nucleus sampling, length penalty, or temperature, which are available in the original project from Salesforce (to try out at https://huggingface.co/spaces/Salesforce/BLIP2).
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration

device = "cuda"

processor3 = AutoProcessor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
with torch.device("cuda"):
    model3 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", load_in_8bit=True, device_map={'': torch.cuda.current_device()})

raw_image = Image.open('UIDimgsages/x.jpg').convert('RGB')
inputs = processor3(raw_image, return_tensors="pt").to(device, torch.float16)
out = model3.generate(**inputs, max_length=64, min_length=12)
blip2_output = processor3.decode(out[0], skip_special_tokens=True)
print(blip2_output)
Expected behavior
It should be possible to adjust all parameters that are available in the original BLIP-2 project. @ArthurZucker @amyeroberts
Best regards Marc
Hey! Did you try playing with `generation_config`? All the arguments you are looking for can either be set up there or provided as `generate` kwargs. Temperature and length penalty are both available 😉 Not sure about nucleus sampling, but what you are looking for is probably here or here. Tell me if you can't find what you were looking for!
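For example, a minimal sketch of both options (the checkpoint, image path, and parameter values below are only illustrative):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration, GenerationConfig

processor = AutoProcessor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("x.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)

# Option 1: pass generation arguments directly to generate()
out = model.generate(**inputs, num_beams=5, length_penalty=1.2, max_new_tokens=64)

# Option 2: collect them in a GenerationConfig (nucleus sampling here) and reuse it
gen_config = GenerationConfig(do_sample=True, top_p=0.9, temperature=0.8, max_new_tokens=64)
out = model.generate(**inputs, generation_config=gen_config)

print(processor.decode(out[0], skip_special_tokens=True))
```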
@ArthurZucker , that sounds wonderful! I have no idea why I missed this at least a dozen times. :) I will try it out later today. Thank you very much!
It's pretty hard for us to debug if there's no error message being given. :(
Also, BLIP-2 should support all arguments of the `generate` method, and there's no need to use the `with torch.device("cuda")` context manager, as this might break the code. The `device_map` argument of the `from_pretrained` method will take care of placing everything on the appropriate device.
Refer to the example code snippets shown at the bottom of model cards like this one on the Hub for 8-bit usage.
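For reference, a minimal sketch of that loading pattern (`device_map="auto"` is one common choice; the model card snippet may use a slightly different call):

```python
from transformers import AutoProcessor, Blip2ForConditionalGeneration

processor = AutoProcessor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl",
    load_in_8bit=True,
    device_map="auto",  # weight placement is handled here; no torch.device context manager needed
)
# inputs can then be moved to model.device before calling generate()
```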
Thank you, @NielsRogge, I tried the example code as the very first thing, but had the same problem. To confirm, I tried it again. Without setting a minimum length the inference is so fast that I wouldn't mind the performance issue, but when adding a minimum length this bottleneck is really annoying. I can confirm that CPU usage during inference is exactly 100% for the inference job: either one CPU thread (out of 24) is at 100%, or one is at 66% and a second at 33%. It is caused by the 8-bit setting. Can anyone confirm (or refute) my observations?
Oh! I think I answered the wrong comment or in the wrong thread. Or you answered in the wrong thread, @NielsRogge! 😄 Maybe you were referring to my other thread, https://github.com/huggingface/transformers/issues/22011?
@Marcophono2 did you figure out nice settings to use? I also switched from the BLIP codebase to the transformers version, and the generated captions are not as good. There is a lot of repetition. I've tried the default, contrastive search, multinomial sampling, beam search, and diverse beam search, and still haven't found settings that give consistent captions like the old BLIP library.
@pharmapsychotic Wow, Mr Clip-Interrogator! I love your tool and use it very often! Unfortunately, I didn't find a solution for better control over BLIP2. I also switched back from transformers to the native codebase, since I realized that opt2.7b works as well as flan-t5-xxl (for me at least) and I can fit it into my 4090's VRAM without an 8-bit conversion. The inference time is much shorter now, about 0.6 seconds when using the standard length, and I have some more control over the settings, except for longer length combined with sensible output. By now I think there is no real solution for that; the captions in the training sets are simply too short. The only thing I could imagine is to ask specific questions in a second step, depending on the (short) standard output of BLIP2. Another "workaround" I use in the meantime is to additionally analyse an image with CLIP against pre-defined points of interest. Feature extraction is a mighty tool for a lot of things here. For example, to estimate the age of a person I use feature extraction + classification like
# classes "age of 1 year", "age of 2 years", ..., "age of 103 years"
cls_namesA = [f"age of {i} year" + ("s" if i > 1 else "") for i in range(1, 104)]
with filtering and a second classification in a second step.
That works extremely fast and well, also for other points of interest. I found that ViT-B-32 gives the best results.
from lavis.models import load_model_and_preprocess
modelC, vis_processors2, txt_processors2 = load_model_and_preprocess("clip_feature_extractor", model_type="ViT-B-32", is_eval=True, device=device)
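For illustration, a minimal sketch of such a zero-shot classification step, using the transformers CLIP implementation instead of the LAVIS feature extractor above (the checkpoint name and image path are placeholders, not necessarily what was used here):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# candidate classes, e.g. the age prompts defined above
cls_namesA = [f"age of {i} year" + ("s" if i > 1 else "") for i in range(1, 104)]

image = Image.open("person.jpg").convert("RGB")
inputs = clip_processor(text=cls_namesA, images=image, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    outputs = clip_model(**inputs)

# image-text similarity scores -> probabilities over the candidate classes
probs = outputs.logits_per_image.softmax(dim=-1)[0]
best = probs.argmax().item()
print(cls_namesA[best], f"({probs[best].item():.2%})")
```

The finer-grained second classification pass can then rerun the same scoring with a narrowed set of candidate prompts around the top result.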
Best regards Marc
Thanks for reporting, we are looking into why this is the case. cc @gante
Hi, I wonder what I should do if I would like to generate multiple captions for each image?
For example, we could use `use_nucleus_sampling` in the LAVIS version of BLIP2 to accomplish that, but I haven't found a way in the Hugging Face version of BLIP2.
generated_text = model.generate({"image": image}, use_nucleus_sampling=True, num_captions=20)
Oh yes, one reason the results weren't the same is that you might have used different generation settings. Note that if you do `model.generate(**inputs)`, greedy decoding is used by default (the simplest form of generating text, taking the token with the highest probability at each time step).
To match the settings in the BLIP-2 repo, which uses beam search by default as seen here, you can do `model.generate(**inputs, num_beams=5, max_new_tokens=30, repetition_penalty=1.0, length_penalty=1.0, temperature=1)`. To use nucleus sampling, you can do `model.generate(**inputs, do_sample=True, top_p=0.9)`.
I've had really good success with BLIP2 since it came out a couple of months ago, and I'm now rebuilding my notebooks on transformers. However, being new to transformers, it would be nice to have `num_captions` natively available, as it is this feature that makes captioning powerful on my end.
Hi @rodrigo-barraza this is supported, just pass in `num_return_sequences` as an argument to the `generate()` method.
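For example, a small sketch that combines nucleus sampling with `num_return_sequences` to get several captions per image, roughly mirroring the LAVIS call quoted earlier (the checkpoint and image path are placeholders):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration

processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)

# nucleus sampling + multiple return sequences, similar to use_nucleus_sampling=True, num_captions=20 in LAVIS
out = model.generate(**inputs, do_sample=True, top_p=0.9, max_new_tokens=30, num_return_sequences=20)
for caption in processor.batch_decode(out, skip_special_tokens=True):
    print(caption)
```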
> Hi @rodrigo-barraza this is supported, just pass in `num_return_sequences` as an argument to the `generate()` method.
Oh wow, amazing. Not sure how I missed that. Thanks a bunch!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.