transformers
Blip2ForConditionalGeneration.from_pretrained is limited to 100% CPU utilization (one single core)
System Info
- `transformers` version: 4.27.0.dev0
- Platform: Linux-5.19.0-31-generic-x86_64-with-glibc2.36
- Python version: 3.10.6
- Huggingface_hub version: 0.12.0
- PyTorch version (GPU?): 2.0.0.dev20230209+cu118 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Run this code on a computer with a strong GPU and a strong CPU:
```python
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda"

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl", load_in_8bit=True, device_map="auto")
with torch.device("cuda"):
    model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", load_in_8bit=True, device_map="auto")

for i in range(1, 923):
    raw_image = Image.open('UIDimgs/' + str(i) + '.jpg').convert('RGB')
    inputs = processor(raw_image, return_tensors="pt").to(device, torch.float16)
    out = model.generate(**inputs, max_length=64, min_length=20)
    print(i, ': ', processor.decode(out[0], skip_special_tokens=True))
```
Expected behavior
Hello! When running the above code, the utilization of my RTX 4090 is only around 30%, while my CPU constantly sits at 100% of a single core. Unfortunately, Python here only uses one single core of my AMD 5900X (12 cores / 24 threads). Can anyone see an error in my code? How can I get the code to use more than one CPU core?
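One generic thing worth checking (a sketch under the assumption that the CPU-side work is bound by the OpenMP/MKL thread pools, which is not confirmed anywhere in this thread): the thread-count environment variables are read when `torch` and `bitsandbytes` are first imported, so they only have an effect if set beforehand. The value `12` below is an assumption matching the 5900X's physical core count; tune it to your machine.

```python
import os

# These knobs are read by OpenMP/MKL at library-import time, so they must
# be set before importing torch or bitsandbytes to have any effect.
# "12" is an assumed value (physical cores of a Ryzen 5900X).
os.environ["OMP_NUM_THREADS"] = "12"
os.environ["MKL_NUM_THREADS"] = "12"

# After importing torch you can also raise its intra-op pool directly:
#   import torch
#   torch.set_num_threads(12)
print(os.environ["OMP_NUM_THREADS"])
```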
cc @younesbelkada
Hello @Marcophono2, thanks for the issue. Can you try:

```python
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda"

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", load_in_8bit=True, device_map="auto")
print(model.hf_device_map)

for i in range(1, 923):
    raw_image = Image.open('UIDimgs/' + str(i) + '.jpg').convert('RGB')
    inputs = processor(raw_image, return_tensors="pt").to(device, torch.float16)
    out = model.generate(**inputs, max_length=64, min_length=20)
    print(i, ': ', processor.decode(out[0], skip_special_tokens=True))
```

And let me know what you get for `print(model.hf_device_map)`.
Thank you, @younesbelkada! The result I get is `{'': 0}`.
This is a bit strange, @Marcophono2: `{'': 0}` indicates that the entire model is on the GPU device. Can you confirm with us the VRAM size of your GPU?
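For context on how to read `hf_device_map`: an accelerate-style device map assigns module-name prefixes to devices, and the empty-string prefix matches every submodule. The helper below is hypothetical, written here only to illustrate that lookup semantics (it is not an accelerate API):

```python
def device_for(param_name, device_map):
    """Resolve the device for a module name by longest-prefix match,
    the way an accelerate-style device map is interpreted.
    ('' is a prefix of everything, so {'': 0} puts the whole model on GPU 0.)
    """
    best = None
    for prefix in device_map:
        if prefix == "" or param_name == prefix or param_name.startswith(prefix + "."):
            if best is None or len(prefix) > len(best):
                best = prefix
    return device_map[best]

# With the map reported above, every submodule lands on GPU 0:
print(device_for("vision_model.encoder.layers.0", {"": 0}))    # 0
# A split map would send different submodules to different devices:
split = {"vision_model": 0, "language_model": "cpu"}
print(device_for("language_model.decoder.block.3", split))     # cpu
```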
Also, I would replace:

```python
with torch.device("cuda"):
    model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", load_in_8bit=True, device_map="auto")
```

with:

```python
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", load_in_8bit=True, device_map="auto")
```

Also make sure to use the latest `accelerate` and `bitsandbytes` versions:

```
pip install --upgrade accelerate bitsandbytes
```
Yes, that is correct, @younesbelkada, the entire model is in the VRAM (RTX 4090). There is not much space left, but it fits. ;) I had already tried it before without `with torch.device("cuda"):`. I updated `accelerate` from 0.16 to 0.17 (`bitsandbytes` was up to date), but no difference. Meanwhile I am not sure anymore whether this 100% CPU usage is really a "limit". When I analyze how the load is distributed, I can see that sometimes two cores are working, one at 40%, the other at 61% (as an example). Then it would just be coincidence. But what would then be the bottleneck that keeps my GPU utilization below 32%?
It seems that loading the model in 8-bit is the reason for the 100% CPU (one core/thread) limitation. I replaced the code with

```python
model3 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16).to("cuda")
```

and now the CPU uses up to 200% while the GPU usage is at 60%. Still not perfect, but double the performance. However, I do not want to use the 2.7b model. :-) I want to use the blip2-flan-t5-xxl model, which is too large for my VRAM unless I use the 8-bit version. Does anyone have an idea how I can activate the other CPU cores when using 8-bit?
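Independent of the 8-bit question, one generic way to keep the GPU busier in a per-image loop is to overlap the single-threaded CPU preprocessing with the GPU's generate call, so the GPU is not idle while the next image is being prepared. A minimal sketch of that prefetch pattern, where `preprocess` and `generate` are hypothetical stand-ins for `processor(raw_image, ...)` and `model.generate(**inputs)`:

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(i):      # stand-in for CPU-bound processor(raw_image, ...)
    return i * 2

def generate(batch):    # stand-in for GPU-bound model.generate(**inputs)
    return batch + 1

# Prefetch the next item on a worker thread while the "GPU" handles the
# current one, so single-threaded preprocessing no longer serializes the loop.
results = []
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(preprocess, 1)
    for i in range(2, 6):
        inputs = future.result()              # wait for the prefetched item
        future = pool.submit(preprocess, i)   # start preparing the next one
        results.append(generate(inputs))
    results.append(generate(future.result())) # drain the last prefetch
print(results)  # [3, 5, 7, 9, 11]
```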
Sorry @ArthurZucker, but since you seem to be very close to the core of this, maybe you have an idea for this issue I posted last week, too?
Hey, I think setting `device_map="auto"` should help balance the load of the flan-t5-xxl model across both CPU and GPU. This should allow you to run on both. You need the `accelerate` library for this to work! Would that fix your issue?
Nope, @ArthurZucker. I already have `device_map="auto"` in my code. Or do you mean I should add it somewhere else too? `accelerate` is installed as well.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@sgugger @Marcophono2 is there any way I can run `Blip2ForConditionalGeneration` on an M1 Pro chip without a GPU? Just for testing purposes.
It's unrelated to this issue, but setting `device_map="auto"` should suffice.