transformers
Blip2ForConditionalGeneration.from_pretrained is limited to 100% CPU utilization (one single core)
System Info
- `transformers` version: 4.27.0.dev0
- Platform: Linux-5.19.0-31-generic-x86_64-with-glibc2.36
- Python version: 3.10.6
- Huggingface_hub version: 0.12.0
- PyTorch version (GPU?): 2.0.0.dev20230209+cu118 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Run this code on a computer with a strong GPU and a strong CPU:
```python
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda"

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl", load_in_8bit=True, device_map="auto")
with torch.device("cuda"):
    model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", load_in_8bit=True, device_map="auto")

for i in range(1, 923):
    raw_image = Image.open('UIDimgs/' + str(i) + '.jpg').convert('RGB')
    inputs = processor(raw_image, return_tensors="pt").to(device, torch.float16)
    out = model.generate(**inputs, max_length=64, min_length=20)
    print(i, ': ', processor.decode(out[0], skip_special_tokens=True))
```
Expected behavior
Hello! When running the above code, the utilization of my RTX 4090 is only around 30%, while my CPU constantly sits at 100% of a single core. Unfortunately, Python here only uses one single core of my AMD 5900X (12 cores / 24 threads). Can anyone see an error in my code? How can I get the code to use more than one CPU core?
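One generic thing worth checking (a sketch under the assumption that the CPU-side work is bound by the OpenMP/MKL thread pools, which is not confirmed anywhere in this thread): the thread-count environment variables are read when `torch` and `bitsandbytes` are first imported, so they only have an effect if set beforehand. The value `12` below is an assumption matching the 5900X's physical core count; tune it to your machine.

```python
import os

# These knobs are read by OpenMP/MKL at library-import time, so they must
# be set before importing torch or bitsandbytes to have any effect.
# "12" is an assumed value (physical cores of a Ryzen 5900X).
os.environ["OMP_NUM_THREADS"] = "12"
os.environ["MKL_NUM_THREADS"] = "12"

# After importing torch you can also raise its intra-op pool directly:
#   import torch
#   torch.set_num_threads(12)
print(os.environ["OMP_NUM_THREADS"])
```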
cc @younesbelkada
Hello @Marcophono2, thanks for the issue. Can you try:

```python
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda"

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", load_in_8bit=True, device_map="auto")
print(model.hf_device_map)

for i in range(1, 923):
    raw_image = Image.open('UIDimgs/' + str(i) + '.jpg').convert('RGB')
    inputs = processor(raw_image, return_tensors="pt").to(device, torch.float16)
    out = model.generate(**inputs, max_length=64, min_length=20)
    print(i, ': ', processor.decode(out[0], skip_special_tokens=True))
```

And let me know what you get for `print(model.hf_device_map)`.
Thank you, @younesbelkada! The result I get is `{'': 0}`.
This is a bit strange, @Marcophono2: `{'': 0}` indicates that the entire model is on the GPU device. Can you confirm with us the VRAM size of your GPU?
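For context on how to read `hf_device_map`: an accelerate-style device map assigns module-name prefixes to devices, and the empty-string prefix matches every submodule. The helper below is hypothetical, written here only to illustrate that lookup semantics (it is not an accelerate API):

```python
def device_for(param_name, device_map):
    """Resolve the device for a module name by longest-prefix match,
    the way an accelerate-style device map is interpreted.
    ('' is a prefix of everything, so {'': 0} puts the whole model on GPU 0.)
    """
    best = None
    for prefix in device_map:
        if prefix == "" or param_name == prefix or param_name.startswith(prefix + "."):
            if best is None or len(prefix) > len(best):
                best = prefix
    return device_map[best]

# With the map reported above, every submodule lands on GPU 0:
print(device_for("vision_model.encoder.layers.0", {"": 0}))    # 0
# A split map would send different submodules to different devices:
split = {"vision_model": 0, "language_model": "cpu"}
print(device_for("language_model.decoder.block.3", split))     # cpu
```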
Also, I would replace:

```python
with torch.device("cuda"):
    model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", load_in_8bit=True, device_map="auto")
```

with:

```python
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", load_in_8bit=True, device_map="auto")
```

Also make sure to use the latest `accelerate` and `bitsandbytes` versions:

```
pip install --upgrade accelerate bitsandbytes
```
Yes, that is correct, @younesbelkada, the entire model is in the VRAM (RTX 4090). There is not much space left, but it fits. ;) I had already tried it before without `with torch.device("cuda"):`. I updated `accelerate` from 0.16 to 0.17 (`bitsandbytes` was up to date), but no difference. Meanwhile I am not sure anymore whether this 100% CPU usage is really a "limit". When I analyze how the load is distributed, I can see that sometimes two cores are working, one at 40%, the other at 61% (as an example). Then it would just be coincidence. But what would then be the bottleneck that keeps my GPU utilization below 32%?
It seems that loading the model in 8-bit is the reason for the 100% CPU (one core/thread) limitation. I replaced the code with

```python
model3 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16).to("cuda")
```

and now the CPU uses up to 200% while the GPU usage is at 60%. Still not perfect, but double the performance. However, I do not want to use the 2.7b model. :-) I want to use the blip2-flan-t5-xxl model, which is too large for my VRAM unless I use the 8-bit version. Does anyone have an idea how I can activate the other CPU cores when using 8-bit?
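Independent of the 8-bit question, one generic way to keep the GPU busier in a per-image loop is to overlap the single-threaded CPU preprocessing with the GPU's generate call, so the GPU is not idle while the next image is being prepared. A minimal sketch of that prefetch pattern, where `preprocess` and `generate` are hypothetical stand-ins for `processor(raw_image, ...)` and `model.generate(**inputs)`:

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(i):      # stand-in for CPU-bound processor(raw_image, ...)
    return i * 2

def generate(batch):    # stand-in for GPU-bound model.generate(**inputs)
    return batch + 1

# Prefetch the next item on a worker thread while the "GPU" handles the
# current one, so single-threaded preprocessing no longer serializes the loop.
results = []
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(preprocess, 1)
    for i in range(2, 6):
        inputs = future.result()              # wait for the prefetched item
        future = pool.submit(preprocess, i)   # start preparing the next one
        results.append(generate(inputs))
    results.append(generate(future.result())) # drain the last prefetch
print(results)  # [3, 5, 7, 9, 11]
```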
Sorry @ArthurZucker, but since you seem to be very close to the core of this, maybe you have an idea for this issue I posted last week, too?
Hey, I think setting `device_map="auto"` should help balance the load of the flan-t5-xxl model across both CPU and GPU. This should allow you to run on both. You need the `accelerate` library for this to work! Would that fix your issue?
Nope, @ArthurZucker. I already have `device_map="auto"` in my code. Or do you mean I should add it somewhere else too? `accelerate` is installed as well.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@sgugger @Marcophono2 is there any way I can run `Blip2ForConditionalGeneration` on an M1 Pro chip without a GPU? Just for testing purposes.
It's unrelated to this issue, but setting `device_map="auto"` should suffice.