bitsandbytes
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmEx
System Info
I am using CUDA 12.2, torch 2.1.0a0+29c30b1, bitsandbytes 0.43.3, and Python 3.10. Driver version: 535.113.01. GPU: NVIDIA GeForce RTX 2080 Ti.
Reproduction
import gc

import imageio
import torch
from diffusers import LattePipeline
from torchvision.utils import save_image
from transformers import BitsAndBytesConfig, T5EncoderModel

torch.manual_seed(0)


def flush():
    gc.collect()
    torch.cuda.empty_cache()


def bytes_to_giga_bytes(bytes):
    return bytes / 1024 / 1024 / 1024


video_length = 16
model_id = "maxin-cn/Latte-1"

# Load only the text encoder, quantized to 4-bit.
text_encoder = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
    device_map="auto",
    cache_dir="/data/"
)
pipe = LattePipeline.from_pretrained(
    model_id,
    text_encoder=text_encoder,
    transformer=None,
    device_map="balanced",
    cache_dir="/data/"
)

with torch.no_grad():
    prompt = "a cat wearing sunglasses and working as a lifeguard at pool."
    negative_prompt = ""
    prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(prompt, negative_prompt=negative_prompt)

# Free the text encoder before loading the rest of the pipeline.
del text_encoder
del pipe
flush()

pipe = LattePipeline.from_pretrained(
    model_id,
    text_encoder=None,
    torch_dtype=torch.float16,
    cache_dir="/data/",
).to("cuda")
# pipe.enable_vae_tiling()
# pipe.enable_vae_slicing()

videos = pipe(
    video_length=video_length,
    num_inference_steps=50,
    negative_prompt=None,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    output_type="pt",
).frames.cpu()

print(f"Max memory allocated: {bytes_to_giga_bytes(torch.cuda.max_memory_allocated())} GB")

if video_length > 1:
    videos = (videos.clamp(0, 1) * 255).to(dtype=torch.uint8)  # convert to uint8
    imageio.mimwrite('./latte_output.mp4', videos[0].permute(0, 2, 3, 1), fps=8, quality=5)  # highest quality is 10, lowest is 0
else:
    save_image(videos[0], './latte_output.png')
https://github.com/Vchitect/Latte/issues/125#issue-2529714919
Expected behavior
What is the GPU requirement for 4-bit quantization? Reference: https://huggingface.co/docs/bitsandbytes/v0.43.3/installation
Hi @LukeLIN-web, I was not able to reproduce this on an RTX 4090. That said, I would also expect it to work on a 2080 Ti, as that GPU is fully supported for 4-bit quantization with bitsandbytes.
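When comparing results across GPUs like this, it can help to print what PyTorch itself sees. The following is a minimal sketch, not part of the original report; `describe_cuda_environment` is a hypothetical helper name, and it only uses standard `torch` and `torch.cuda` queries.

```python
def describe_cuda_environment() -> dict:
    """Collect the versions and device details that PyTorch reports."""
    info = {
        "torch": None,
        "cuda_runtime": None,
        "device": None,
        "compute_capability": None,
    }
    try:
        import torch
    except ImportError:
        # Without torch there is nothing more to report.
        return info

    info["torch"] = torch.__version__
    info["cuda_runtime"] = torch.version.cuda  # None on CPU-only builds
    if torch.cuda.is_available():
        info["device"] = torch.cuda.get_device_name(0)
        major, minor = torch.cuda.get_device_capability(0)
        info["compute_capability"] = f"{major}.{minor}"
    return info


if __name__ == "__main__":
    for key, value in describe_cuda_environment().items():
        print(f"{key}: {value}")
```

An RTX 2080 Ti is a Turing card, so the compute capability line should read 7.5 on the reporter's machine.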
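I suspect your stack trace is not giving the full picture, as we do not use cublasGemmEx in 4-bit. The error may come from a PyTorch operation instead. CUDA kernels launch asynchronously, so the Python frame shown in the trace is often not the one that actually failed; setting CUDA_LAUNCH_BLOCKING=1 in your environment should give a clearer trace.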
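If exporting the variable in the shell is inconvenient, one option is to set it at the very top of the reproduction script. This is a sketch under the assumption that nothing has initialized CUDA yet; setting it before `import torch` is the safe choice. The helper name `enable_blocking_launches` is hypothetical.

```python
import os


def enable_blocking_launches() -> str:
    """Request synchronous kernel launches so errors surface at the failing call."""
    # The variable is read when CUDA is initialized, so set it first.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return os.environ["CUDA_LAUNCH_BLOCKING"]


if __name__ == "__main__":
    print("CUDA_LAUNCH_BLOCKING =", enable_blocking_launches())
    # Import torch and run the reproduction only after this point.
```

Synchronous launches slow everything down, so this is meant for debugging runs only.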
Hi @LukeLIN-web,
Thank you for providing detailed information. Upon reviewing the details of the issue, it appears that the problem might not be directly related to the bitsandbytes library. I will be closing this issue for now. If you find that there is indeed a problem with the bitsandbytes library, feel free to reopen the issue with additional details.
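One way to test whether the error occurs without bitsandbytes at all is to run a plain half-precision matrix multiply in PyTorch. The sketch below assumes that an fp16 matmul on this setup is dispatched to cuBLAS, which is typical but not confirmed anywhere in this thread; `check_fp16_matmul` is a hypothetical helper name.

```python
def check_fp16_matmul(size: int = 64) -> str:
    """Run one float16 matmul on the GPU and report whether it succeeds."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "no CUDA device"

    a = torch.randn(size, size, device="cuda", dtype=torch.float16)
    b = torch.randn(size, size, device="cuda", dtype=torch.float16)
    try:
        # .item() copies the result to the host, which forces synchronization
        # so that any asynchronous CUDA error is raised here.
        (a @ b).sum().item()
    except RuntimeError as err:
        return f"failed: {err}"
    return "ok"


if __name__ == "__main__":
    print(check_fp16_matmul())
```

If this fails with the same CUBLAS_STATUS_NOT_SUPPORTED message, the problem is more likely in the PyTorch build or CUDA installation than in the quantized text encoder.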