
[4bit LLaMA] Safetensors & direct-to-GPU loading

Open Ronsor opened this issue 1 year ago • 4 comments

Unlike with 8-bit LLaMA, it seems the slow torch.load function is used to load the entire model into CPU RAM before it is sent to VRAM. I'm not concerned about the memory usage, but it does make startup slower.

It'd be great if we could convert the 4-bit .pt files to safetensors and load them directly to VRAM to avoid the slowdown.
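For reference, a minimal sketch of such a conversion (filenames are hypothetical, and this assumes the .pt checkpoint holds a plain state dict; it uses the safetensors library directly rather than anything already in this repo):

```python
import torch
from safetensors.torch import save_file, load_file

# Load the existing GPTQ checkpoint the slow way, once, to re-serialize it.
state_dict = torch.load("llama-13b-4bit.pt", map_location="cpu")

# safetensors requires contiguous tensors without shared storage.
state_dict = {k: v.clone().contiguous() for k, v in state_dict.items()}
save_file(state_dict, "llama-13b-4bit.safetensors")

# From then on the tensors can be memory-mapped and sent straight to the GPU:
weights = load_file("llama-13b-4bit.safetensors", device="cuda:0")
```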

Ronsor avatar Mar 13 '23 18:03 Ronsor

Related to #177

oobabooga avatar Mar 13 '23 18:03 oobabooga

I have 2x 4090s, but only 64 GB of system RAM. Would this let me load 65B int4 across both devices? I understand I would need 128 GB for that.

dnhkng avatar Mar 14 '23 15:03 dnhkng

@dnhkng adding this to your other command-line parameters

--gpu-memory 20 20

should work. I don't know how fast it will be because I only have one GPU. The two numbers must be the same for now.
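To illustrate what those two numbers correspond to, here is a rough sketch of the kind of Accelerate device map a per-GPU memory cap produces (model path hypothetical; this is an illustration of the general mechanism, not the webui's actual loading code):

```python
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("models/llama-65b")  # hypothetical path

# Build the model skeleton without allocating weights.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# "--gpu-memory 20 20" roughly corresponds to a max_memory mapping like this,
# capping each GPU at ~20 GiB and spilling the rest to CPU RAM.
device_map = infer_auto_device_map(
    model, max_memory={0: "20GiB", 1: "20GiB", "cpu": "64GiB"}
)
```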

oobabooga avatar Mar 14 '23 20:03 oobabooga

That loads, yay! But it leads to the error described in #324.

dnhkng avatar Mar 14 '23 21:03 dnhkng

Safetensors 4-bit support has been added recently.

oobabooga avatar Mar 29 '23 03:03 oobabooga