text-generation-webui
[4bit LLaMA] Safetensors & direct-to-GPU loading
Unlike 8-bit LLaMA, it seems that the slow `torch.load` function is used to load the entire model into CPU RAM before sending it to VRAM. While I'm not concerned about memory usage, it does cause slower start times.
It'd be great if we could convert the 4bit .pt files to safetensors and load them directly to VRAM to avoid the slowdown.
Related to #177
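For anyone who wants to try the conversion in the meantime, here is a minimal sketch (file names are illustrative, and it assumes the .pt checkpoint holds a plain state dict of tensors, which is what the GPTQ-for-LLaMa quantizer appears to write):

```python
import torch
from safetensors.torch import save_file

# Load the 4-bit GPTQ checkpoint the slow way one last time (CPU pickle).
state_dict = torch.load("llama-7b-4bit.pt", map_location="cpu")

# safetensors requires contiguous tensors, so normalize them before saving.
state_dict = {k: v.contiguous() for k, v in state_dict.items()}

save_file(state_dict, "llama-7b-4bit.safetensors")
```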
I have 2x 4090s, but only 64GB of system RAM. Would this let me load 65B int4 across both devices? I understand I would need 128GB for that.
@dnhkng adding `--gpu-memory 20 20` to the remaining parameters should work. I don't know how fast it would be because I have only one GPU. The two numbers must be the same for now.
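For context, a flag like `--gpu-memory 20 20` typically ends up as a per-device memory cap for accelerate-style model splitting; a rough sketch of that idea (the helper name and the LlamaDecoderLayer choice are mine, and the webui's internals may differ):

```python
from accelerate import dispatch_model, infer_auto_device_map

def split_across_gpus(model, gib_per_gpu=(20, 20)):
    # Cap each GPU at the requested number of GiB, mirroring --gpu-memory 20 20.
    max_memory = {i: f"{gib}GiB" for i, gib in enumerate(gib_per_gpu)}
    # Keep each decoder layer on a single device instead of splitting it.
    device_map = infer_auto_device_map(
        model,
        max_memory=max_memory,
        no_split_module_classes=["LlamaDecoderLayer"],
    )
    return dispatch_model(model, device_map=device_map)
```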
That loads, yay! But it leads to the error described in #324.
Safetensors 4-bit support has been added recently.
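With that in place, the weights can be mapped straight onto the GPU instead of taking the `torch.load` detour through system RAM; a minimal sketch (path and device are illustrative):

```python
from safetensors.torch import load_file

# Loads the tensors directly onto cuda:0, avoiding a full copy in CPU RAM.
state_dict = load_file("llama-7b-4bit.safetensors", device="cuda:0")
```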