
[4bit LLaMA] Safetensors & direct-to-GPU loading

Open Ronsor opened this issue 1 year ago • 4 comments

Unlike with 8-bit LLaMA, it seems the slow torch.load function is used to load the entire model into CPU RAM before it is sent to VRAM. I'm not concerned about the memory usage, but it does make startup slower.

It'd be great if we could convert the 4-bit .pt files to safetensors and load them directly to VRAM to avoid the slowdown.
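For reference, a minimal sketch of such a conversion (filenames are hypothetical, and this assumes the .pt checkpoint holds a plain state dict; it uses the safetensors library directly rather than anything already in this repo):

```python
import torch
from safetensors.torch import save_file, load_file

# Load the existing GPTQ checkpoint the slow way, once, to re-serialize it.
state_dict = torch.load("llama-13b-4bit.pt", map_location="cpu")

# safetensors requires contiguous tensors without shared storage.
state_dict = {k: v.clone().contiguous() for k, v in state_dict.items()}
save_file(state_dict, "llama-13b-4bit.safetensors")

# From then on the tensors can be memory-mapped and sent straight to the GPU:
weights = load_file("llama-13b-4bit.safetensors", device="cuda:0")
```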

Ronsor avatar Mar 13 '23 18:03 Ronsor

Related to #177

oobabooga avatar Mar 13 '23 18:03 oobabooga

I have 2x 4090s, but only 64 GB of system RAM. Would this let me load 65B int4 across both devices? I understand I would need 128 GB for that.

dnhkng avatar Mar 14 '23 15:03 dnhkng

@dnhkng adding this to your other command-line parameters

--gpu-memory 20 20

should work. I don't know how fast it will be because I only have one GPU. The two numbers must be the same for now.
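To illustrate what those two numbers correspond to, here is a rough sketch of the kind of Accelerate device map a per-GPU memory cap produces (model path hypothetical; this is an illustration of the general mechanism, not the webui's actual loading code):

```python
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("models/llama-65b")  # hypothetical path

# Build the model skeleton without allocating weights.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# "--gpu-memory 20 20" roughly corresponds to a max_memory mapping like this,
# capping each GPU at ~20 GiB and spilling the rest to CPU RAM.
device_map = infer_auto_device_map(
    model, max_memory={0: "20GiB", 1: "20GiB", "cpu": "64GiB"}
)
```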

oobabooga avatar Mar 14 '23 20:03 oobabooga

That loads, yay! But it leads to the error described in #324.

dnhkng avatar Mar 14 '23 21:03 dnhkng

Safetensors 4-bit support has been added recently.

oobabooga avatar Mar 29 '23 03:03 oobabooga