
Possible to load model with low system ram?

Open gros87 opened this issue 2 years ago • 4 comments

Hi,

I'm curious whether it's possible to load a model if you don't have enough system RAM but do have enough VRAM. I have 32GB of system RAM and 48GB of VRAM, but unfortunately I'm not able to load a 65B model... I get an error like: RuntimeError: unable to mmap 33484977464 bytes

Is there a way to avoid loading into system RAM first? If not, where would I need to look?

gros87 avatar Aug 12 '23 11:08 gros87

Adding more swap worked for me.

Empor-co avatar Aug 15 '23 09:08 Empor-co

This is a limitation of the safetensors library. It insists on memory-mapping the input tensor file, so even though it never actually reads more than a little at a time, it expects to be able to map the whole file at once. So it looks like loading a .safetensors file larger than your system RAM just isn't possible right now. I'm going to look into options for allowing sharded model files.

turboderp avatar Aug 15 '23 09:08 turboderp
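For anyone who hits this before sharded files are supported: the .safetensors format itself can be read incrementally with ordinary file I/O; it's the standard loading path that insists on mmap. Below is a minimal sketch of that idea, assuming the documented safetensors layout (an 8-byte little-endian header length, a JSON header with shapes/dtypes/offsets, then raw tensor data). The function name stream_safetensors, the model.safetensors path, and the dtype table are illustrative only; this is not the loader exllama actually uses.

```python
# Minimal sketch: stream tensors out of a .safetensors file with plain reads
# instead of mmap, so only one tensor sits in system RAM at a time.
# Layout assumed (per the safetensors spec): 8-byte little-endian header
# length, then a JSON header describing each tensor, then raw data.
import json
import struct

import torch

# Subset of safetensors dtype strings -> torch dtypes (illustrative only)
DTYPES = {
    "F32": torch.float32,
    "F16": torch.float16,
    "BF16": torch.bfloat16,
    "I64": torch.int64,
    "I32": torch.int32,
    "U8": torch.uint8,
    "BOOL": torch.bool,
}

def stream_safetensors(path, device="cuda:0"):
    """Yield (name, tensor) pairs, copying each tensor straight to `device`."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
        data_start = 8 + header_len
        for name, meta in header.items():
            if name == "__metadata__":  # optional metadata block, not a tensor
                continue
            begin, end = meta["data_offsets"]  # offsets relative to data_start
            f.seek(data_start + begin)
            raw = bytearray(f.read(end - begin))  # writable buffer for torch
            tensor = torch.frombuffer(raw, dtype=DTYPES[meta["dtype"]])
            yield name, tensor.reshape(meta["shape"]).to(device)

# Usage: peak host memory is bounded by the largest single tensor, not the file.
# state_dict = dict(stream_safetensors("model.safetensors"))
```

With this approach only one tensor lives in system RAM at a time before being copied to the GPU, so host memory use stays far below the size of the checkpoint.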

This isn't really a limitation of the safetensors format; it's more about the internals of torch limiting how the file can be mmapped.

More info here if you want to bypass those limitations: https://github.com/huggingface/safetensors/issues/373#issuecomment-1829513862

Narsil avatar Nov 28 '23 13:11 Narsil

@Narsil Thanks for the technical details. If this is something you or @turboderp would like to support, maybe you could create an issue at the pytorch repo? I would do it, but I see that they already have 12k+ issues, so it would probably be overlooked without a precise technical description of what the issue is. It might be useful for other libraries to have this mechanism as well, if it's missing in pytorch.

erikschul avatar Nov 28 '23 14:11 erikschul