Possible to load model with low system ram?
Hi,
I'm curious whether it's possible to load a model when you don't have enough system RAM but do have enough VRAM.
I have 32 GB of system RAM and 48 GB of VRAM, but unfortunately I'm not able to load a 65B model...
I get an error like: RuntimeError: unable to mmap 33484977464 bytes
Is there a way to avoid loading into system ram first? If not where would I need to look?
Adding another swap file worked for me.
This is a limitation of the safetensors library. It insists on memory-mapping the input tensor file, which means that even though it isn't actually reading more than a little bit at once, it expects to be able to read the whole thing. So it looks like loading a .safetensors file larger than your system RAM just isn't possible right now. I'm going to look into options for allowing sharded model files.
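For anyone who wants to experiment in the meantime, here's a rough sketch of what sharding could look like: it splits one big .safetensors file into smaller pieces using the safetensors API. Note that safe_open still mmaps the source file, so this would have to run on a machine that can open the original, and exllama's loader would need changes to consume the shards. The function name, shard naming and the size threshold are just placeholders, not anything the library provides.

```python
from safetensors import safe_open
from safetensors.torch import save_file

def shard_safetensors(src, shard_size=4 * 2**30):
    # Split one large .safetensors file into pieces of roughly shard_size bytes.
    # safe_open still memory-maps the source, so this has to run on a machine
    # that can open the original file in the first place.
    shard, shard_bytes, idx = {}, 0, 0
    with safe_open(src, framework="pt", device="cpu") as f:
        for name in f.keys():
            t = f.get_tensor(name)
            shard[name] = t
            shard_bytes += t.numel() * t.element_size()
            if shard_bytes >= shard_size:
                save_file(shard, f"{src}.shard{idx:02d}.safetensors")
                shard, shard_bytes, idx = {}, 0, idx + 1
    if shard:
        save_file(shard, f"{src}.shard{idx:02d}.safetensors")
```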
It's not really a limitation of the safetensors format; it's more that the internals of torch restrict how the mmapping can be done.
More info here if you want to bypass those limitations: https://github.com/huggingface/safetensors/issues/373#issuecomment-1829513862
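For illustration (this is not necessarily what the linked comment proposes), the .safetensors layout is simple enough that you can sidestep mmap entirely with ordinary file reads: an 8-byte little-endian header length, a JSON header with dtypes, shapes and data offsets, then the raw tensor data. A minimal sketch, with only a few dtypes mapped and no error handling, where `load_without_mmap` is just a name made up for this example:

```python
import json
import struct
import torch

# Subset of the dtype strings used in safetensors headers -> torch dtypes.
DTYPES = {"F32": torch.float32, "F16": torch.float16,
          "BF16": torch.bfloat16, "I32": torch.int32}

def load_without_mmap(path, device="cuda:0"):
    # Parse the .safetensors layout by hand: 8-byte little-endian header
    # length, JSON header, then raw tensor data. Each tensor is read with a
    # plain file read and moved to the GPU before the next one is touched,
    # so only one tensor at a time sits in system RAM.
    tensors = {}
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
        data_start = 8 + header_len
        for name, info in header.items():
            if name == "__metadata__":
                continue
            begin, end = info["data_offsets"]
            f.seek(data_start + begin)
            buf = bytearray(f.read(end - begin))
            t = torch.frombuffer(buf, dtype=DTYPES[info["dtype"]]).reshape(info["shape"])
            tensors[name] = t.to(device)
    return tensors
```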
@Narsil Thanks for the technical details. If this is something you or @turboderp would like to support, maybe you could open an issue in the pytorch repo? I would do it myself, but I see they already have 12k+ open issues, so it would probably be overlooked without a precise technical description of the problem. This mechanism might be useful for other libraries as well, if it's really missing from pytorch.