Possible to load model with low system ram?
Hi,
I'm curious whether it's possible to load a model when you don't have enough system RAM but do have enough VRAM.
I have 32 GB of system RAM and 48 GB of VRAM, but unfortunately I'm not able to load a 65B model...
I get an error like: RuntimeError: unable to mmap 33484977464 bytes
Is there a way to avoid loading into system ram first? If not where would I need to look?
Adding another swap file worked for me.
This is a limitation of the safetensors library. It insists on memory-mapping the input tensor file, which means that even though it isn't actually reading more than a little bit at once, it expects to be able to read the whole thing. So it looks like loading a .safetensors file larger than your system RAM just isn't possible right now. I'm going to look into options for allowing sharded model files.
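For anyone who wants to experiment in the meantime, here's a rough sketch of what sharding could look like: it splits one big .safetensors file into smaller pieces using the safetensors API. Note that safe_open still mmaps the source file, so this would have to run on a machine that can open the original, and exllama's loader would need changes to consume the shards. The function name, shard naming and the size threshold are just placeholders, not anything the library provides.

```python
from safetensors import safe_open
from safetensors.torch import save_file

def shard_safetensors(src, shard_size=4 * 2**30):
    # Split one large .safetensors file into pieces of roughly shard_size bytes.
    # safe_open still memory-maps the source, so this has to run on a machine
    # that can open the original file in the first place.
    shard, shard_bytes, idx = {}, 0, 0
    with safe_open(src, framework="pt", device="cpu") as f:
        for name in f.keys():
            t = f.get_tensor(name)
            shard[name] = t
            shard_bytes += t.numel() * t.element_size()
            if shard_bytes >= shard_size:
                save_file(shard, f"{src}.shard{idx:02d}.safetensors")
                shard, shard_bytes, idx = {}, 0, idx + 1
    if shard:
        save_file(shard, f"{src}.shard{idx:02d}.safetensors")
```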
It's not really a limitation of the safetensors format; it's more that the internals of torch restrict how the mmapping can be done.
More info here if you want to bypass those limitations: https://github.com/huggingface/safetensors/issues/373#issuecomment-1829513862
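For illustration (this is not necessarily what the linked comment proposes), the .safetensors layout is simple enough that you can sidestep mmap entirely with ordinary file reads: an 8-byte little-endian header length, a JSON header with dtypes, shapes and data offsets, then the raw tensor data. A minimal sketch, with only a few dtypes mapped and no error handling, where `load_without_mmap` is just a name made up for this example:

```python
import json
import struct
import torch

# Subset of the dtype strings used in safetensors headers -> torch dtypes.
DTYPES = {"F32": torch.float32, "F16": torch.float16,
          "BF16": torch.bfloat16, "I32": torch.int32}

def load_without_mmap(path, device="cuda:0"):
    # Parse the .safetensors layout by hand: 8-byte little-endian header
    # length, JSON header, then raw tensor data. Each tensor is read with a
    # plain file read and moved to the GPU before the next one is touched,
    # so only one tensor at a time sits in system RAM.
    tensors = {}
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
        data_start = 8 + header_len
        for name, info in header.items():
            if name == "__metadata__":
                continue
            begin, end = info["data_offsets"]
            f.seek(data_start + begin)
            buf = bytearray(f.read(end - begin))
            t = torch.frombuffer(buf, dtype=DTYPES[info["dtype"]]).reshape(info["shape"])
            tensors[name] = t.to(device)
    return tensors
```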
@Narsil Thanks for the technical details. If this is something you or @turboderp would like to support, maybe you could open an issue in the pytorch repo? I would do it myself, but I see they already have 12k+ open issues, so it would probably be overlooked without a precise technical description of the problem. This mechanism might be useful for other libraries as well, if it's really missing from pytorch.