
Is there support for loading a sharded GGUF file?

Open jharonfe opened this issue 1 year ago • 5 comments

Is your feature request related to a problem? Please describe. Inquiring whether this project supports loading a "sharded" GGUF model file. The llama.cpp project appears to have added tooling for splitting GGUF files into pieces (more here). I was curious whether this project supports loading GGUF files in that format, since I didn't see any mention of it in the documentation or issues.

If it is supported, could you point me to the documentation on this or provide a code example? If not, perhaps this feature could be added?

jharonfe avatar Apr 12 '24 20:04 jharonfe

@jharonfe I haven't tested it personally but according to the linked discussion it's automatically detected by llama_load_model_from_file which llama-cpp-python uses.

One caveat is that this probably doesn't work with .from_pretrained yet, because that method looks for a single file to pull via the huggingface_hub library. I think adding an option like additional_files there would be good; I'll look into it.
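In the meantime, a minimal sketch of the manual route (untested; the repo id and shard filenames below are placeholders for whatever gguf-split actually produced):

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Hypothetical repo and shard names; gguf-split emits files named like
# model-00001-of-00003.gguf. Adjust to the real repo layout.
repo_id = "someuser/some-model-GGUF"
shards = [f"model-{i:05d}-of-00003.gguf" for i in range(1, 4)]

# Download every shard into the local Hugging Face cache.
paths = [hf_hub_download(repo_id=repo_id, filename=name) for name in shards]

# Point Llama at the first shard; the loader should find the remaining
# pieces in the same directory automatically.
llm = Llama(model_path=paths[0], n_gpu_layers=-1)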

One thing that would help is a link to a small model that's been split and uploaded, preferably <7B.

abetlen avatar Apr 13 '24 01:04 abetlen

I tested this using the latest version, 0.2.61, and the model appears to load correctly. Thanks for the feedback on this.

jharonfe avatar Apr 16 '24 20:04 jharonfe

I just hit this issue today. I tried using a wildcard to specify all of the files, but it complained about seeing multiple files. An option like additional_files would be a nice quality-of-life change.

ryao avatar Apr 18 '24 13:04 ryao

Prototype: https://github.com/abetlen/llama-cpp-python/commit/0e67a832bdda3d7e73cbcb162e93d5455503ca57 Will test and open PR if it works.

Gnurro avatar May 11 '24 20:05 Gnurro

PR opened: https://github.com/abetlen/llama-cpp-python/pull/1457
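If that lands, usage might look something like the following (a sketch assuming the additional_files parameter proposed in the PR; repo id and shard names are placeholders):

from llama_cpp import Llama

# Sketch only: relies on the additional_files option from the linked PR.
llm = Llama.from_pretrained(
    repo_id="someuser/some-model-GGUF",
    filename="model-00001-of-00003.gguf",  # first shard, handed to the loader
    additional_files=[
        "model-00002-of-00003.gguf",       # remaining shards, downloaded alongside
        "model-00003-of-00003.gguf",
    ],
    n_gpu_layers=-1,
)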

Gnurro avatar May 14 '24 14:05 Gnurro

I would appreciate this feature! Currently I do something very hacky like the snippet below; it works (once all files are downloaded), but it's pretty clunky.

from llama_cpp import Llama

# Clunky workaround: call from_pretrained once per shard so that each file
# gets pulled into the local cache. Load attempts fail while shards are
# still missing, so those errors are swallowed.
for i in range(1, 6 + 1):  # assuming the model is split into 6 shards
    try:
        llm = Llama.from_pretrained(
            repo_id="MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4-GGUF",
            filename=f"*IQ4_XS-{i:05d}*",
            verbose=True,
            n_gpu_layers=-1,
        )
    except Exception:
        # Expected until every shard has been downloaded; ignore and continue.
        pass
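
A slightly less clunky variant might fetch all shards in one call and then load the first one directly (untested sketch; the shard-name pattern is a guess at how the repo names its splits):

import glob
import os

from huggingface_hub import snapshot_download
from llama_cpp import Llama

# Pull every IQ4_XS shard in a single call instead of looping.
local_dir = snapshot_download(
    repo_id="MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4-GGUF",
    allow_patterns=["*IQ4_XS*"],
)

# Assuming gguf-split naming ("...-00001-of-0000N.gguf"), load the first
# shard and let llama.cpp pick up the rest from the same directory.
first_shard = sorted(glob.glob(os.path.join(local_dir, "*IQ4_XS*00001-of-*.gguf")))[0]
llm = Llama(model_path=first_shard, n_gpu_layers=-1, verbose=True)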

ozanciga avatar Jun 07 '24 20:06 ozanciga

I would appreciate this feature! Currently I do something very hacky like the snippet below; it works (once all files are downloaded), but it's pretty clunky.

Well, my PR has been sitting there for a month now. You might try applying its code changes yourself in the meantime to get rid of the jank.

Gnurro avatar Jun 19 '24 21:06 Gnurro