LLMLingua
[Question]: Running LLMLingua with GGUF models
Describe the issue
Hi, I'm trying to make LLMLingua run properly with GGUF models (i.e., CPU only) due to RAM restrictions. I'm trying to use it as:
```python
compressor = PromptCompressor(
    device_map="cpu",
    model_name="TheBloke/Llama-2-7B-GGUF",
    model_config={"model_file": "llama-2-7b.Q4_K_M.gguf", "model_type": "llama", "gpu_layers": 0},
)
```
but I would need to add some code for loading via llama-cpp so the model loads properly (otherwise it stops at the tokenizer).

Has anyone already done this? Is it planned to be supported? Or does anyone have advice on how to proceed?
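For reference, here is a minimal sketch of the kind of loading path I mean, using llama-cpp-python directly (an assumption on my part; this is not LLMLingua integration code, just an illustration that a GGUF model on CPU can expose the per-token logprobs that perplexity-based compression relies on):

```python
# Minimal sketch (assumption): load a GGUF file on CPU with llama-cpp-python
# and read per-token logprobs. NOT LLMLingua code; it only illustrates the
# loading path that would replace the HF tokenizer/model pair.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b.Q4_K_M.gguf",  # local GGUF file
    n_gpu_layers=0,                       # CPU only
    logits_all=True,                      # keep logits for every token
    verbose=False,
)

out = llm("The quick brown fox", max_tokens=1, logprobs=1, echo=True)
# Logprobs for each prompt token (first entry is None), usable for
# perplexity-style token scoring.
print(out["choices"][0]["logprobs"]["token_logprobs"])
```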
Hi @92dev, @Technotech is currently helping to add llama-cpp support to LLMLingua. You can find more details at https://github.com/abetlen/llama-cpp-python/issues/1065 and #41.