LLMLingua
[Question]: Running LLMLingua with GGUF models
Describe the issue
Hi, I'm trying to make LLMLingua run properly with GGUF models (i.e., CPU only) due to RAM restrictions. I'm trying to use it as:
```python
compressor = PromptCompressor(
    device_map="cpu",
    model_name="TheBloke/Llama-2-7B-GGUF",
    model_config={"model_file": "llama-2-7b.Q4_K_M.gguf", "model_type": "llama", "gpu_layers": 0},
)
```
but I would need to add some code for loading via llama-cpp so the model loads properly (otherwise it stops at the tokenizer).

Has anyone already done this? Is it planned to be supported? Or does anyone have advice on how to proceed?
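For reference, here is a minimal sketch of the kind of loading path I mean, using llama-cpp-python directly (an assumption on my part; this is not LLMLingua integration code, just an illustration that a GGUF model on CPU can expose the per-token logprobs that perplexity-based compression relies on):

```python
# Minimal sketch (assumption): load a GGUF file on CPU with llama-cpp-python
# and read per-token logprobs. NOT LLMLingua code; it only illustrates the
# loading path that would replace the HF tokenizer/model pair.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b.Q4_K_M.gguf",  # local GGUF file
    n_gpu_layers=0,                       # CPU only
    logits_all=True,                      # keep logits for every token
    verbose=False,
)

out = llm("The quick brown fox", max_tokens=1, logprobs=1, echo=True)
# Logprobs for each prompt token (first entry is None), usable for
# perplexity-style token scoring.
print(out["choices"][0]["logprobs"]["token_logprobs"])
```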
Hi @92dev, @Technotech is currently helping to add llama-cpp support to LLMLingua. You can find more details at https://github.com/abetlen/llama-cpp-python/issues/1065 and #41.