How to set up LLMLingua with localhost?
Hello, how do I set up LLMLingua with a self-hosted localhost server? Is there a tutorial? Thanks.
Hi @JiHa-Kim,
Thank you for your support. I suggest using the code of the Hugging Face Space demo as a reference; from there you can build a self-hosted local server with Gradio.
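As a rough illustration (not the official demo code), a minimal Gradio server wrapping LLMLingua's PromptCompressor might look like the sketch below; the port, widget labels, and reliance on the default compression model are my own assumptions.

```python
# Minimal self-hosted compression demo (sketch). Assumes llmlingua and gradio
# are installed; the port and UI labels are illustrative choices.
import gradio as gr
from llmlingua import PromptCompressor

# Loads the default small language model used for compression;
# pass device_map="cpu" if no GPU is available.
llm_lingua = PromptCompressor()

def compress(prompt: str, target_token: float) -> str:
    result = llm_lingua.compress_prompt(prompt, target_token=int(target_token))
    return result["compressed_prompt"]

demo = gr.Interface(
    fn=compress,
    inputs=[
        gr.Textbox(lines=10, label="Prompt"),
        gr.Number(value=200, label="Target tokens"),
    ],
    outputs=gr.Textbox(label="Compressed prompt"),
)

# server_name="0.0.0.0" exposes the app on localhost and the local network.
demo.launch(server_name="0.0.0.0", server_port=7860)
```

Running the script and opening http://localhost:7860 in a browser then gives a simple compression UI.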
How do you use the GGUF format instead of GPTQ? Can you use LM Studio to host it? It would be great to run inference split across CPU and GPU.
Also, how do you get it to work with an AI API endpoint? I keep getting this error:
compressed_prompt = llm_lingua.compress_prompt(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Lib\site-packages\llmlingua\prompt_compressor.py", line 143, in compress_prompt
context_tokens_length = [self.get_token_length(c) for c in context]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Lib\site-packages\llmlingua\prompt_compressor.py", line 143, in <listcomp>
context_tokens_length = [self.get_token_length(c) for c in context]
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Lib\site-packages\llmlingua\prompt_compressor.py", line 254, in get_token_length
self.tokenizer(text, add_special_tokens=add_special_tokens).input_ids
^^^^^^^^^^^^^^
AttributeError: 'OpenRouterPromptCompressor' object has no attribute 'tokenizer'
You can look at the code I tried to use in my GitHub repository...
Hi @JiHa-Kim, thank you for your help and efforts.
I haven't tried using GGUF with LLMLingua yet, but I believe there shouldn't be any major blocking issues. Also, a special thanks to @TechnotechGit, who is currently assisting in making llama.cpp compatible with LLMLingua. I'm confident this will facilitate support for models in GGUF format.
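For reference, loading a GGUF model with llama-cpp-python and splitting it between CPU and GPU looks roughly like the sketch below; the model path and layer count are placeholders, and hooking this into LLMLingua still depends on the compatibility work mentioned above.

```python
# Sketch: loading a GGUF model with llama-cpp-python, offloading part of it to GPU.
# The path and n_gpu_layers value are placeholders; logits_all=True keeps per-token
# logits, which perplexity-based compression would need.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b.Q4_K_M.gguf",  # placeholder GGUF file
    n_gpu_layers=20,   # layers offloaded to GPU; the rest run on CPU
    n_ctx=4096,        # context window
    logits_all=True,
)

output = llm("Compress this prompt example.", max_tokens=16)
print(output["choices"][0]["text"])
```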
Regarding the second issue, it seems to stem from the lack of a defined tokenizer in OpenRouterPromptCompressor. You might try initializing a tokenizer using tiktoken. However, I suspect there might be some additional errors to address later on.
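A minimal sketch of that idea, assuming an OpenRouterPromptCompressor class like the one in your traceback (only the tokenizer-related pieces are shown, and the cl100k_base encoding is an assumption that may not match the remote model):

```python
# Sketch: giving a custom compressor class a tokenizer based on tiktoken.
# Only the tokenizer-related pieces are shown; "cl100k_base" is an assumed
# encoding and may not match the model served by the API endpoint.
import tiktoken

class OpenRouterPromptCompressor:
    def __init__(self):
        self.tokenizer = tiktoken.get_encoding("cl100k_base")

    def get_token_length(self, text: str, add_special_tokens: bool = True) -> int:
        # tiktoken has no notion of special tokens here, so the flag is ignored.
        return len(self.tokenizer.encode(text))
```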
Thanks, I am excited to get this working properly.
Well, it seems like I managed to get the model loaded using llama-cpp-python with the new code in my repository, but now I hit this error and I am stuck.
Traceback (most recent call last):
File "C:\Users\Public\Coding\LLMLingua\LLMLingua_test1.py", line 48, in <module>
compressed_prompt = llm_lingua.compress_prompt(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Lib\site-packages\llmlingua\local_prompt_compressor.py", line 230, in compress_prompt
context = self.iterative_compress_prompt(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Lib\site-packages\llmlingua\local_prompt_compressor.py", line 712, in iterative_compress_prompt
loss, past_key_values = self.get_ppl(
^^^^^^^^^^^^^
File "C:\Program Files\Lib\site-packages\llmlingua\local_prompt_compressor.py", line 83, in get_ppl
response = self.model(
^^^^^^^^^^^
TypeError: Llama.__call__() got an unexpected keyword argument 'attention_mask'
Hi @JiHa-Kim, currently, calling the llama.cpp model may not be supported, or it might require modifying the model call in PromptCompressor.
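Purely as an illustration of that last point, and not a complete fix: one could wrap the llama-cpp-python model in a thin adapter whose __call__ drops the Transformers-style keyword arguments that Llama.__call__ rejects. The class name below is made up, and get_ppl would still need further changes, since llama-cpp-python does not return Hugging Face-style logits or past_key_values.

```python
# Illustrative sketch only: an adapter whose __call__ discards keyword arguments
# that llama-cpp-python's Llama.__call__ rejects (e.g. attention_mask). This does
# NOT make get_ppl work as-is, because it also expects HF-style logits and
# past_key_values in the returned object.
from llama_cpp import Llama

class LlamaCallAdapter:
    def __init__(self, model_path: str, **llama_kwargs):
        self.model = Llama(model_path=model_path, logits_all=True, **llama_kwargs)

    def __call__(self, prompt, **kwargs):
        # Drop Transformers-only arguments before delegating to llama.cpp.
        for unsupported in ("attention_mask", "past_key_values", "use_cache"):
            kwargs.pop(unsupported, None)
        return self.model(prompt, **kwargs)
```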