How to set up LLMLingua with localhost?
Hello, how do I set up LLMLingua with a self-hosted localhost server? Is there a tutorial? Thanks.
Hi @JiHa-Kim,
Thank you for your support. I suggest using the code of the Hugging Face Space demo as a reference; from there you can build a self-hosted local server with Gradio.
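As a rough illustration (not the official demo code), a minimal Gradio server wrapping LLMLingua's PromptCompressor might look like the sketch below; the port, widget labels, and reliance on the default compression model are my own assumptions.

```python
# Minimal self-hosted compression demo (sketch). Assumes llmlingua and gradio
# are installed; the port and UI labels are illustrative choices.
import gradio as gr
from llmlingua import PromptCompressor

# Loads the default small language model used for compression;
# pass device_map="cpu" if no GPU is available.
llm_lingua = PromptCompressor()

def compress(prompt: str, target_token: float) -> str:
    result = llm_lingua.compress_prompt(prompt, target_token=int(target_token))
    return result["compressed_prompt"]

demo = gr.Interface(
    fn=compress,
    inputs=[
        gr.Textbox(lines=10, label="Prompt"),
        gr.Number(value=200, label="Target tokens"),
    ],
    outputs=gr.Textbox(label="Compressed prompt"),
)

# server_name="0.0.0.0" exposes the app on localhost and the local network.
demo.launch(server_name="0.0.0.0", server_port=7860)
```

Running the script and opening http://localhost:7860 in a browser then gives a simple compression UI.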
How do you use the GGUF format instead of GPTQ? Can you use LM Studio to host it? It would be great to run inference split across CPU and GPU.
Also, how do you get it to work with an AI API endpoint? I keep getting this error:
compressed_prompt = llm_lingua.compress_prompt(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Lib\site-packages\llmlingua\prompt_compressor.py", line 143, in compress_prompt
context_tokens_length = [self.get_token_length(c) for c in context]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Lib\site-packages\llmlingua\prompt_compressor.py", line 143, in <listcomp>
context_tokens_length = [self.get_token_length(c) for c in context]
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Lib\site-packages\llmlingua\prompt_compressor.py", line 254, in get_token_length
self.tokenizer(text, add_special_tokens=add_special_tokens).input_ids
^^^^^^^^^^^^^^
AttributeError: 'OpenRouterPromptCompressor' object has no attribute 'tokenizer'
You can look at the code I tried to use in my GitHub repository...
Hi @JiHa-Kim, thank you for your help and efforts.
I haven't tried using GGUF with LLMLingua yet, but I believe there shouldn't be any major blocking issues. Also, a special thanks to @TechnotechGit, who is currently assisting in making llama.cpp compatible with LLMLingua. I'm confident this will facilitate support for models in GGUF format.
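For reference, loading a GGUF model with llama-cpp-python and splitting it between CPU and GPU looks roughly like the sketch below; the model path and layer count are placeholders, and hooking this into LLMLingua still depends on the compatibility work mentioned above.

```python
# Sketch: loading a GGUF model with llama-cpp-python, offloading part of it to GPU.
# The path and n_gpu_layers value are placeholders; logits_all=True keeps per-token
# logits, which perplexity-based compression would need.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b.Q4_K_M.gguf",  # placeholder GGUF file
    n_gpu_layers=20,   # layers offloaded to GPU; the rest run on CPU
    n_ctx=4096,        # context window
    logits_all=True,
)

output = llm("Compress this prompt example.", max_tokens=16)
print(output["choices"][0]["text"])
```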
Regarding the second issue, it seems to stem from the lack of a defined tokenizer in OpenRouterPromptCompressor. You might try initializing a tokenizer using tiktoken. However, I suspect there might be some additional errors to address later on.
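A minimal sketch of that idea, assuming an OpenRouterPromptCompressor class like the one in your traceback (only the tokenizer-related pieces are shown, and the cl100k_base encoding is an assumption that may not match the remote model):

```python
# Sketch: giving a custom compressor class a tokenizer based on tiktoken.
# Only the tokenizer-related pieces are shown; "cl100k_base" is an assumed
# encoding and may not match the model served by the API endpoint.
import tiktoken

class OpenRouterPromptCompressor:
    def __init__(self):
        self.tokenizer = tiktoken.get_encoding("cl100k_base")

    def get_token_length(self, text: str, add_special_tokens: bool = True) -> int:
        # tiktoken has no notion of special tokens here, so the flag is ignored.
        return len(self.tokenizer.encode(text))
```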
Thanks, I am excited to get this working properly.
Well, it seems like I managed to get the model loaded using llama-cpp-python with the new code in my repository, but now I hit this error and I am stuck.
Traceback (most recent call last):
File "C:\Users\Public\Coding\LLMLingua\LLMLingua_test1.py", line 48, in <module>
compressed_prompt = llm_lingua.compress_prompt(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Lib\site-packages\llmlingua\local_prompt_compressor.py", line 230, in compress_prompt
context = self.iterative_compress_prompt(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Lib\site-packages\llmlingua\local_prompt_compressor.py", line 712, in iterative_compress_prompt
loss, past_key_values = self.get_ppl(
^^^^^^^^^^^^^
File "C:\Program Files\Lib\site-packages\llmlingua\local_prompt_compressor.py", line 83, in get_ppl
response = self.model(
^^^^^^^^^^^
TypeError: Llama.__call__() got an unexpected keyword argument 'attention_mask'
Hi @JiHa-Kim, currently, calling the llama.cpp model may not be supported, or it might require modifying the model call in PromptCompressor.
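Purely as an illustration of that last point, and not a complete fix: one could wrap the llama-cpp-python model in a thin adapter whose __call__ drops the Transformers-style keyword arguments that Llama.__call__ rejects. The class name below is made up, and get_ppl would still need further changes, since llama-cpp-python does not return Hugging Face-style logits or past_key_values.

```python
# Illustrative sketch only: an adapter whose __call__ discards keyword arguments
# that llama-cpp-python's Llama.__call__ rejects (e.g. attention_mask). This does
# NOT make get_ppl work as-is, because it also expects HF-style logits and
# past_key_values in the returned object.
from llama_cpp import Llama

class LlamaCallAdapter:
    def __init__(self, model_path: str, **llama_kwargs):
        self.model = Llama(model_path=model_path, logits_all=True, **llama_kwargs)

    def __call__(self, prompt, **kwargs):
        # Drop Transformers-only arguments before delegating to llama.cpp.
        for unsupported in ("attention_mask", "past_key_values", "use_cache"):
            kwargs.pop(unsupported, None)
        return self.model(prompt, **kwargs)
```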