
How to set up LLMLingua with localhost?

JiHa-Kim opened this issue 1 year ago · 6 comments

Hello, how do I set up LLMLingua with a self-hosted localhost server? Is there a tutorial? Thanks.

JiHa-Kim · Jan 10 '24 02:01

Hi @JiHa-Kim,

Thank you for your support. I suggest referring to the code of the Hugging Face Space demo as a starting point; you can then build a self-hosted local server using Gradio.
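
A minimal sketch of such a server might look like the following; the slider range, port, and compress_prompt arguments here are placeholders rather than the demo's exact code:

```python
# Minimal self-hosted Gradio sketch (assumed layout, not the HF Space demo's code)
import gradio as gr
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()  # uses the default Hugging Face model

def compress(prompt, target_token):
    result = llm_lingua.compress_prompt(prompt, target_token=int(target_token))
    return result["compressed_prompt"]

demo = gr.Interface(
    fn=compress,
    inputs=[
        gr.Textbox(lines=10, label="Prompt"),
        gr.Slider(50, 1000, value=200, step=50, label="Target tokens"),
    ],
    outputs=gr.Textbox(label="Compressed prompt"),
)

demo.launch(server_name="127.0.0.1", server_port=7860)  # serve on localhost
```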

iofu728 · Jan 11 '24 06:01

How do you use the GGUF format instead of GPTQ? Can you host it with LM Studio? It would be great to be able to run inference split across CPU and GPU.

Also, how do you get it to work with an AI API endpoint? I keep getting the error:

    compressed_prompt = llm_lingua.compress_prompt(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\llmlingua\prompt_compressor.py", line 143, in compress_prompt
    context_tokens_length = [self.get_token_length(c) for c in context]
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\llmlingua\prompt_compressor.py", line 143, in <listcomp>
    context_tokens_length = [self.get_token_length(c) for c in context]
                             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\llmlingua\prompt_compressor.py", line 254, in get_token_length
    self.tokenizer(text, add_special_tokens=add_special_tokens).input_ids
    ^^^^^^^^^^^^^^
AttributeError: 'OpenRouterPromptCompressor' object has no attribute 'tokenizer'

You can look at the code I tried to use in my GitHub repository...

JiHa-Kim · Jan 11 '24 21:01

Hi @JiHa-Kim, thank you for your help and efforts.

I haven't tried using GGUF with LLMLingua yet, but I believe there shouldn't be any major blocking issues. Also, special thanks to @TechnotechGit, who is currently helping make llama.cpp compatible with LLMLingua; I'm confident this will enable support for models in GGUF format.
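
I haven't tested this with LLMLingua, but for reference, loading a GGUF model via llama-cpp-python usually looks like the sketch below; the path is a placeholder, and n_gpu_layers is what splits inference between CPU and GPU:

```python
# Typical llama-cpp-python loading pattern (untested with LLMLingua)
from llama_cpp import Llama

model = Llama(
    model_path="path/to/model.gguf",  # placeholder path
    n_gpu_layers=20,  # offload this many layers to the GPU; the rest run on CPU
    logits_all=True,  # compute logits for every token (needed for perplexity)
)
```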

Regarding the second issue, it seems to stem from the lack of a defined tokenizer in OpenRouterPromptCompressor. You might try initializing a tokenizer using tiktoken. However, I suspect there may be additional errors to address later on.
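
For example, a rough, untested shim like this could satisfy the interface that get_token_length expects (the encoding name is a guess, and OpenRouterPromptCompressor is from your repository, not this library):

```python
# Untested shim: expose tiktoken through the interface get_token_length() uses,
# i.e. a callable whose return value has an .input_ids attribute.
from types import SimpleNamespace

import tiktoken

class TiktokenTokenizer:
    def __init__(self, encoding_name="cl100k_base"):  # encoding is a guess
        self.encoding = tiktoken.get_encoding(encoding_name)

    def __call__(self, text, add_special_tokens=True):
        # llmlingua only reads .input_ids from the returned object here;
        # tiktoken has no special tokens to add, so the flag is ignored
        return SimpleNamespace(input_ids=self.encoding.encode(text))

# then, e.g., in OpenRouterPromptCompressor.__init__:
#     self.tokenizer = TiktokenTokenizer()
```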

iofu728 · Jan 12 '24 09:01

Thanks, I am excited to get this working properly.

JiHa-Kim · Jan 12 '24 11:01

Well, it seems I managed to get the model loaded using llama-cpp-python with the new code in my repository, but now I hit this error and I am stuck.

Traceback (most recent call last):
  File "C:\Users\Public\Coding\LLMLingua\LLMLingua_test1.py", line 48, in <module>
    compressed_prompt = llm_lingua.compress_prompt(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\llmlingua\local_prompt_compressor.py", line 230, in compress_prompt
    context = self.iterative_compress_prompt(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\llmlingua\local_prompt_compressor.py", line 712, in iterative_compress_prompt
    loss, past_key_values = self.get_ppl(
                            ^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\llmlingua\local_prompt_compressor.py", line 83, in get_ppl
    response = self.model(
               ^^^^^^^^^^^
TypeError: Llama.__call__() got an unexpected keyword argument 'attention_mask'

JiHa-Kim · Jan 13 '24 17:01

Hi @JiHa-Kim, calling a llama.cpp model may not be supported at the moment: get_ppl passes transformers-style keyword arguments like attention_mask that Llama.__call__() does not accept, so it would likely require modifying how PromptCompressor calls the model.
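
As a rough, untested illustration (not an official workaround), you could wrap the llama-cpp-python model so the unsupported keyword arguments are dropped before they reach Llama.__call__():

```python
# Untested sketch: strip transformers-style kwargs that llama.cpp rejects
class LlamaCppCallWrapper:
    def __init__(self, llama_model):
        self.llama_model = llama_model

    def __call__(self, *args, **kwargs):
        kwargs.pop("attention_mask", None)   # not accepted by Llama.__call__()
        kwargs.pop("past_key_values", None)  # HF-style KV cache, also unsupported
        return self.llama_model(*args, **kwargs)
```

Note that dropping the arguments alone is probably not enough, since get_ppl also expects transformers-style outputs (logits and past_key_values), which llama.cpp does not return in that shape; proper support is what the llama.cpp compatibility work mentioned above aims to provide.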

iofu728 · Jan 15 '24 06:01