
[Question]: LLMLingua requires too much GPU memory and takes a long time to compress long text (e.g., 16k tokens). How can it run alongside the LLM?

Open dingjingzhen opened this issue 1 year ago • 2 comments

Describe the bug

LLMLingua requires too much GPU memory, and it takes a long time to compress long text, such as 16k tokens. How can I make it and the LLM work at the same time?

Steps to reproduce

No response

Expected Behavior

No response

Logs

No response

Additional Information

No response

dingjingzhen avatar May 09 '24 03:05 dingjingzhen

Hi @dingjingzhen, thanks for supporting LLMLingua. Could you provide more details about how you are using it and your environment?

The LLMLingua series relies on a smaller model, such as BERT-level or llama-7b, to act as a compressor, which offers low overhead compared to larger models like GPT-4. To achieve low latency, it is recommended to use it on a GPU similar to the V100.

iofu728 avatar May 10 '24 08:05 iofu728


Since my requirement is to summarize a fixed text, I can compress it offline in advance. In that case, which model gives the best compression quality, without considering latency and GPU memory? I am using Qwen1.5 32B with a 16K context.

dingjingzhen avatar May 14 '24 03:05 dingjingzhen
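The offline approach described above can be sketched as a content-addressed cache: compress each fixed document once ahead of time, then reuse the result at serving time so the compressor never competes with the LLM for GPU memory. The sketch below is a minimal illustration, not LLMLingua's own API; the real compressor call is shown only in comments (assuming the `llmlingua` package's `PromptCompressor` with the LLMLingua-2 BERT-level model), and a dummy `compress` function stands in for it so the caching logic is runnable on its own.

```python
import hashlib
import json
from pathlib import Path

# Stand-in for the real compressor. With LLMLingua installed, you would
# replace this with something like (assumed usage, check the repo README):
#   from llmlingua import PromptCompressor
#   compressor = PromptCompressor(
#       model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
#       use_llmlingua2=True,  # BERT-level compressor, far lighter than llama-7b
#   )
#   def compress(text):
#       return compressor.compress_prompt(text, rate=0.33)["compressed_prompt"]
def compress(text: str) -> str:
    # Dummy compressor: keeps every third word, just to exercise the cache.
    return " ".join(text.split()[::3])

CACHE_DIR = Path("compressed_cache")

def compress_offline(text: str) -> str:
    """Compress once, reuse afterwards: results are cached on disk keyed by
    a hash of the source text, so the expensive compressor never runs twice
    on the same fixed document."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())["compressed"]
    compressed = compress(text)
    cache_file.write_text(json.dumps({"compressed": compressed}))
    return compressed

doc = "some long fixed document that will be summarized repeatedly " * 50
first = compress_offline(doc)   # runs the compressor, writes the cache
second = compress_offline(doc)  # served from the disk cache
assert first == second
```

Because the cache key is the document's content hash, changing the source text automatically triggers a fresh compression, while repeated requests for the same fixed text skip the compressor entirely.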