[Bug]: Compression truncates words and sentences
Describe the bug
I used the code from the README and from the notebook; see the code below.
Steps to reproduce
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Neo4jVector
from langchain_text_splitters import RecursiveCharacterTextSplitter
documents = TextLoader(
    "./docs/long_legal_text.txt",
    encoding="utf-8",
).load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
# embeddings, url, username, and password (embedding model and Neo4j credentials)
# are defined earlier in my script.
hybrid_db = Neo4jVector.from_documents(
    texts,
    embeddings,
    url=url,
    username=username,
    password=password,
    search_type="hybrid",
    pre_delete_collection=True,
    index_name="index_name_llm_lingua",
    keyword_index_name="keyword_name_llm_lingua",
)
retriever = hybrid_db.as_retriever(search_kwargs={'k': 8})
query = "une société minière CHINAHCC qui avait obtenu le statut A en 2020 et ayant realisé un benefice net de 1.5 million d'euros en 2023, souhaite savoir combien d'impôts elle va payer en 2023 ?"
docs = retriever.get_relevant_documents(query)
pretty_print_docs(docs)  # I get a list of docs with the right answer (it's tricky, though, because there are other tax rates that do not apply to the company's situation)
## Compression code starts here...
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.document_compressors import LLMLinguaCompressor
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(temperature=0)
compressor = LLMLinguaCompressor(model_name="openai-community/gpt2", device_map="cpu")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.get_relevant_documents(query)
pretty_print_docs(compressed_docs)  # I get weird characters and even truncated words/sentences...
For example, I get this:
-20%, tit�erc ouvert à comp duvier 20.- taux spc de 15%é aux soci installesélérationrielle » et à celles ayant le stat A » esté comme :-,25 tit�ercuvert à janvier 2023 ; 17,50%, au titre de l�exercice ouvert à du 2024 ; -,%, aure de l’ice ou àtervier 2025#2>
Expected Behavior
My original Neo4j retriever does return the data in UTF-8 (which matters, since I am working with French text), but after compression it's a mess, unfortunately...
For example, after compression I get "à comp duvier 20." (meaningless), which originally reads "à compter du 1 janvier 2024".
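To illustrate the difference, here is a minimal check I could run on the compressed output (a sketch only, reusing the compression_retriever and query from the reproduction above; it simply flags documents containing the U+FFFD replacement character that shows up as "�" in the output):
# Sketch: flag compressed documents containing the Unicode replacement character.
compressed_docs = compression_retriever.get_relevant_documents(query)
for i, doc in enumerate(compressed_docs):
    if "\ufffd" in doc.page_content:
        print(f"Doc {i} contains replacement characters:")
        print(doc.page_content)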
Logs
No response
Additional Information
LLMLingua Version: 0.1.6
Operating System: WSL2 (in Docker)
Python Version:
Hi @younes-io, thank you for your support and the detailed issue information.
Although prompts compressed by LLMLingua may contain garbled text and be hard for humans to read, I acknowledge that "à comp duvier 20." has indeed lost crucial information.
However, I suspect this is due to the weaker semantic capability of GPT-2. You might consider using LLaMA or another SLM as the compressor, e.g. compressor = LLMLinguaCompressor(model_name="NousResearch/Llama-2-7b-hf", device_map="cpu").
@iofu728: Thanks for the feedback. I tried "NousResearch/Llama-2-7b-hf", but it's too heavy for my testing purposes; I would have to allocate more resources. Is there anything else (more lightweight) I could use?
Hi @younes-io, maybe you can try the "microsoft/phi-2" model.
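For example (a sketch based on the snippet from the reproduction above; whether phi-2 is light enough on CPU for your setup is an assumption to verify):
# Same pipeline as in the reproduction, swapping in the lighter phi-2 compressor
# suggested above (sketch only; CPU memory usage is an assumption to verify).
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.document_compressors import LLMLinguaCompressor

compressor = LLMLinguaCompressor(model_name="microsoft/phi-2", device_map="cpu")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.get_relevant_documents(query)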