BooleanOutputParser error when a non-English prompt is used
The gigachain legacy API includes LLMChainFilter, which can be used as an additional filter for chunks after they have been retrieved from the vector store.
In the original version (langchain), LLMChainFilter uses an English prompt, including the possible output values (YES/NO), but gigachain uses a Russian prompt with different output values (ДА/НЕТ).
The chain output is parsed with BooleanOutputParser, which has YES and NO as predefined values (defined as class variables).
So if we use this filter as-is in a gigachain chain, the Russian answer matches neither predefined value and parsing fails with an error such as:
ValueError: BooleanOutputParser expected output value to include either YES or NO. Received ДА.
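The mismatch can be shown without calling the model. The sketch below is a simplified stand-in for the parsing behaviour described above, not langchain's or gigachain's actual implementation: SimpleBooleanParser, its fields, and its matching rule are hypothetical and exist only to illustrate why a ДА answer is rejected when the expected values are YES/NO.

```python
import re
from dataclasses import dataclass


@dataclass
class SimpleBooleanParser:
    """Hypothetical stand-in for a YES/NO output parser (not langchain code)."""

    true_val: str = "YES"
    false_val: str = "NO"

    def parse(self, text: str) -> bool:
        # Find whole-word, case-insensitive occurrences of either expected value.
        pattern = rf"\b({re.escape(self.true_val)}|{re.escape(self.false_val)})\b"
        found = {m.upper() for m in re.findall(pattern, text, flags=re.IGNORECASE)}
        if found == {self.true_val.upper()}:
            return True
        if found == {self.false_val.upper()}:
            return False
        # Neither value (or both values) found: the answer cannot be interpreted.
        raise ValueError(
            f"expected output value to include either {self.true_val} "
            f"or {self.false_val}. Received {text}."
        )


default_parser = SimpleBooleanParser()
print(default_parser.parse("YES"))  # an English answer is accepted

try:
    default_parser.parse("ДА.")  # a Russian answer matches neither YES nor NO
except ValueError as err:
    print(f"ValueError: {err}")

# Overriding the expected values lets the same logic accept the Russian answer.
russian_parser = SimpleBooleanParser(true_val="ДА", false_val="НЕТ")
print(russian_parser.parse("ДА."))
```

If the real BooleanOutputParser lets its true and false values be overridden in the same way, aligning them with the Russian prompt (ДА/НЕТ) might be one way to resolve the issue. That is an assumption to check against the gigachain source.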
The code to reproduce the issue:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainFilter
from langchain_community.chat_models.gigachat import GigaChat
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings.gigachat import GigaChatEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import CharacterTextSplitter

from gigachat_issues.config.settings import DefaultSettings

settings = DefaultSettings()
DATA_PATH = settings.project_path / "gigachat_issues/boolean_parser_prompt/data"


def pretty_print_docs(docs):
    # Print each document's content, separated by a dashed rule.
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )


# Load the source text and split it into chunks.
documents = TextLoader(DATA_PATH / "state_of_the_union.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Embed the chunks and expose the FAISS index as a retriever.
embeddings = GigaChatEmbeddings(
    base_url=settings.gigachat_api_base_url,
    credentials=settings.gigachat_api_credentials,
    scope=settings.gigachat_api_scope,
    verify_ssl_certs=False,
    one_by_one_mode=True,
)
retriever = FAISS.from_documents(texts, embeddings).as_retriever()

model = GigaChat(
    base_url=settings.gigachat_api_base_url,
    model="GigaChat-Pro",
    credentials=settings.gigachat_api_credentials,
    scope=settings.gigachat_api_scope,
    temperature=1e-15,
    profanity_check=False,
    verbose=False,
    timeout=600,
    verify_ssl_certs=False,
    streaming=True,
)

# Build the filter with its default prompt and output parser (used as-is).
filter_chain = LLMChainFilter.from_llm(model)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=filter_chain, base_retriever=retriever
)

# The filter runs on the retrieved chunks here; this is the call that fails.
compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Jackson Brown"
)
pretty_print_docs(compressed_docs)