
BooleanOutputParser error when a non-English prompt is used


There is an LLMChainFilter in the gigachain legacy API which can be used as an additional filter for chunks after they are retrieved from the vectorstore.

In the original version (langchain) LLMChainFilter uses an English prompt, including the possible output values (YES/NO), but gigachain uses a Russian prompt with different output values (ДА/НЕТ).

The chain output is parsed with BooleanOutputParser, which has YES and NO as predefined values (defined as class variables).

So if we try to use this filter as-is in a gigachain chain, we get an error like the following: ValueError: BooleanOutputParser expected output value to include either YES or NO. Received ДА.
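
For reference, a minimal sketch of the failure with the stock BooleanOutputParser from langchain (true_val/false_val are the class-level defaults mentioned above):

from langchain.output_parsers.boolean import BooleanOutputParser

parser = BooleanOutputParser()  # class defaults: true_val="YES", false_val="NO"
parser.parse("YES")             # parses fine -> True

try:
    parser.parse("ДА")          # the Russian answer does not match YES/NO
except ValueError as e:
    print(e)                    # ... expected output value to include either YES or NO. Received ДА.

Since true_val and false_val are ordinary fields, they can also be overridden when the parser is constructed, which is what the workaround sketch after the reproduction code below relies on.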

The code to reproduce the issue:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainFilter
from langchain_community.chat_models.gigachat import GigaChat
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings.gigachat import GigaChatEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import CharacterTextSplitter

from gigachat_issues.config.settings import DefaultSettings

settings = DefaultSettings()
DATA_PATH = settings.project_path / "gigachat_issues/boolean_parser_prompt/data"


def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )


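# Load the sample document and split it into chunks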
documents = TextLoader(DATA_PATH / "state_of_the_union.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

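# Build a FAISS retriever over GigaChat embeddings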
embeddings = GigaChatEmbeddings(
    base_url=settings.gigachat_api_base_url,
    credentials=settings.gigachat_api_credentials,
    scope=settings.gigachat_api_scope,
    verify_ssl_certs=False,
    one_by_one_mode=True,
)
retriever = FAISS.from_documents(texts, embeddings).as_retriever()

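# GigaChat model that will judge whether each retrieved chunk is relevant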
model = GigaChat(
    base_url=settings.gigachat_api_base_url,
    model="GigaChat-Pro",
    credentials=settings.gigachat_api_credentials,
    scope=settings.gigachat_api_scope,
    temperature=1e-15,
    profanity_check=False,
    verbose=False,
    timeout=600,
    verify_ssl_certs=False,
    streaming=True,
)
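# LLMChainFilter uses BooleanOutputParser (YES/NO) to parse the model's answer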
filter_chain = LLMChainFilter.from_llm(model)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=filter_chain, base_retriever=retriever
)

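# Fails with ValueError because the model answers ДА/НЕТ instead of YES/NO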
compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Jackson Brown"
)
pretty_print_docs(compressed_docs)
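
A possible workaround (a sketch only; I have not verified how gigachain wires the Russian prompt internally) is to pass a custom prompt to LLMChainFilter.from_llm whose output_parser expects the Russian values. The prompt wording below is a hypothetical illustration, and the snippet reuses model and the imports from the reproduction script above:

from langchain.output_parsers.boolean import BooleanOutputParser
from langchain_core.prompts import PromptTemplate

# Hypothetical prompt that asks for ДА/НЕТ and carries a parser expecting the same values
ru_prompt = PromptTemplate(
    template=(
        "Given the following question and context, decide whether the context "
        "is relevant to the question. Answer strictly ДА or НЕТ.\n\n"
        "Question: {question}\n\nContext: {context}"
    ),
    input_variables=["question", "context"],
    output_parser=BooleanOutputParser(true_val="ДА", false_val="НЕТ"),
)

filter_chain = LLMChainFilter.from_llm(model, prompt=ru_prompt)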
