
[Feature] add structured output to openai

Open bfdykstra opened this issue 1 year ago • 0 comments

Description

  • Adds a StructuredOutputChatOpenAI class that enables downstream applications to consume structured JSON output
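For context, OpenAI's structured-output feature expects the schema as a `json_schema` entry under `response_format`. Below is a minimal sketch of that payload shape for the `StructuredAnswer` schema used later in this PR; the field names follow the public OpenAI API, while how StructuredOutputChatOpenAI builds this internally is an implementation detail of the class:

```python
# Sketch of the response_format payload that OpenAI's structured
# output API expects; StructuredAnswer's fields become JSON Schema.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "StructuredAnswer",
        "schema": {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
            "additionalProperties": False,
        },
        "strict": True,
    },
}
```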

Simple example usage

import json
import os

from pydantic import BaseModel

from kotaemon.llms import StructuredOutputChatOpenAI

class StructuredAnswer(BaseModel):
    answer: str

structured_llm = StructuredOutputChatOpenAI(
    base_url='https://api.openai.com/v1',
    model='gpt-4o-mini',
    temperature=1,
    api_key=os.environ.get('OPENAI_API_KEY'),
    response_schema=StructuredAnswer,
)

# ainvoke must be awaited from within an async context (e.g. asyncio.run)
answer = await structured_llm.ainvoke('Hello how are you?')

print(json.loads(answer.content))
# -> {'answer': "I'm just a computer program, but I'm here and ready to help you! How can I assist you today?"}
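Since the schema is known, the decoded payload can also be checked field by field rather than trusting `json.loads` blindly. A minimal stdlib-only sketch, with StructuredAnswer mirrored as a dataclass and the response content hard-coded for illustration:

```python
import json
from dataclasses import dataclass

@dataclass
class StructuredAnswer:
    answer: str

# sample content, standing in for answer.content from the call above
content = '{"answer": "I\'m here and ready to help!"}'

# dataclass construction raises TypeError on missing or unexpected keys,
# so a schema drift is caught immediately instead of downstream
parsed = StructuredAnswer(**json.loads(content))
print(parsed.answer)
```

The same check is available in pydantic via `StructuredAnswer.model_validate_json(answer.content)`, which additionally validates field types.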

Example usage in a retrieval pipeline

import json
import os

from pydantic import BaseModel

from kotaemon.storages.docstores import LanceDBDocumentStore
from kotaemon.storages.vectorstores import ChromaVectorStore
from kotaemon.embeddings.openai import OpenAIEmbeddings
from kotaemon.indices.qa.format_context import PrepareEvidencePipeline
from kotaemon.indices.qa.citation_qa import AnswerWithContextPipeline
from kotaemon.indices.rankings import LLMTrulensScoring
from kotaemon.llms.chats.openai import StructuredOutputChatOpenAI, ChatOpenAI

from ktem.ktem.index.file.pipelines import DocumentRetrievalPipeline
from ktem.ktem.reasoning.simple import FullQAPipeline

app_dir = "<path to your app data>/kotaemon/ktem_app_data/"
user_data_dir = app_dir + "user_data/"

# document store
doc_store_dir = user_data_dir + "docstore/"
doc_store = LanceDBDocumentStore(path=doc_store_dir, collection_name="index_1")

# vector store
vector_store_dir = user_data_dir + "vectorstore"
vector_store = ChromaVectorStore(path=vector_store_dir, collection_name="index_1")

llm = ChatOpenAI(
    base_url='https://api.openai.com/v1',
    model='gpt-4o-mini',
    temperature=0,
    api_key=os.environ.get('OPENAI_API_KEY'),
)
llm_scorer = LLMTrulensScoring(llm=llm)

# embeddings
embedding = OpenAIEmbeddings(
    base_url='https://api.openai.com/v1',
    model='text-embedding-ada-002',
    api_key=os.environ.get('OPENAI_API_KEY'),
    context_length=8191,
)


# document retrieval pipeline
document_retrieval = DocumentRetrievalPipeline(
    embedding=embedding,
    retrieval_mode='vector',  # can be 'vector' or 'text'
    vector_store=vector_store,
    doc_store=doc_store,
    top_k=5,
    rerankers=[],  # optional rerankers, e.g. [cohere_reranking]
    llm_scorer=llm_scorer,
)

# pipeline that formats retrieved content
evidence_pipeline = PrepareEvidencePipeline()

class StructuredAnswer(BaseModel):
    answer: str

structured_llm = StructuredOutputChatOpenAI(
    base_url='https://api.openai.com/v1',
    model='gpt-4o-mini',
    temperature=1,
    api_key=os.environ.get('OPENAI_API_KEY'),
    response_schema=StructuredAnswer,
)

# answer questions with provided evidence
answer_pipeline = AnswerWithContextPipeline(
    llm=structured_llm,
    qa_template=(
        "Context: \n{context}\n\n"
        "{question}\n"
    ),
)

qa_pipeline = FullQAPipeline(
    retrievers=[document_retrieval],
    evidence_pipeline=evidence_pipeline,
    answering_pipeline=answer_pipeline
)

prompt = 'This is a prompt'

# retrieve relevant documents and generate an answer
answer, scored_docs = qa_pipeline.invoke(prompt, document_ids=[])

parsed_answer = json.loads(answer.content)
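Even with a response schema in place, it can be worth guarding the final `json.loads`, since a failed or truncated generation would otherwise raise deep inside the pipeline. A minimal sketch; the helper name `parse_structured_answer` is illustrative, not part of kotaemon:

```python
import json

def parse_structured_answer(content: str) -> dict:
    """Decode the model's JSON content, falling back to a raw-text wrapper."""
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        # surface the raw text instead of crashing the pipeline
        return {"answer": content, "parse_error": True}

print(parse_structured_answer('{"answer": "ok"}'))  # -> {'answer': 'ok'}
print(parse_structured_answer('not json'))
```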

Type of change

  • [x] New features (non-breaking change).
  • [ ] Bug fix (non-breaking change).
  • [ ] Breaking change (fix or feature that would cause existing functionality not to work as expected).

Checklist

  • [x] I have performed a self-review of my code.
  • [ ] I have added thorough tests if it is a core feature.
  • [x] There is a reference to the original bug report and related work.
  • [x] I have commented on my code, particularly in hard-to-understand areas.
  • [x] The feature is well documented.

bfdykstra · Jan 06 '25 18:01