PDFImageReader not working with PDFKnowledgeBase

Open sridharaiyer opened this issue 9 months ago • 6 comments

Running the following code gives this error:

from phi.assistant import Assistant
from phi.document.reader.pdf import PDFImageReader
from phi.knowledge.pdf import PDFKnowledgeBase
from phi.vectordb.lancedb.lancedb import LanceDb

# type: ignore
db_url = "/tmp/lancedb"  # Optional

# Create a knowledge base with the PDFs from the data/pdfs directory
knowledge_base = PDFKnowledgeBase(
    path="data/pdfs",
    vector_db=LanceDb(uri=db_url),
    reader=PDFImageReader(chunk=True),
)
# Load the knowledge base
knowledge_base.load(recreate=False)

# Create an assistant with the knowledge base
assistant = Assistant(
    knowledge_base=knowledge_base,
    add_references_to_prompt=True,
)

# Ask the assistant about the knowledge base
assistant.print_response("Summarize this document.", markdown=True)

Error:

INFO     Creating table: phi                                                    
Traceback (most recent call last):
  File "/Users/siyer/PycharmProjects/report-call-summarizer/localdb-lancedb-knowledgebase.py", line 10, in <module>
    knowledge_base = PDFKnowledgeBase(
                     ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/report-call-summarizer/lib/python3.11/site-packages/pydantic/main.py", line 164, in __init__
    __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)
pydantic_core._pydantic_core.ValidationError: 1 validation error for PDFKnowledgeBase
reader
  Input should be a valid dictionary or instance of PDFReader [type=model_type, input_value=PDFImageReader(chunk=True...\n\r', '\t', ' ', '  ']), input_type=PDFImageReader]
    For further information visit https://errors.pydantic.dev/2.5/v/model_type

Process finished with exit code 1

sridharaiyer avatar May 17 '24 03:05 sridharaiyer

This looks like a one-line change in pdf.py; reader should be of type Reader.

https://github.com/phidatahq/phidata/blob/9b2653f2c5ff77c4babf44ac324f5568ee69f856/phi/knowledge/pdf.py#L11

class PDFKnowledgeBase(AssistantKnowledge):
    path: Union[str, Path]
    reader: Reader = PDFReader()
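
To illustrate why widening the annotation helps, here is a minimal stdlib-only sketch. It assumes, as the traceback suggests, that PDFImageReader and PDFReader are siblings under a common Reader base class, and it uses a plain isinstance check as a simplified stand-in for the instance check pydantic runs on a model-typed field. The classes and the accepts helper below are hypothetical stand-ins, not phidata's actual code.

```python
class Reader:
    """Hypothetical stand-in for the common reader base class."""


class PDFReader(Reader):
    """Hypothetical stand-in for the text-based PDF reader."""


class PDFImageReader(Reader):
    """Hypothetical stand-in for the image-aware PDF reader.

    Assumed to be a sibling of PDFReader, not a subclass of it.
    """


def accepts(field_type: type, value: object) -> bool:
    # Simplified stand-in for a typed field's instance check:
    # the value passes only if it is an instance of the annotated type.
    return isinstance(value, field_type)


# Field annotated as PDFReader: the sibling class is rejected.
print(accepts(PDFReader, PDFImageReader()))

# Field annotated as the base Reader: both readers are accepted.
print(accepts(Reader, PDFImageReader()))
print(accepts(Reader, PDFReader()))
```

Under these assumptions, annotating the field with the base class keeps PDFReader as a valid default while letting any other Reader subclass pass validation.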

jalotra avatar May 17 '24 18:05 jalotra

cool

datumradix avatar May 18 '24 08:05 datumradix

Hi Team. Do we have an update on when we can get a new release with this change?

sridharaiyer avatar May 20 '24 14:05 sridharaiyer

@sridharaiyer The PR will be out shortly, and we will most likely release a new version by EOD.

ysolanky avatar May 20 '24 14:05 ysolanky

The PR is out @sridharaiyer. You are welcome to test

ysolanky avatar May 20 '24 17:05 ysolanky

Tested. Works fine for my use case, thanks a lot!

sridharaiyer avatar May 20 '24 20:05 sridharaiyer

Merged

jacobweiss2305 avatar May 21 '24 12:05 jacobweiss2305