When parsing tables in a PDF, cell text that wraps across lines is extracted with spurious spaces, which can prevent a RAG application from retrieving the relevant content when answering questions.

Open tahitimoon opened this issue 1 year ago • 0 comments

Bug

Steps to reproduce

Original PDF content

Extracted content parsed through Docling

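持股比 例 (%)

Expected: 持股比例 (%) ("Shareholding ratio (%)"). A space is inserted between 比 and 例, where the text wraps to a new line inside the table cell.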

Parse code

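from llama_index.node_parser.docling import DoclingNodeParser
from llama_index.readers.docling import DoclingReader

pdf_path = "disu.pdf"  # assumed local path to the PDF attached below

# Export as Docling JSON so DoclingNodeParser can chunk the document
reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
docs = reader.load_data(pdf_path)
node_parser = DoclingNodeParser()
nodes = node_parser.get_nodes_from_documents(docs)
for node in nodes:
    print(f"metadata:\n{node.metadata}")
    print(f"text:\n{node.text}")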

Asking questions in RAG applications

持股比例 ("shareholding ratio")
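
A possible workaround while this is open, as a minimal sketch only: post-process the node text and drop whitespace that sits between two CJK characters before indexing. The helper name `strip_cjk_gaps` is hypothetical and not part of Docling or LlamaIndex, and the character range covers only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). It shows why the query does not match the extracted text and how the normalized text does.

```python
import re

# One CJK ideograph from the basic block (an assumption; extend the range if needed).
_CJK = r"[\u4e00-\u9fff]"
# Whitespace that has a CJK ideograph on both sides.
_CJK_GAP = re.compile(rf"(?<={_CJK})\s+(?={_CJK})")


def strip_cjk_gaps(text: str) -> str:
    """Remove whitespace between two CJK ideographs; leave other spacing intact."""
    return _CJK_GAP.sub("", text)


extracted = "持股比 例 (%)"  # text as extracted from the table cell
query = "持股比例"           # the question asked in the RAG application

# The raw extracted text does not contain the query as a substring.
assert query not in extracted

cleaned = strip_cjk_gaps(extracted)
# After normalization the query matches; the space before "(" is untouched.
assert query in cleaned
```

This could be applied to each `node.text` before the nodes are embedded. Note that it also removes any intentional space between two CJK characters, so it is a stopgap rather than a fix for the extraction itself.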

Docling version

2.8.3

Python version

3.12

PDF

disu.pdf

Dec 06 '24 15:12 tahitimoon