
When parsing tables in a PDF, if the text content contains line breaks, the extracted content will include spaces, which may cause the RAG application to fail in retrieving relevant content when answering questions.

Open tahitimoon opened this issue 1 year ago • 0 comments

Bug

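When a table cell in a PDF contains text that wraps across lines, Docling inserts a space at each line break in the extracted text. For text written without word spacing, such as Chinese, this splits a single term in two (持股比例 becomes 持股比 例), so a RAG application may fail to retrieve the relevant content when asked about that term.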

Steps to reproduce

Original PDF content image

Extracted content parsed through Docling

持股比 例 (%)

Parse code

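from llama_index.node_parser.docling import DoclingNodeParser
from llama_index.readers.docling import DoclingReader

pdf_path = "disu.pdf"  # the PDF attached to this issue

# Export as Docling JSON so that DoclingNodeParser can chunk the document.
reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
docs = reader.load_data(pdf_path)
node_parser = DoclingNodeParser()
nodes = node_parser.get_nodes_from_documents(docs)
for node in nodes:
    print(f"metadata:\n{node.metadata}")
    print(f"text:\n{node.text}")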

Query asked in the RAG application (持股比例 means "shareholding ratio"; it does not match the extracted 持股比 例)

持股比例
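
Until the extraction itself is fixed, one possible workaround is to post-process the node text before indexing. The sketch below is only an illustration of that idea: `collapse_cjk_gaps` is a hypothetical helper, not part of Docling or LlamaIndex, and it assumes that any whitespace sitting directly between two CJK ideographs is an extraction artifact.

```python
import re

# Hypothetical helper, not a Docling or LlamaIndex API.
# Character ranges: CJK Unified Ideographs and Extension A.
_CJK = r"\u4e00-\u9fff\u3400-\u4dbf"

# Whitespace (space, tab, ideographic space) with a CJK ideograph on both sides.
# Lookarounds are zero-width, so runs like "持 股 比 例" are fully collapsed.
_CJK_GAP = re.compile(rf"(?<=[{_CJK}])[ \t\u3000]+(?=[{_CJK}])")


def collapse_cjk_gaps(text: str) -> str:
    """Remove whitespace that sits between two CJK ideographs."""
    return _CJK_GAP.sub("", text)


if __name__ == "__main__":
    # The header extracted from the table in this issue.
    print(collapse_cjk_gaps("持股比 例 (%)"))
```

Applied to each `node.text` before the nodes are indexed, this would make the extracted header match the query 持股比例. Spaces next to Latin text or punctuation are left alone, but any intentional space between two ideographs would also be removed, so it is a stopgap rather than a fix.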

Docling version

2.8.3

Python version

3.12

PDF

disu.pdf

tahitimoon, Dec 06 '24 15:12