docling
docling copied to clipboard
When parsing tables in a PDF, if the text content contains line breaks, the extracted content will include spaces, which may cause the RAG application to fail in retrieving relevant content when answering questions.
Bug
When parsing tables in a PDF, if the text content contains line breaks, the extracted content will include spaces, which may cause the RAG application to fail in retrieving relevant content when answering questions.
Steps to reproduce
Original PDF content
Extracted content parsed through Docling
持股比 例 (%)
Parse code
reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
docs = reader.load_data(pdf_path)
node_parser = DoclingNodeParser()
nodes = node_parser.get_nodes_from_documents(docs)
for node in nodes:
print(f"metadata:\n{node.metadata}")
print(f"text:\n{node.text}")
Asking questions in RAG applications
持股比例
Docling version
2.8.3
Python version
3.12