PyMuPDF Hierarchical Headings

Open mingzhang798 opened this issue 1 year ago • 1 comment

Description

Could open-parse be combined with PyMuPDF's pdf4llm.to_markdown() so that the parsed PDF keeps its heading hierarchy? For example, ("##", "Header 1") would represent a first-level heading, ("###", "Header 2") a second-level heading, ("####", "Header 3") a third-level heading, and so on. The resulting Markdown could then be split by heading with LangChain's MarkdownHeaderTextSplitter(). Link: https://python.langchain.com/docs/modules/data_connection/document_transformers/markdown_header_metadata/
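To make the request concrete, here is a minimal sketch of the kind of heading-based splitting described above. It is a standard-library stand-in, not LangChain's MarkdownHeaderTextSplitter and not anything open-parse or PyMuPDF provides. The function name `split_by_headers`, the chunk layout, and the sample text are all assumptions made for illustration. Only the marker/name tuples come from the request.

```python
import re

# Marker/name pairs taken from the request; everything else here is hypothetical.
HEADERS_TO_SPLIT_ON = [("##", "Header 1"), ("###", "Header 2"), ("####", "Header 3")]


def split_by_headers(markdown, headers=HEADERS_TO_SPLIT_ON):
    """Split Markdown into chunks, each tagged with its enclosing headings."""
    # Map each marker to (depth, metadata key), e.g. "##" -> (2, "Header 1").
    levels = {marker: (len(marker), name) for marker, name in headers}
    chunks = []
    current = {}  # depth -> (metadata key, heading title)
    body = []

    def flush():
        # Emit the text collected so far under the headings currently open.
        text = "\n".join(body).strip()
        if text:
            metadata = {name: title for _, (name, title) in sorted(current.items())}
            chunks.append({"metadata": metadata, "content": text})
        body.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*\S)\s*$", line)
        if match and match.group(1) in levels:
            flush()
            depth, name = levels[match.group(1)]
            # A new heading closes every heading at the same or a deeper level.
            for d in [d for d in current if d >= depth]:
                del current[d]
            current[depth] = (name, match.group(2))
        else:
            body.append(line)
    flush()
    return chunks


# Invented sample input, standing in for Markdown produced from a PDF.
sample = """## Introduction
Intro text.

### Background
Background text.

## Methods
Methods text.
"""

chunks = split_by_headers(sample)
for chunk in chunks:
    print(chunk["metadata"], "->", chunk["content"])
```

Each chunk carries the titles of the headings it sits under, which is the metadata a header-aware splitter needs. The sketch ignores fenced code blocks, so a line starting with "##" inside a code fence would be treated as a heading.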

mingzhang798 • Apr 26 '24 08:04

Could you provide some examples of before and after?

Filimoa • Apr 28 '24 17:04

Closing due to inactivity

Filimoa • Jun 04 '24 14:06