Cesar Berrospi Ramis
### Requested feature #### Background [Docling](https://github.com/DS4SD/docling) reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown), converts them into a unified data model with rich document representation...
### Requested feature The Docling library defines a `DeclarativeDocumentBackend` abstract class to transform different document formats to `DoclingDocument` without a recognition pipeline. Implementations include `HTMLDocumentBackend` for HTML pages and `MsWordDocumentBackend`...
Resolves #605 This PR implements the following changes: - Add a backend implementation to parse patent applications and grants from the United States Patent and Trademark Office (USPTO). - Refactor the module...
### Requested feature - The Docling library defines a `DeclarativeDocumentBackend` abstract class to transform different document formats to `DoclingDocument` without a recognition pipeline. Implementations include `HTMLDocumentBackend` for HTML pages and...
### Description This PR is about leveraging the `start` attribute in HTML `<ol>` tags (ordered lists) when parsing HTML documents. In HTML documents, items in ordered lists are always parsed...
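To illustrate the idea of honoring the `start` attribute, here is a minimal sketch that numbers the items of an ordered list from the declared start value. It uses Python's standard `html.parser` module rather than Docling's HTML backend, and the names `OrderedListNumberer` and `numbered_items` are hypothetical, introduced only for this example. It assumes flat (non-nested) lists and falls back to 1 when `start` is missing or not an integer.

```python
from html.parser import HTMLParser


class OrderedListNumberer(HTMLParser):
    """Hypothetical helper: collects (number, text) pairs for <li> items in <ol> lists."""

    def __init__(self):
        super().__init__()
        self.items = []      # collected (number, text) pairs
        self._counters = []  # running counter for each open <ol>
        self._in_li = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            # Read the optional start attribute; default to 1 if absent or invalid.
            start = dict(attrs).get("start")
            try:
                first = int(start) if start is not None else 1
            except ValueError:
                first = 1
            self._counters.append(first)
        elif tag == "li" and self._counters:
            self._in_li = True
            self._text = []

    def handle_data(self, data):
        if self._in_li:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._in_li:
            # Emit the item with the current counter, then advance the counter.
            self.items.append((self._counters[-1], "".join(self._text).strip()))
            self._counters[-1] += 1
            self._in_li = False
        elif tag == "ol" and self._counters:
            self._counters.pop()


def numbered_items(html):
    """Return the (number, text) pairs found in the given HTML fragment."""
    parser = OrderedListNumberer()
    parser.feed(html)
    parser.close()
    return parser.items


if __name__ == "__main__":
    for number, text in numbered_items('<ol start="5"><li>five</li><li>six</li></ol>'):
        print(f"{number}. {text}")
```

A list without a `start` attribute is numbered from 1, which matches the default behavior of ordered lists in HTML.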
This PR improves the heuristic rule to detect the file type from the first section of its content. In particular, the function `docling.datamodel.document._DocumentConversionInput._detect_html_xhtml`. Even though the HTML5 specification recommends HTML...
### Discussed in https://github.com/docling-project/docling/discussions/1323 Originally posted by **harskuma** April 8, 2025 While working with PPTX files, I came across a formatting issue that could use some enhancement. Specifically, when a...
### Requested feature To catch up with the latest `docling` and `docling-core` developments, we should add the following features in the JATS XML parser backend: - [ ] parse nested...
The default Q&A generation tries to generate 1 question of each type for every chunk. Some use cases may require more questions of a specific type (e.g., summary vs single...
The current version supports Q&A generation on tabular data. With the [ChunkingDocSerializer](https://github.com/docling-project/docling-core/blob/main/docling_core/transforms/chunker/hierarchical_chunker.py#L175), we can leverage tables from chunks in Markdown format and fine-tune the LLM prompts.