Cesar Berrospi Ramis issues

Results: 11 issues of Cesar Berrospi Ramis

Integrate Docling in Elasticsearch

### Requested feature #### Background [Docling](https://github.com/DS4SD/docling) reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown), converts them into a unified data model with rich document representation...

enhancement

Create a backend to transform XML files to DoclingDocument

### Requested feature The Docling library defines a `DeclarativeDocumentBackend` abstract class to transform different document formats to `DoclingDocument` without a recognition pipeline. Implementations include `HTMLDocumentBackend` for HTML pages and `MsWordDocumentBackend`...

enhancement

icebox

feat: create a backend to parse USPTO patents into DoclingDocument

Resolves #605 This PR implements the following changes: - Add a backend implementation to parse patent applications and grants from the United States Patent and Trademark Office (USPTO). - Refactor the module...

Create a backend to transform USPTO patents (XML and TXT) to DoclingDocument

### Requested feature - The Docling library defines a `DeclarativeDocumentBackend` abstract class to transform different document formats to `DoclingDocument` without a recognition pipeline. Implementations include `HTMLDocumentBackend` for HTML pages and...

enhancement

fix(html): use 'start' attribute when parsing ordered lists from HTML docs

### Description This PR is about leveraging the `start` attribute in HTML `<ol>` tags (ordered lists) when parsing HTML documents. In HTML documents, items in ordered lists are always parsed...

bug

html
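To illustrate what honoring the `start` attribute means in practice, here is a minimal sketch using only Python's standard `html.parser`. It is not the Docling backend code: the class `OrderedListParser` and the helper `number_ordered_items` are hypothetical names introduced for this example, and the sketch only handles flat (non-nested) lists.

```python
from html.parser import HTMLParser


class OrderedListParser(HTMLParser):
    """Hypothetical helper: collect (number, text) pairs from <ol> lists."""

    def __init__(self):
        super().__init__()
        self.items = []       # collected [number, text] entries
        self._counters = []   # one running counter per open <ol>
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            start = dict(attrs).get("start")
            # Fall back to 1 when `start` is missing or not an integer.
            try:
                self._counters.append(int(start))
            except (TypeError, ValueError):
                self._counters.append(1)
        elif tag == "li" and self._counters:
            self._in_li = True
            self.items.append([self._counters[-1], ""])
            self._counters[-1] += 1

    def handle_endtag(self, tag):
        if tag == "ol" and self._counters:
            self._counters.pop()
        elif tag == "li":
            self._in_li = False

    def handle_data(self, data):
        # Only text that appears directly inside an open <li> is kept.
        if self._in_li and self.items:
            self.items[-1][1] += data.strip()


def number_ordered_items(html):
    """Return the list items with the numbers a browser would display."""
    parser = OrderedListParser()
    parser.feed(html)
    parser.close()
    return [(number, text) for number, text in parser.items]
```

For example, `number_ordered_items('<ol start="42"><li>foo</li><li>bar</li></ol>')` numbers the items from 42 instead of restarting at 1.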

fix: guess HTML content starting with script tag

This PR improves the heuristic rule to detect the file type from the first section of its content, in particular the function `docling.datamodel.document._DocumentConversionInput._detect_html_xhtml`. Even though the HTML5 specification recommends HTML...

bug

html
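A content-sniffing heuristic of the kind described above could look roughly like the following sketch. It is an illustration only and does not reproduce `_detect_html_xhtml`: the function name `looks_like_html`, the 1024-byte window, and the list of accepted opening tags are all assumptions made for this example.

```python
import re

# Assumed set of prefixes that mark the content as HTML (illustrative only).
_HTML_START_TAGS = ("<!doctype html", "<html", "<head", "<body", "<script")

# Leading whitespace and HTML comments are skipped before checking the prefix.
_LEADING_NOISE = re.compile(r"\A(?:\s+|<!--.*?-->)+", re.DOTALL)


def looks_like_html(content: bytes) -> bool:
    """Guess whether `content` is HTML by inspecting its first bytes."""
    # Decode a small window; undecodable bytes are dropped, and a UTF-8
    # byte order mark (decoded as U+FEFF) is stripped from the front.
    head = content[:1024].decode("utf-8", errors="ignore").lstrip("\ufeff")
    head = _LEADING_NOISE.sub("", head).lower()
    return head.startswith(_HTML_START_TAGS)
```

With this sketch, content that begins directly with a `<script>` tag is classified as HTML, while a PDF header or an XML declaration is not.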

PPTX parsing: bullet points not grouped correctly under subheadings

### Discussed in https://github.com/docling-project/docling/discussions/1323 Originally posted by **harskuma** April 8, 2025 While working with PPTX files, I came across a formatting issue that could use some enhancement. Specifically, when a...

bug

pptx

good first issue

Improve JATS parsing with nested lists, inline formulas, ordered lists, content layers

### Requested feature To catch up with the latest `docling` and `docling-core` developments, we should add the following features in the JATS XML parser backend: - [ ] parse nested...

enhancement

xml

Allow generating a custom profile of question types

The default Q&A generation tries to generate 1 question of each type for every chunk. Some use cases may require more questions of a specific type (e.g., summary vs single...

enhancement

Improve the generation of QA pairs on tables

The current version supports Q&A generation on tabular data. With the [ChunkingDocSerializer](https://github.com/docling-project/docling-core/blob/main/docling_core/transforms/chunker/hierarchical_chunker.py#L175), we can leverage tables from chunks in Markdown format and fine-tune the LLM prompts.

enhancement