Numbered headings in Word documents appear as list items
First off, thank you for docling! <3
A standard representation that maintains context and hierarchy for content across multiple formats, under an MIT licence, is just super! I'm also a fan of features like the hybrid text chunker.
Bug
Lots of long technical documents use multilevel lists in Word to produce numbered sections.
These documents sometimes also include numbered paragraphs.
At the moment, the Word backend checks whether an item is a list item and handles that case separately, before checking whether it is a heading.
See: https://github.com/DS4SD/docling/blob/3bb3bf57150c9705a055982e6fb0cc8d1408f161/docling/backend/msword_backend.py#L244-L297
As a result, paragraphs that are both a list item and a heading are treated only as a list item. It would probably be more useful to treat them as a heading and convert the list index into plain text.
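To illustrate why the order of the checks matters, here is a minimal sketch. It is not docling's actual code: `Para`, `heading_level`, `classify_list_first` and `classify_heading_first` are hypothetical names made up for this example, and a real backend would read the style and numbering from the document XML.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Para:
    """Hypothetical stand-in for a parsed Word paragraph."""
    text: str
    style: str         # e.g. "Heading 2", "List Paragraph", "Normal"
    is_numbered: bool  # True if the paragraph carries list numbering


def heading_level(style: str) -> Optional[int]:
    """Return the level for styles named like 'Heading 2', else None."""
    prefix = "Heading "
    if style.startswith(prefix) and style[len(prefix):].isdigit():
        return int(style[len(prefix):])
    return None


def classify_list_first(p: Para) -> str:
    """Order described in this report: the list check wins."""
    if p.is_numbered:
        return "list_item"
    if heading_level(p.style) is not None:
        return "heading"
    return "paragraph"


def classify_heading_first(p: Para) -> str:
    """Suggested order: a numbered heading stays a heading."""
    if heading_level(p.style) is not None:
        return "heading"
    if p.is_numbered:
        return "list_item"
    return "paragraph"


numbered_heading = Para("Section 1", "Heading 1", is_numbered=True)
print(classify_list_first(numbered_heading))     # list_item
print(classify_heading_first(numbered_heading))  # heading
```

With the heading check first, ordinary numbered paragraphs (numbered, but without a heading style) would still be classified as list items.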
I have had a go at adding a failing unit test, by adding a modified copy of unit_test_headers.docx and the expected ground truths for this case in a fork here: https://github.com/DS4SD/docling/commit/a54436046e822800abdb8fc0692acada32bd9d99
I have also attached the same example to this issue: unit_test_headers_numbered.docx
Current output:
# Test Document
- Section 1
Paragraph 1.1
Paragraph 1.2
Expected output:
# Test Document
## 1. Section 1
Paragraph 1.1
Paragraph 1.2
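One way to turn the list index into plain text, as the expected output above does, is to keep one counter per heading level. The sketch below makes several assumptions: `render_numbered_headings` is a hypothetical helper, numbering is plain decimal starting at 1, and the first numbered level is rendered as "##" because "#" is used by the document title. Real Word numbering definitions can use other formats and start values, which this does not handle.

```python
def render_numbered_headings(headings):
    """Render (level, text) pairs as Markdown headings with a plain-text index.

    Assumes decimal numbering starting at 1 and that level 1 maps to "##".
    """
    counters = []
    lines = []
    for level, text in headings:
        # Add zeroed counters when the heading is deeper than any seen so far.
        while len(counters) < level:
            counters.append(0)
        # Discard counters of deeper levels so they restart under a new parent.
        del counters[level:]
        counters[level - 1] += 1
        index = ".".join(str(c) for c in counters)
        if level == 1:
            index += "."  # top level is written "1.", sublevels "1.1"
        lines.append(f"{'#' * (level + 1)} {index} {text}")
    return lines


example = [(1, "Section 1"), (2, "Sub A"), (2, "Sub B"), (1, "Section 2")]
for line in render_numbered_headings(example):
    print(line)
```

The "Sub A" and "Sub B" headings are invented for the example; only "Section 1" comes from the test document.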
Steps to reproduce
Parse a Word document with numbered headings, such as unit_test_headers_numbered.docx.
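A minimal reproduction script might look like the following. It assumes docling is installed and that the attached file has been saved to the working directory; `docx_to_markdown` is just a helper name chosen for this example.

```python
from pathlib import Path


def docx_to_markdown(path: Path) -> str:
    """Convert a .docx file with docling and return its Markdown export."""
    # Imported lazily so the helper can be defined without docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(path)
    return result.document.export_to_markdown()


# Assumption: the attached example sits next to this script.
sample = Path("unit_test_headers_numbered.docx")
if sample.exists():
    print(docx_to_markdown(sample))
```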
Docling version
Docling version: 2.12.0
Docling Core version: 2.9.0
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0
Python version
Python 3.12.3
I think there is also a related issue: sometimes the first item of a list inside a numbered heading section goes missing.
If it would be useful, I can create a failing test for that too.
@mattmalcher If you can provide us with failing tests that would be very helpful for checking, thanks.
I have added two failing tests, with ground truths, on a branch of my fork: https://github.com/mattmalcher/docling/tree/issue_612_docx_numbered_headings
For the issue with text going missing where numbered headings are involved:
Original Document
Expected (Markdown)
Actual (Markdown)
Note that heading 1.2 here has gone altogether!
I'm also running into this problem. It seems that Docling is not extracting the heading data directly from Word.
Original Document
Expected (doctags)
<section_header>1. Introduction</section_header>
<section_header>1.1 A Regulated Environment by Ensuring Good Laboratory Practices (GLP) Studies in Toxicology, Pathology, and Drug Development Against a Backdrop of Heterogeneous Technologies.</section_header>
<paragraph>Digital Toxicologic Histopathology has become a crucial aspect of the process used to establish drug safety [13], offering advanced methods to evaluate potential drug-induced toxicity, and enhancing patient safety. Deep Learning (DL) as part of Artificial intelligence (AI) applied to digital pathology is causing a revolution in the field, enabling automated analysis (AI/DL) and improved diagnostic accuracy, thus improving patient safety. Moreover, Microsoft usurped a common term from aeronautics, a Co-Pilot, which is really an autonomous agent that performs background work to simplify tasks and improve insights when applied to information. In Digital Histopathology we are both using AI/DL to score slides to match what a pathologist would do, but also to leverage Co-Pilot functions to guide them to slides of interest where abnormalities might form. Here are some of the many possible uses of AI/DL in Toxicologic Pathology:</paragraph>
<paragraph>AI in Toxicologic Pathology: AI/DL has significantly contributed to advancing the implementation of toxicological pathology, which focuses on evaluating drug safety. These technologies have shown promise in automating toxicological assessments, potentially leading to more personalized medicine approaches [1]. While the focus of this paper is on toxicologic pathology in drug safety and development, the system and novel approaches are being designed to span broader sets of use cases.</paragraph>
Actual
<list_item>Introduction</list_item>
<paragraph></paragraph>
<list_item>A Regulated Environment by Ensuring Good Laboratory Practices (GLP) Studies in Toxicology, Pathology, and Drug Development Against a Backdrop of Heterogeneous Technologies.</list_item>
<paragraph>Digital Toxicologic Histopathology has become a crucial aspect of the process used to establish drug safety [13], offering advanced methods to evaluate potential drug-induced toxicity, and enhancing patient safety. Deep Learning (DL) as part of Artificial intelligence (AI) applied to digital pathology is causing a revolution in the field, enabling automated analysis (AI/DL) and improved diagnostic accuracy, thus improving patient safety. Moreover, Microsoft usurped a common term from aeronautics, a Co-Pilot, which is really an autonomous agent that performs background work to simplify tasks and improve insights when applied to information. In Digital Histopathology we are both using AI/DL to score slides to match what a pathologist would do, but also to leverage Co-Pilot functions to guide them to slides of interest where abnormalities might form. Here are some of the many possible uses of AI/DL in Toxicologic Pathology:</paragraph>
<paragraph>AI in Toxicologic Pathology: AI/DL has significantly contributed to advancing the implementation of toxicological pathology, which focuses on evaluating drug safety. These technologies have shown promise in automating toxicological assessments, potentially leading to more personalized medicine approaches [1]. While the focus of this paper is on toxicologic pathology in drug safety and development, the system and novel approaches are being designed to span broader sets of use cases.</paragraph>
#795 is probably the same issue.
Edit: It is the same issue. As explained there, debugging shows that the label parsing produces stray whitespace that breaks the logic. A possible solution is also described in that issue.
Same issue: https://github.com/DS4SD/docling/issues/795
Thank you @mattmalcher, @MiguelAngelTorres, @asvintheguy, this PR should resolve this issue: https://github.com/DS4SD/docling/pull/842
Shall take a look, thank you 😁