
Numbered headings in Word documents appear as list items

Open mattmalcher opened this issue 1 year ago • 3 comments

First off, thank you for docling! <3

A standard representation, maintaining context and hierarchy, for content across multiple formats, with an MIT licence is just super! Fan of features like the hybrid text chunker.

Bug

Many long technical documents use multilevel lists in Word to number their sections.

These documents sometimes also include numbered paragraphs.

At the moment, in the Word backend, docling first checks whether an item is a list item and handles that case separately, and only afterwards checks whether it is a heading.

see: https://github.com/DS4SD/docling/blob/3bb3bf57150c9705a055982e6fb0cc8d1408f161/docling/backend/msword_backend.py#L244-L297

So paragraphs that are both a list item and a heading are treated only as a list item. It would probably be more useful to treat them as headings and convert the list index into plain text.
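To illustrate the suggestion, here is a minimal sketch of checking for a heading before checking for a list item, and writing the list number into the heading text. It is not docling's code: `Para`, `heading_level`, `is_numbered` and `to_markdown` are hypothetical names made up for this example, and the numbering logic is deliberately simplified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Para:
    """Hypothetical stand-in for a parsed Word paragraph (not a docling type)."""
    text: str
    heading_level: Optional[int] = None  # 0 = title, 1 = "Heading 1", ...
    is_numbered: bool = False            # paragraph carries list numbering


def to_markdown(paras: list[Para]) -> str:
    counters: dict[int, int] = {}  # running number per heading level
    out: list[str] = []
    for p in paras:
        if p.heading_level is not None:
            # Heading check comes first, so a numbered heading is not
            # swallowed by the list-item branch below.
            prefix = ""
            if p.is_numbered:
                level = p.heading_level
                counters[level] = counters.get(level, 0) + 1
                # A new heading restarts the numbering of deeper levels.
                for deeper in [k for k in counters if k > level]:
                    del counters[deeper]
                # Render the list index as plain text, e.g. "1. " or "1.2. ".
                prefix = ".".join(str(counters[k]) for k in sorted(counters)) + ". "
            out.append("#" * (p.heading_level + 1) + " " + prefix + p.text)
        elif p.is_numbered:
            out.append("- " + p.text)
        else:
            out.append(p.text)
    return "\n\n".join(out)
```

With the reverse order of checks (list item first), the numbered heading would fall into the `- ` branch, which matches the behaviour described in this report.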

I have had a go at adding a failing unit test, by adding a modified copy of unit_test_headers.docx and the expected ground truths for this case in a fork here: https://github.com/DS4SD/docling/commit/a54436046e822800abdb8fc0692acada32bd9d99

Have also attached the same example to this issue: unit_test_headers_numbered.docx

Current output:

```markdown
# Test Document

- Section 1

Paragraph 1.1

Paragraph 1.2
```

Expected output:

```markdown
# Test Document
## 1. Section 1

Paragraph 1.1

Paragraph 1.2
```

Steps to reproduce

Parse a Word document with numbered headings, such as unit_test_headers_numbered.docx
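For reference, a minimal reproduction sketch, assuming the attached file has been saved next to the script. `docx_to_markdown` is a helper name chosen for this example; it relies on docling's `DocumentConverter` and the document's `export_to_markdown()` method.

```python
from pathlib import Path


def docx_to_markdown(path: Path) -> str:
    """Convert a .docx file with docling and return its Markdown export."""
    # Imported inside the function so this snippet loads even where
    # docling is not installed; the import only runs on an actual call.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(path)
    return result.document.export_to_markdown()
```

Calling `print(docx_to_markdown(Path("unit_test_headers_numbered.docx")))` with the versions listed below should show "Section 1" as a list item rather than a heading, as reported in this issue.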

Docling version

Docling version: 2.12.0
Docling Core version: 2.9.0
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

Python version

Python 3.12.3

mattmalcher avatar Dec 16 '24 19:12 mattmalcher

I think there is also a related issue: the first item of a list inside a numbered heading section sometimes goes missing.

If useful I can create a failing test for that too?

mattmalcher avatar Dec 16 '24 19:12 mattmalcher

@mattmalcher If you can provide us with failing tests that would be very helpful for checking, thanks.

cau-git avatar Dec 18 '24 14:12 cau-git

I have added two failing tests, with ground truths in a branch in a fork here: https://github.com/mattmalcher/docling/tree/issue_612_docx_numbered_headings

For the issue with text going missing where numbered headings are involved:

Original document: (image)

Expected (Markdown): (image)

Actual (Markdown): (image) Note that heading 1.2 here has gone altogether!

mattmalcher avatar Dec 19 '24 18:12 mattmalcher

I'm also running into this problem. It seems like Docling is not directly extracting the header data from Word.

Original document: (image)

Expected (doctags)

<section_header>1. Introduction</section_header>
<section_header>1.1 A Regulated Environment by Ensuring Good Laboratory Practices (GLP) Studies in Toxicology, Pathology, and Drug Development Against a Backdrop of Heterogeneous Technologies.</section_header>
<paragraph>Digital Toxicologic Histopathology has become a crucial aspect of the process used to establish drug safety [13], offering advanced methods to evaluate potential drug-induced toxicity, and enhancing patient safety. Deep Learning (DL) as part of Artificial intelligence (AI) applied to digital pathology is causing a revolution in the field, enabling automated analysis (AI/DL) and improved diagnostic accuracy, thus improving patient safety.  Moreover, Microsoft usurped a common term from aeronautics, a Co-Pilot, which is really an autonomous agent that performs background work to simplify tasks and improve insights when applied to information.  In Digital Histopathology we are both using AI/DL to score slides to match what a pathologist would do, but also to leverage Co-Pilot functions to guide them to slides of interest where abnormalities might form.  Here are some of the many possible uses of AI/DL in Toxicologic Pathology:</paragraph>
<paragraph>AI in Toxicologic Pathology: AI/DL has significantly contributed to advancing the implementation of toxicological pathology, which focuses on evaluating drug safety. These technologies have shown promise in automating toxicological assessments, potentially leading to more personalized medicine approaches [1]. While the focus of this paper is on toxicologic pathology in drug safety and development, the system and novel approaches are being designed to span broader sets of use cases.</paragraph>

Actual

<list_item>Introduction</list_item>
<paragraph></paragraph>
<list_item>A Regulated Environment by Ensuring Good Laboratory Practices (GLP) Studies in Toxicology, Pathology, and Drug Development Against a Backdrop of Heterogeneous Technologies.</list_item>
<paragraph>Digital Toxicologic Histopathology has become a crucial aspect of the process used to establish drug safety [13], offering advanced methods to evaluate potential drug-induced toxicity, and enhancing patient safety. Deep Learning (DL) as part of Artificial intelligence (AI) applied to digital pathology is causing a revolution in the field, enabling automated analysis (AI/DL) and improved diagnostic accuracy, thus improving patient safety.  Moreover, Microsoft usurped a common term from aeronautics, a Co-Pilot, which is really an autonomous agent that performs background work to simplify tasks and improve insights when applied to information.  In Digital Histopathology we are both using AI/DL to score slides to match what a pathologist would do, but also to leverage Co-Pilot functions to guide them to slides of interest where abnormalities might form.  Here are some of the many possible uses of AI/DL in Toxicologic Pathology:</paragraph>
<paragraph>AI in Toxicologic Pathology: AI/DL has significantly contributed to advancing the implementation of toxicological pathology, which focuses on evaluating drug safety. These technologies have shown promise in automating toxicological assessments, potentially leading to more personalized medicine approaches [1]. While the focus of this paper is on toxicologic pathology in drug safety and development, the system and novel approaches are being designed to span broader sets of use cases.</paragraph>

asvintheguy avatar Jan 21 '25 20:01 asvintheguy

#795 is probably the same issue.

Edit

It is the same issue. As explained in #795, debugging shows that the label parsing produces stray whitespace that breaks the logic. A possible solution is described there.
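The details of that fix are in the other issue and are not reproduced here. As a purely illustrative sketch of how stray whitespace can break a label check and how normalizing it helps, consider the following. `heading_level` is a hypothetical helper, and the "Heading N" label format is an assumption, not docling's actual parsing code.

```python
import re
from typing import Optional


def heading_level(style_label: str) -> Optional[int]:
    """Return N for labels like 'Heading N', or None for anything else.

    Hypothetical helper for illustration only.
    """
    # Trim the ends and collapse internal runs of whitespace, so that
    # " Heading  2 " is treated the same as "Heading 2".
    normalized = " ".join(style_label.split())
    match = re.fullmatch(r"heading\s*(\d+)", normalized, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None
```

A naive comparison such as `label == "Heading 1"` would fail on a label with a leading or trailing space, and the paragraph would then fall through to the list-item handling.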

MiguelAngelTorres avatar Jan 27 '25 15:01 MiguelAngelTorres

Same issue: https://github.com/DS4SD/docling/issues/795

maxmnemonic avatar Jan 30 '25 10:01 maxmnemonic

Thank you @mattmalcher, @MiguelAngelTorres, @asvintheguy, this PR should resolve this issue: https://github.com/DS4SD/docling/pull/842

maxmnemonic avatar Jan 30 '25 13:01 maxmnemonic

Shall take a look, thank you 😁

mattmalcher avatar Jan 30 '25 13:01 mattmalcher