
[Bug] Docling2parquet conversion of HTML file loses some content

Open shahrokhDaijavad opened this issue 4 months ago • 5 comments

Search before asking

  • [x] I searched the issues and found no similar issues.

Component

transforms/docling2parquet

What happened + What you expected to happen

When we use the attached HTML file as input, the contents column of the output parquet is missing the main part of the document that comes before the References section. I have attached parquet file 1, produced with the default text/markdown content-type option, and parquet file 2, produced with the text/plain option.

enwiki_namespace_0_0_306.html.txt

enwiki_namespace_0_0_306-1.parquet.txt

enwiki_namespace_0_0_306-2.parquet.txt

Reproduction script

Please see above.

Anything else

No response

OS

MacOS

Python

3.12

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

shahrokhDaijavad avatar Aug 04 '25 21:08 shahrokhDaijavad

cc: @dolfim-ibm, @touma-I

shahrokhDaijavad avatar Aug 04 '25 21:08 shahrokhDaijavad

Looking at the input PDF, I think what you see is Docling marking all content before the first header as furniture. Since furniture is excluded (by default) from the Markdown and HTML output, it won't be in the current output.

You could add an option to allow furniture in the output.

dolfim-ibm avatar Aug 05 '25 08:08 dolfim-ibm

Thanks, @dolfim-ibm. You mean the input HTML. Which option allows furniture to be included in the output?

shahrokhDaijavad avatar Aug 05 '25 15:08 shahrokhDaijavad

from docling_core.types.doc import ContentLayer
document.export_to_markdown(included_content_layers={ContentLayer.BODY, ContentLayer.FURNITURE})

dolfim-ibm avatar Aug 05 '25 15:08 dolfim-ibm

Nice!!! Thanks @dolfim-ibm You are a life saver

touma-I avatar Aug 05 '25 18:08 touma-I