data-prep-kit [Bug] Docling2parquet conversion of HTML file loses some content

Search before asking

[x] I searched the issues and found no similar issues.

Component

transforms/docling2parquet

What happened + What you expected to happen

When we use the attached HTML file as input, the contents column of the output parquet is missing the main part of the document that precedes the Reference section. I have attached parquet file 1, produced with the default text/markdown content-type option, and parquet file 2, produced with the text/plain option.

enwiki_namespace_0_0_306.html.txt

enwiki_namespace_0_0_306-1.parquet.txt

enwiki_namespace_0_0_306-2.parquet.txt

Reproduction script

Please see above.

Anything else

No response

OS

MacOS

Python

3.12

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

Aug 04 '25 21:08 shahrokhDaijavad

cc: @dolfim-ibm, @touma-I

Aug 04 '25 21:08 shahrokhDaijavad

Looking at the input PDF, I think what you see is Docling classifying all content before the first header as furniture. Since furniture is excluded (by default) from the Markdown and HTML output, it won't be in the current output.

You could add an option to allow furniture in the output.

Aug 05 '25 08:08 dolfim-ibm

Thanks, @dolfim-ibm. You mean the input HTML. What is the option that allows furniture to be included in the output?

Aug 05 '25 15:08 shahrokhDaijavad

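from docling_core.types.doc import ContentLayer

# `document` is the DoclingDocument to export, e.g. the `.document` of a conversion result.
# Passing both layers keeps furniture content in the Markdown output.
document.export_to_markdown(included_content_layers={ContentLayer.BODY, ContentLayer.FURNITURE})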
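To illustrate why the layer set matters, here is a toy model of the filtering, written with the standard library only. It is a sketch under assumptions, not Docling's implementation: `Layer`, `Item`, and `export_text` are hypothetical names standing in for the content-layer idea, and the sample items are made up.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    """Hypothetical stand-in for the content-layer concept (not the docling_core class)."""
    BODY = "body"
    FURNITURE = "furniture"


@dataclass
class Item:
    """One piece of document text tagged with the layer it was assigned to."""
    text: str
    layer: Layer


def export_text(items, included_layers=frozenset({Layer.BODY})):
    """Join the text of the items whose layer is in included_layers.

    The default mirrors the behavior described in this thread:
    only body content is exported unless furniture is requested.
    """
    return "\n\n".join(item.text for item in items if item.layer in included_layers)


# Toy document: the text before the first heading was tagged as furniture.
items = [
    Item("Intro paragraph before any heading.", Layer.FURNITURE),
    Item("## References", Layer.BODY),
    Item("1. Some citation", Layer.BODY),
]

default_output = export_text(items)  # furniture is dropped
full_output = export_text(items, {Layer.BODY, Layer.FURNITURE})  # furniture is kept

print(default_output)
print("---")
print(full_output)
```

With the default layer set, the intro paragraph is missing from the output, which matches the symptom reported here; requesting both layers brings it back.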

Aug 05 '25 15:08 dolfim-ibm

Nice!!! Thanks, @dolfim-ibm. You are a lifesaver.

Aug 05 '25 18:08 touma-I