[Bug] Docling2parquet conversion of HTML file loses some content
Search before asking
- [x] I searched the issues and found no similar issues.
Component
transforms/docling2parquet
What happened + What you expected to happen
When we use the attached HTML file as input, the contents column of the output parquet is missing the main part that precedes the Reference section. I have attached parquet file 1, produced with the default text/markdown content-type option, and parquet file 2, produced with the text/plain option.
enwiki_namespace_0_0_306.html.txt
enwiki_namespace_0_0_306-1.parquet.txt
enwiki_namespace_0_0_306-2.parquet.txt
Reproduction script
Please see above.
Anything else
No response
OS
MacOS
Python
3.12
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
cc: @dolfim-ibm, @touma-I
Looking at the input PDF, I think what you see is Docling treating all content before the first header as furniture. Since furniture is excluded (by default) from the markdown and html output, it won't be in the current output.
You could add an option to allow furniture in the output.
Thanks, @dolfim-ibm. You mean the input HTML. What is the option that allows furniture to be included in the output?
from docling_core.types.doc import ContentLayer
# `document` is the DoclingDocument produced by the conversion
document.export_to_markdown(included_content_layers={ContentLayer.BODY, ContentLayer.FURNITURE})
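For anyone hitting the same symptom, here is a toy model in plain Python of the behaviour discussed in this thread. It is only a sketch and not Docling's implementation: `Layer`, `Item`, `tag_layers`, and `export_text` are hypothetical names, and the "everything before the first header is furniture" rule is the guess made above, not a confirmed description of the HTML backend. It shows why the default export drops the leading content and why passing both layers brings it back.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    # Hypothetical stand-in for the content layers mentioned in this thread.
    BODY = "body"
    FURNITURE = "furniture"


@dataclass
class Item:
    text: str
    is_header: bool = False
    layer: Layer = Layer.BODY


def tag_layers(items):
    """Mark every item before the first header as furniture (toy heuristic)."""
    seen_header = False
    for item in items:
        if item.is_header:
            seen_header = True
        item.layer = Layer.BODY if seen_header else Layer.FURNITURE
    return items


def export_text(items, included_layers=frozenset({Layer.BODY})):
    """Join the text of the items whose layer is in included_layers."""
    return "\n".join(i.text for i in items if i.layer in included_layers)


items = tag_layers([
    Item("Intro paragraph before any heading"),
    Item("References", is_header=True),
    Item("1. Some citation"),
])

# Default export keeps the body layer only, so the intro is dropped.
default_output = export_text(items)
# Including the furniture layer restores the leading content.
full_output = export_text(items, {Layer.BODY, Layer.FURNITURE})

print(default_output)
print(full_output)
```

In this sketch the default call returns only the "References" part, which matches what the attached parquet files show, while the second call mirrors the `included_content_layers` suggestion above.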
Nice!!! Thanks, @dolfim-ibm. You are a lifesaver.