[html] Docling produces empty Markdown file when parsing HTML document
Bug
Using the HTML below (actual content has been removed), Docling outputs an empty Markdown file.
Steps to reproduce
Save as HTML file:
<p>
<h1 id="1"><span style="color: rgb(0,0,0);">1</span></h1>
<div>
<p><br /></p>
<div class="table-wrap">
<table class="wrapped confluenceTable">
<colgroup>
<col />
<col />
<col />
</colgroup>
<tbody>
<tr>
<td class="confluenceTd">
<p><strong><span style="color: rgb(0,0,0);">2</span></strong></p>
</td>
<td class="confluenceTd">
<p><strong><span style="color: rgb(0,0,0);">3</span></strong></p>
</td>
<td class="confluenceTd">
<p><strong><span style="color: rgb(0,0,0);">4</span></strong></p>
</td>
</tr>
<tr>
<td class="confluenceTd">
<p><span style="color: rgb(0,0,0);">5</span></p>
</td>
<td class="confluenceTd">
<p><span style="color: rgb(0,0,0);">6</span>
</p>
</td>
<td class="confluenceTd">
<p><span style="color: rgb(0,0,0);">7</span></p>
</td>
</tr>
<tr>
<td class="confluenceTd">
<p><span style="color: rgb(0,0,0);">8</span></p>
</td>
<td class="confluenceTd">
<p><span style="color: rgb(0,0,0);">9</span></p>
</td>
<td class="confluenceTd">
<p><span style="color: rgb(0,0,0);">10</span></p>
</td>
</tr>
<tr>
<td class="confluenceTd">
<p><span style="color: rgb(0,0,0);">11</span></p>
</td>
<td class="confluenceTd">
<p><span style="color: rgb(0,0,0);">12</span></p>
</td>
<td class="confluenceTd">
<p><span style="color: rgb(0,0,0);">13</span></p>
</td>
</tr>
<tr>
<td class="confluenceTd">
<p><span style="color: rgb(255,0,255);">14</span></p>
</td>
<td class="confluenceTd">
<p><span style="color: rgb(0,0,0);">15</span></p>
</td>
<td class="confluenceTd">
<p><span style="color: rgb(0,0,0);">16</span></p>
</td>
</tr>
<tr>
<td class="confluenceTd">
<p><span style="color: rgb(255,0,255);">17</span></p>
</td>
<td class="confluenceTd">
<p><span style="color: rgb(0,0,0);">18</span></p>
</td>
<td class="confluenceTd">
<p><span style="color: rgb(0,0,0);">19</span></p>
<p><span style="color: rgb(0,0,0);">20</span></p>
</td>
</tr>
</tbody>
</table>
</div>
<h1 id="21"><span style="color: rgb(0,0,0);">21</span></h1>
</div>
<br /></p>
<p>2nd paragraph</p>
Run docling: docling <filename>.html. The output Markdown file is empty, but it is expected to have some content.
Docling version
2025-11-12 16:11:32,416 - INFO - Loading plugin 'docling_defaults'
2025-11-12 16:11:32,417 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
Docling version: 2.61.2
Docling Core version: 2.51.0
Docling IBM Models version: 3.10.2
Docling Parse version: 4.7.1
Python: cpython-310 (3.10.12)
Platform: Linux-6.8.0-79-generic-x86_64-with-glibc2.35
Python version
Python 3.10.12
This is a known issue with Docling's HTML parsing, especially for documents with nested elements, tables, and styled spans. Docling only treats certain tags like <p>, <h1>–<h6>, <table>, and <ul> as structural sections. Content inside <div> or <span> tags may be ignored or missed, which can result in empty Markdown output—even if the HTML looks valid. There are also known bugs where content inside <span> tags is skipped, and complex tables or multiple paragraphs inside table cells can cause content loss or merging into a single line (source).
Recent versions (including 2.61.2) have had regressions in table parsing, sometimes leading to empty or incomplete output for complex HTML tables (source). Fixes are in progress, but issues may persist with certain HTML structures.
Workarounds:
- Preprocess your HTML to use supported block tags (e.g., replace <div> with <p> or <section>).
- For tables, replace consecutive <p> tags inside <td> with <br> tags to preserve line breaks.
- If you need to preserve content inside <div> or <section>, you can extend the _BLOCK_TAGS variable in docling/chunking/html.py to include these tags.
- For advanced cases, consider creating a custom Markdown serializer to handle complex table cells.
If you share a minimal example of your HTML, you can often pinpoint which tags are being skipped and adjust accordingly. For more details and ongoing fixes, see this issue and this PR.
@tysonite there are several reasons why you don't see any output in the Markdown export. But first of all, please note that the code above does not conform to the HTML5 specification: the paragraph element <p> may only contain phrasing content and cannot contain headings (for instance, check here). If you drop the first line (<p>) and its corresponding closing tag on the second-to-last line, you will get the expected Markdown conversion. Browsers may still render the content since they automatically close the paragraph before the heading starts.
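When the HTML comes from an exporter that cannot be changed, the browser behaviour described above can be approximated in a preprocessing step before the file is passed to Docling. The following is a minimal sketch using only the Python standard library, not Docling's own code: the names `ParagraphCloser`, `close_paragraphs` and `P_CLOSERS` are hypothetical, and the set of tags that close an open paragraph is a partial list chosen for illustration.

```python
from html.parser import HTMLParser

# Assumption: a partial list of start tags that implicitly close an open <p>.
P_CLOSERS = {"h1", "h2", "h3", "h4", "h5", "h6", "div", "table",
             "ul", "ol", "p", "pre", "blockquote", "section"}


class ParagraphCloser(HTMLParser):
    """Re-emit HTML, closing an open <p> before any block-level start tag."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.p_open = False

    def handle_starttag(self, tag, attrs):
        if tag in P_CLOSERS and self.p_open:
            self.out.append("</p>")  # close the paragraph like a browser would
            self.p_open = False
        if tag == "p":
            self.p_open = True
        self.out.append(self.get_starttag_text())  # keep the original tag text

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br />

    def handle_endtag(self, tag):
        if tag == "p":
            if not self.p_open:
                return  # stray </p> left over after an implicit close: drop it
            self.p_open = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def close_paragraphs(html: str) -> str:
    parser = ParagraphCloser()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


sample = '<p>\n<h1 id="1">1</h1>\n<div><p>a</p></div>\n<br /></p>\n<p>2nd paragraph</p>'
print(close_paragraphs(sample))
```

The sketch tracks a single open-paragraph flag, which is enough here because <p> elements cannot nest. It drops the stray closing tag instead of turning it into an empty paragraph as a browser would, so the result is not byte-for-byte what a browser's DOM would serialize.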
Technically, you don't see any content since:
- Docling checks for headings to determine what goes into the body content layer and what goes into the furniture. By convention, whatever is found before the first heading is placed in the furniture, unless there is no heading at all. In the case above there is a heading, but when the document is parsed, the top-level element (the <p>) appears before any heading, so it is automatically placed in the furniture. The same happens with the second top-level element, <p>2nd paragraph</p>, which is also placed in the furniture.
- All the content within the first paragraph (including the table) is treated as phrasing content by Docling, so the structure of the table is lost.
- As explained above, Docling does parse the content into the DoclingDocument, but it puts it in the furniture. By default, the export to Markdown only considers the content in the body layer. If you want to export the furniture too, you need to request it explicitly. For instance, in the Python API:

  ```python
  from docling.document_converter import DocumentConverter, InputFormat
  from docling_core.types.doc.document import ContentLayer

  # Restrict the converter to HTML input and convert the file.
  converter = DocumentConverter(allowed_formats=[InputFormat.HTML])
  result = converter.convert("your_file.html")

  # Export both layers instead of the default body-only export.
  md = result.document.export_to_markdown(
      included_content_layers={ContentLayer.BODY, ContentLayer.FURNITURE}
  )
  ```
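The furniture convention from the first point can be illustrated with a toy model. This is only a sketch of the rule as it is described in this comment, not Docling's implementation: `Element`, `assign_layers` and the `has_nested_heading` flag are hypothetical names, and the handling of headings nested inside other elements is an assumption drawn from the behaviour reported here.

```python
from dataclasses import dataclass

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


@dataclass
class Element:
    """A top-level element of the parsed page (toy representation)."""
    tag: str
    has_nested_heading: bool = False  # assumption: a heading exists somewhere inside


def assign_layers(top_level: list[Element]) -> list[str]:
    # If the page has no heading at all, everything goes to the body.
    any_heading = any(e.tag in HEADINGS or e.has_nested_heading for e in top_level)
    in_body = not any_heading
    layers = []
    for element in top_level:
        if element.tag in HEADINGS:
            in_body = True  # the first top-level heading starts the body
        layers.append("body" if in_body else "furniture")
    return layers


# The page from this issue: an outer <p> wrapping the headings, then a second <p>.
print(assign_layers([Element("p", has_nested_heading=True), Element("p")]))

# The same page once the outer <p> wrapper is dropped.
print(assign_layers([Element("h1"), Element("div"), Element("h1"), Element("p")]))
```

In this model the reported page ends up entirely in the furniture, because a heading exists but never appears as a top-level element, while the page without the outer <p> lands entirely in the body.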
With all that said, we will suggest a change in the HTML backend parser that automatically closes any open paragraph before a heading starts, even if this situation originates from invalid markup.
Thanks, @ceberam, for your response and for analyzing this issue. The HTML sample I included in the issue description is generated by Atlassian Confluence, so I do not think I can easily change their HTML exporter. Is the HTML backend parser mentioned in your comment part of Confluence, or is it part of the Docling backend used to process HTML produced by Confluence?
@tysonite the HTML parser (HTMLDocumentBackend) is part of Docling and it is intended to parse any HTML5 document.
In any case, I'm testing a patch that deals with this type of HTML pages.