docling Control HTML document Unicode decoding

Requested feature

Sometimes upstream tools, e.g., OCR libraries, output incorrectly encoded HTML. Because of visual similarity, an undesired and incorrect character such as https://www.compart.com/en/unicode/U+E157 may be encoded instead of https://www.compart.com/en/unicode/U+2630. Currently, when Docling parses an HTML document with such a character, it (or rather, BeautifulSoup) escapes the character. For example, this heading item:

<h2 id="contents">Contents<a class="headerlink" href="#contents" title="Permanent link"></a></h2>

ends up with the .text value:

'Contents\ue157'

I have not found a straightforward way to control this behavior from within Docling or BeautifulSoup.

Alternatives

I have not found a robust and direct method to process these escapes from within Python. String substitution tricks are possible but at a performance cost.
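For a known, fixed set of wrong characters, one substitution approach that avoids chained `str.replace` calls is a single `str.translate` pass. The sketch below is only an illustration and runs outside Docling: the mapping and the name `fix_known_characters` are hypothetical, and it assumes U+E157 should always become U+2630, as in the example above.

```python
# Hypothetical mapping from unwanted code points to their intended replacements.
# Only the U+E157 -> U+2630 pair comes from this issue; extend as needed.
REPLACEMENTS = str.maketrans({"\ue157": "\u2630"})


def fix_known_characters(text: str) -> str:
    """Replace every mapped code point in one pass over the string."""
    return text.translate(REPLACEMENTS)


fixed = fix_known_characters("Contents\ue157")
print(fixed == "Contents\u2630")
```

This only helps when the wrong characters are known in advance; it does not detect new ones.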

Jan 06 '25 10:01 sanmai-NL

@sanmai-NL I am not entirely sure what your request is. The escaped Unicode in the string representation is a single character and will actually print as a symbol, such as in:

> s = 'Contents\ue157'
> print(s)

Contents

How it prints depends on the interpreter.
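To make the distinction concrete, the short sketch below (standard library only) shows that `\ue157` in the repr is a single code point rather than six literal characters, and that it falls in the private-use category `Co`, which is why its rendering depends on the terminal and font.

```python
import unicodedata

s = "Contents\ue157"

# repr() shows the escape sequence; the string itself holds one code point.
print(repr(s))
print(len(s))  # 9: eight letters plus one code point

last = s[-1]
print(f"U+{ord(last):04X}")        # U+E157
print(unicodedata.category(last))  # Co (private use)
```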

Jan 07 '25 13:01 cau-git

It's a character we don't want. It's a data quality issue.

Jan 07 '25 15:01 sanmai-NL

We have a custom cleanup function now that filters based on Unicode General Category. This character makes no sense in document text. To reiterate, what we request is a way to control which characters end up in Docling text nodes.
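A minimal sketch of what such a category-based filter might look like is given below. It is not the actual function referred to above: the name `strip_unwanted`, the set of dropped categories, and the whitespace exceptions are all assumptions chosen for illustration.

```python
import unicodedata

# Assumed policy: drop private-use (Co), unassigned (Cn), surrogate (Cs),
# and control (Cc) code points, but keep common whitespace controls.
UNWANTED_CATEGORIES = {"Co", "Cn", "Cs", "Cc"}
KEEP = {"\n", "\t", "\r"}


def strip_unwanted(text: str) -> str:
    """Return text without code points from the unwanted general categories."""
    return "".join(
        ch
        for ch in text
        if ch in KEEP or unicodedata.category(ch) not in UNWANTED_CATEGORIES
    )


print(repr(strip_unwanted("Contents\ue157")))
```

A filter like this has to run on the extracted text after conversion, which is exactly the extra step the request aims to make unnecessary.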

Jan 07 '25 15:01 sanmai-NL

I'm a bit confused by the conversation in this issue. I see two possible interpretations:

1. The character which is read by the HTML backend is wrong, i.e. "https://www.compart.com/en/unicode/U+E157 is encoded, instead of https://www.compart.com/en/unicode/U+2630"
2. There seems to be a wish to skip those "anchor / permalink / icon" HTML components

If we are talking about 1, I think it could be an encoding bug (to be verified). If we are talking about 2, it would require designing custom logic for parsing HTML pages, which, unfortunately, seems very specific to the actual page.

Jan 30 '25 06:01 dolfim-ibm

It's about 1.

Jan 30 '25 07:01 sanmai-NL

I have a similar issue where the exported Markdown includes encoded symbols such as &amp; and /ff instead of & and ff. Is there a way to toggle the decoding of HTML-like symbols?
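As a possible post-processing step outside Docling, the sketch below assumes the symbols in question are HTML character references (such as `&amp;`) and compatibility ligatures (such as U+FB00, the "ff" ligature). The helper name `decode_and_fold` is hypothetical, and whether this matches the actual exported output is not confirmed here.

```python
import html
import unicodedata


def decode_and_fold(text: str) -> str:
    """Decode HTML character references, then fold compatibility characters."""
    # Turn references such as "&amp;" back into the characters they stand for.
    unescaped = html.unescape(text)
    # NFKC folds compatibility characters; the ligature U+FB00 becomes "ff".
    return unicodedata.normalize("NFKC", unescaped)


print(decode_and_fold("R&amp;D sta\ufb00"))
```

Note that NFKC also rewrites other compatibility characters (for example superscript digits), so it may change more than the ligatures.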

Feb 05 '25 09:02 hey-nicolasklein
