Control HTML document Unicode decoding
Requested feature
Sometimes upstream tools, e.g. OCR libraries, output incorrectly encoded HTML. Because of visual similarity, an undesired and incorrect character such as https://www.compart.com/en/unicode/U+E157 gets encoded instead of https://www.compart.com/en/unicode/U+2630. Currently, when Docling parses an HTML document containing such a character, it (or rather, BeautifulSoup) escapes these characters. For example, this heading item:
<h2 id="contents">Contents<a class="headerlink" href="#contents" title="Permanent link"></a></h2>
ends up with the .text value:
'Contents\ue157'
I have not found a straightforward way to control this behavior from within Docling or BeautifulSoup.
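A minimal reproduction with plain BeautifulSoup (a sketch; it assumes the invisible U+E157 character sits inside the headerlink anchor, which is how it reaches the .text value above):

```python
from bs4 import BeautifulSoup

# The \ue157 private-use character is invisible in the pasted snippet above,
# but here it is placed inside the headerlink anchor for illustration.
html = (
    '<h2 id="contents">Contents'
    '<a class="headerlink" href="#contents" title="Permanent link">\ue157</a>'
    '</h2>'
)

soup = BeautifulSoup(html, "html.parser")
print(repr(soup.find("h2").text))  # 'Contents\ue157'
```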
Alternatives
I have not found a robust and direct method to process these escapes from within Python. String substitution tricks are possible, but they come at a performance cost.
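The kind of substitution trick meant here, as a sketch (not a Docling API; the U+E157 → U+2630 mapping is just the pairing from the example above):

```python
# Map known look-alike / icon-font characters to the intended ones via str.translate.
# Running this over every extracted text node is where the performance cost comes in.
LOOKALIKE_MAP = str.maketrans({"\ue157": "\u2630"})

def substitute_lookalikes(text: str) -> str:
    return text.translate(LOOKALIKE_MAP)

print(substitute_lookalikes("Contents\ue157"))  # 'Contents☰'
```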
@sanmai-NL I am not entirely sure what your request is. The escaped Unicode in the string representation will actually print as a symbol, as in:
>>> s = 'Contents\ue157'
>>> print(s)
Contents
How it prints depends on the interpreter.
It's a character we don't want. It's a data quality issue.
We now have a custom cleanup function that filters characters based on their Unicode General Category. This character makes no sense in document text. To reiterate, what we are requesting is a way to control which characters end up in Docling text nodes.
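The cleanup is along the lines of this sketch (not the exact function; it assumes dropping characters whose General Category is Co, private use, or Cn, unassigned):

```python
import unicodedata

DISALLOWED_CATEGORIES = {"Co", "Cn"}  # private use, unassigned

def clean_text(text: str) -> str:
    # unicodedata.category('\ue157') == 'Co', so that character is dropped.
    return "".join(
        ch for ch in text if unicodedata.category(ch) not in DISALLOWED_CATEGORIES
    )

print(clean_text("Contents\ue157"))  # 'Contents'
```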
I'm a bit confused by the conversation in this issue. I see two possible interpretations:
1. The character read by the HTML backend is wrong, i.e. "https://www.compart.com/en/unicode/U+E157 is encoded, instead of https://www.compart.com/en/unicode/U+2630"
2. There seems to be a wish to skip those "anchor / permalink / icon" HTML components
If we are talking about 1, I think it could be an encoding bug (to be verified). If we are talking about 2, it would require designing custom logic for parsing HTML pages, which, unfortunately, seems very specific to the actual page.
It's about 1.
I have a similar issue where the exported Markdown includes encoded symbols such as &amp; and /ff instead of & and ff. Is there a way to toggle the decoding of HTML-like symbols?
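A possible post-processing workaround, sketched here under the assumption that the symbols are standard HTML entities that survived into the exported Markdown (not a Docling option):

```python
import html

# html.unescape decodes standard HTML entities after the fact.
# Caveat: it will also decode entities that were meant to appear literally.
exported = "Alice &amp; Bob"
print(html.unescape(exported))  # 'Alice & Bob'
```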