
Control HTML document Unicode decoding

Open sanmai-NL opened this issue 11 months ago • 6 comments

Requested feature

Sometimes upstream tools, e.g., OCR libraries, output incorrectly encoded HTML. Because of visual similarity, an undesired and incorrect character such as https://www.compart.com/en/unicode/U+E157 may be encoded instead of https://www.compart.com/en/unicode/U+2630. Currently, when Docling parses an HTML document containing such a character, it (or rather, BeautifulSoup) escapes it. For example, this heading item:

<h2 id="contents">Contents<a class="headerlink" href="#contents" title="Permanent link"></a></h2>

ends up with the .text value:

'Contents\ue157'

I have not found a straightforward way to control this behavior from within Docling or BeautifulSoup.

Alternatives

I have not found a robust and direct method to process these escapes from within Python. String substitution tricks are possible but at a performance cost.

sanmai-NL avatar Jan 06 '25 10:01 sanmai-NL

@sanmai-NL I am not entirely sure what your request is. The escaped unicode in the string representation will actually print as a symbol, such as in:

> s = 'Contents\ue157'
> print(s)

Contents

How it prints depends on the interpreter.
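
To make the distinction concrete, here is a minimal standard-library sketch. It assumes nothing about Docling itself; it only inspects the string from the report. The `\ue157` in the repr is how the interpreter displays a single private-use code point, not a literal backslash sequence stored in the text.

```python
import unicodedata

s = "Contents\ue157"
last = s[-1]

# The escape in the repr stands for one code point, so the length is 8 + 1.
print(len(s))                          # 9
print(hex(ord(last)))                  # 0xe157
# U+E157 is in the Private Use Area; U+2630 is an ordinary symbol.
print(unicodedata.category(last))      # Co
print(unicodedata.category("\u2630"))  # So
```

Because U+E157 is a private-use code point, what it renders as depends on the font and terminal in use.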

cau-git avatar Jan 07 '25 13:01 cau-git

It's a character we don't want. It's a data quality issue.

sanmai-NL avatar Jan 07 '25 15:01 sanmai-NL

We have a custom cleanup function now that filters based on Unicode General Category. This character makes no sense in document text. To reiterate, what we request is a way to control which characters end up in Docling text nodes.
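
A filter of this kind could look roughly like the sketch below. This is not our actual function: the name `strip_unwanted` and the default category set are hypothetical and chosen only for illustration. It uses only `unicodedata` from the standard library.

```python
import unicodedata

# Hypothetical default: private-use (Co), unassigned (Cn), and surrogate (Cs) code points.
UNWANTED_CATEGORIES = frozenset({"Co", "Cn", "Cs"})


def strip_unwanted(text: str, categories: frozenset = UNWANTED_CATEGORIES) -> str:
    """Return `text` without characters whose General Category is in `categories`."""
    return "".join(ch for ch in text if unicodedata.category(ch) not in categories)


cleaned = strip_unwanted("Contents\ue157")
print(cleaned)  # Contents
```

Characters in other categories, such as U+2630 (category `So`), pass through unchanged. Applying this after conversion means walking every text node, which is why a hook inside Docling would be preferable.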

sanmai-NL avatar Jan 07 '25 15:01 sanmai-NL

I'm a bit confused by the conversation in this issue. I see two possible interpretations:

  1. The character which is read by the html backend is wrong, i.e. "https://www.compart.com/en/unicode/U+E157 is encoded, instead of https://www.compart.com/en/unicode/U+2630"
  2. There seems to be a wish to skip those "anchor / permalink / icon" HTML components

If we are talking about 1, I think it could be an encoding bug (to be verified). If we are talking about 2, it would require a design for custom logic in parsing html pages, which, unfortunately, seems very specific to the actual page.

dolfim-ibm avatar Jan 30 '25 06:01 dolfim-ibm

It's about 1.

sanmai-NL avatar Jan 30 '25 07:01 sanmai-NL

I have a similar issue where the exported markdown includes encoded symbols such as &amp; and /ff instead of & and ff. Is there a way to toggle the decoding of HTML-like symbols?
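
For the `&amp;` case, one possible workaround outside Docling is to decode HTML character references after export with the standard library's `html.unescape`. A minimal sketch, assuming the exported markdown is available as a plain string; the `exported` sample is made up, and this does not address the `/ff` case:

```python
import html

# Made-up sample standing in for exported markdown text.
exported = "Research &amp; Development"

decoded = html.unescape(exported)
print(decoded)  # Research & Development
```

This also decodes references such as `&lt;` and `&gt;`, which may not be wanted where the markdown relies on them being escaped.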

hey-nicolasklein avatar Feb 05 '25 09:02 hey-nicolasklein