Weird HTML entities in extract_from_html

Open hodak opened this issue 8 years ago • 2 comments

Hi, I have a problem: talon returns strange HTML entities in the text when I use extract_from_html.

File I used to reproduce it

Here I use the Polish ł character:

quotations.extract_from_html('Napisał(a):\n<blockquote><span>x</span></blockquote>')

and I get this response:

<html><head></head><body>Napisa&#197;&#8218;(a):
</body></html>

these entities map to:

&#197;  => Å
&#8218; => ‚
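
One possible explanation, offered as an assumption and not confirmed against talon's source: these two entities are exactly what you get when the UTF-8 bytes of ł (0xC5 0x82) are decoded as Windows-1252. The sketch below reproduces the pattern with the standard library only; the helper names `mojibake` and `as_entities` are made up for illustration and are not part of talon.

```python
def mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def as_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)


# Hypothetical reproduction of the symptom, independent of talon itself.
print(as_entities(mojibake("Napisał(a):")))  # Napisa&#197;&#8218;(a):
print(as_entities("Napisał(a):"))            # Napisa&#322;(a):
```

If this is what happens, the text is being decoded with the wrong codec somewhere between the input string and the serialized output.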

What's even stranger, when I replace x with ł inside the blockquote, it responds with:

<html><head></head><body>Napisa&#322;(a):
</body></html>

where &#322; is indeed the entity for the ł character, which is what I would expect, so the text would display correctly on a website.

hodak, Jun 21 '17 13:06

I have the same issue, how did you solve it?

janwirth, Jun 21 '22 14:06

I fixed it by encoding the string to bytes before passing it to talon, after reading this Stack Overflow post.

quotations.extract_from(email_message.html.encode("iso-8859-1"), 'text/html')

The output went from

<html><head></head><body><div dir="ltr">Yes, I got your email.&#194;&#160;<br></div><br></body></html>

to

<html><head></head><body><div dir="ltr">Yes, I got your email.&#160;<br></div><br></body></html>

The culprit &#194; is now gone.
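
One caveat with this workaround: ISO-8859-1 cannot represent every character, and a plain `.encode("iso-8859-1")` raises `UnicodeEncodeError` on text such as the Polish ł from the original report. A minimal sketch of a more forgiving variant, assuming talon accepts the resulting bytes the same way (not verified here), uses Python's `xmlcharrefreplace` error handler so that unencodable characters become numeric character references. The helper name `to_latin1_bytes` is hypothetical.

```python
def to_latin1_bytes(html: str) -> bytes:
    """Encode HTML as ISO-8859-1, turning unencodable characters into &#N; references."""
    return html.encode("iso-8859-1", errors="xmlcharrefreplace")


# A non-breaking space fits in ISO-8859-1 and becomes the single byte 0xA0.
print(to_latin1_bytes("Yes, I got your email.\u00a0"))

# ł does not fit, so it is written as a character reference instead of raising.
print(to_latin1_bytes("Napisał(a):"))  # b'Napisa&#322;(a):'
```

The result could then be passed in place of `email_message.html.encode("iso-8859-1")` in the call above.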

janwirth, Jun 21 '22 15:06