Weird HTML entities in extract_from_html
Hi, I have a problem: talon returns strange HTML entities in the text when I use extract_from_html.
Here I use the Polish ł character:
quotations.extract_from_html('Napisał(a):\n<blockquote><span>x</span></blockquote>')
and I get this response:
<html><head></head><body>Napisa&#197;&#130;(a):
</body></html>
These entities map to:
&#197; => Å
&#130; => ‚
What's even stranger, when I replace x with ł inside the blockquote, it responds with:
<html><head></head><body>Napisa&#322;(a):
</body></html>
where &#322; is indeed the entity for the ł character, which is what I would expect, so the text displays correctly on a website.
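For anyone trying to reproduce this: ł (U+0142, decimal 322) is encoded in UTF-8 as the two bytes 0xC5 0x82, that is 197 and 130. The bad output looks like each UTF-8 byte was turned into its own numeric entity. The sketch below only shows that arithmetic with the standard library; the two helper names are made up for illustration, and it does not claim to show what talon does internally.

```python
import html


def utf8_bytes_as_entities(text: str) -> str:
    """Hypothetical helper: emit one numeric entity per UTF-8 byte (the broken form)."""
    return "".join(f"&#{byte};" for byte in text.encode("utf-8"))


def code_points_as_entities(text: str) -> str:
    """Hypothetical helper: emit one numeric entity per non-ASCII code point (the expected form)."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


broken = utf8_bytes_as_entities("ł")
expected = code_points_as_entities("Napisał(a):")

print(broken)                 # &#197;&#130;
print(html.unescape(broken))  # Å‚  (HTML parsers map &#130; to U+201A)
print(expected)               # Napisa&#322;(a):
```

If the output contains the pair &#197;&#130; where a single &#322; is expected, the text was most likely handled byte by byte somewhere along the way instead of as decoded characters.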
I have the same issue. How did you solve it?
I fixed it by encoding the string to bytes before passing it to talon, after reading this Stack Overflow post.
quotations.extract_from(email_message.html.encode("iso-8859-1"), 'text/html')
The output went from
<html><head></head><body><div dir="ltr">Yes, I got your email. <br></div><br></body></html>
to
<html><head></head><body><div dir="ltr">Yes, I got your email. <br></div><br></body></html>
The culprit  is now gone.
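One caveat on the workaround above, as a hedged sketch: ISO-8859-1 only covers code points up to U+00FF, so str.encode("iso-8859-1") raises UnicodeEncodeError for text containing ł, as in the original report. The can_encode helper below is hypothetical and uses only the standard library. Whether passing UTF-8 bytes to talon gives clean output in every case is an assumption that this thread does not confirm.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Hypothetical helper: report whether `text` can be encoded without errors."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


western_latin1 = can_encode("café", "iso-8859-1")        # é is U+00E9, inside Latin-1
polish_latin1 = can_encode("Napisał(a):", "iso-8859-1")  # ł is U+0142, outside Latin-1
polish_utf8 = can_encode("Napisał(a):", "utf-8")         # UTF-8 covers all of Unicode

print(western_latin1, polish_latin1, polish_utf8)  # True False True
```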