Support content encodings other than UTF-8 (support Swedish characters)
Reported by [email protected], Aug 13, 2010
When using Scrapemark to get text from Swedish websites, or from any site that does not use UTF-8 as its content encoding (which is common here), Scrapemark removes all special characters (åäö): the text "Hjälp" becomes "Hjlp".
What steps will reproduce the problem?
- Use the URL of a homepage with content encoding ISO-8859-1, for example this Swedish homepage: http://www.asciitabell.se/
- Scrape the title with the pattern:
<title>{{ }}</title>
What is the expected output? What do you see instead? The output is "ASCII-tabellen (8 bitars utkad ASCII, enligt ISO 8859-1)"; the expected result would be "ASCII-tabellen (8 bitars utökad ASCII, enligt ISO 8859-1)" (notice the o with two dots in the middle :))
What version of the product are you using? On what operating system? Version 0.9, Mac OSX Snow Leopard
Please provide any additional information below. I wrote a patch that fixes this by simply looking at the response header: if the header includes iso-8859, the result is decoded and then encoded to UTF-8 before being sent to the other functions. This could possibly be made more generic to work with content encodings other than iso-8859.
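The idea described above can be sketched like this in modern Python (Scrapemark itself targeted Python 2; the function name and the ISO-8859-1 fallback here are illustrative assumptions, not the actual patch):

```python
import re

def to_utf8(body, content_type_header):
    """Decode a response body to UTF-8 using the charset named in the
    Content-Type header, falling back to ISO-8859-1 when none is given.
    Illustrative sketch only, not Scrapemark's actual code."""
    m = re.search(r'charset=([\w-]+)', content_type_header, re.I)
    charset = m.group(1) if m else 'iso-8859-1'
    # Decode from the declared charset, then re-encode as UTF-8
    # so downstream functions always see UTF-8 bytes.
    return body.decode(charset).encode('utf-8')

# "Hjälp" served as Latin-1 now round-trips with its special characters intact:
swedish = 'Hjälp'.encode('iso-8859-1')
assert to_utf8(swedish, 'text/html; charset=ISO-8859-1') == 'Hjälp'.encode('utf-8')
```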
diff --git a/scrapemark.py b/scrapemark.py
index 7b4cf72..be0327c 100644
--- a/scrapemark.py
+++ b/scrapemark.py
@@ -530,7 +530,11 @@ def _decode_entities(s):
     def _substitute_entity(m):
         ent = m.group(2)
         if m.group(1) == "#":
-            return unichr(int(ent))
+            # Hex value
+            if ent[0] == 'x':
+                return unichr(int(ent[1:], 16))
+            else:
+                return unichr(int(ent))
         else:
             cp = name2codepoint.get(ent)
             if cp:
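Ported to Python 3 (`chr` in place of Python 2's `unichr`), the entity decoder with the patched hex handling behaves like this sketch (the regex is a simplified stand-in for Scrapemark's actual one):

```python
import re
from html.entities import name2codepoint

def substitute_entity(m):
    # Mirrors the patched logic: numeric entities may be decimal
    # (&#228;) or hexadecimal (&#xE4;); named entities use the
    # standard name-to-codepoint table.
    ent = m.group(2)
    if m.group(1) == '#':
        if ent[0] == 'x':
            return chr(int(ent[1:], 16))
        return chr(int(ent))
    cp = name2codepoint.get(ent)
    return chr(cp) if cp else m.group()

def decode_entities(s):
    # Simplified entity pattern for illustration.
    return re.sub(r'&(#?)(\w+);', substitute_entity, s)

# Decimal, hexadecimal, and named entities all resolve to "Hjälp":
assert decode_entities('Hj&#228;lp') == 'Hjälp'
assert decode_entities('Hj&#xE4;lp') == 'Hjälp'
assert decode_entities('Hj&auml;lp') == 'Hjälp'
```

Before the patch, the hex form `&#xE4;` would raise a `ValueError` from `int('xE4')`; the added branch parses it base-16 instead.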