Support content encodings other than UTF-8 (support Swedish characters)
Reported by [email protected], Aug 13, 2010
When using Scrapemark to get text from Swedish websites, or from any site that does not use UTF-8 as its content encoding (which is common here), Scrapemark removes all special characters (åäö): the text "Hjälp" becomes "Hjlp".
What steps will reproduce the problem?
- Use the URL of a homepage with content encoding ISO-8859-1, for example this Swedish homepage: http://www.asciitabell.se/
- Scrape the title with the pattern:
<title>{{ }}</title>
What is the expected output? What do you see instead? The output is "ASCII-tabellen (8 bitars utkad ASCII, enligt ISO 8859-1)"; the expected result would be "ASCII-tabellen (8 bitars utökad ASCII, enligt ISO 8859-1)" (notice the o with two dots in the middle :))
What version of the product are you using? On what operating system? Version 0.9, Mac OSX Snow Leopard
Please provide any additional information below. I wrote a patch that fixes this by simply looking at the response header: if the header includes iso-8859, the result is decoded and then encoded to UTF-8 before being sent to the other functions. This could possibly be made more generic to work with content encodings other than iso-8859.
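The idea described above can be sketched like this in modern Python (Scrapemark itself targeted Python 2; the function name and the ISO-8859-1 fallback here are illustrative assumptions, not the actual patch):

```python
import re

def to_utf8(body, content_type_header):
    """Decode a response body to UTF-8 using the charset named in the
    Content-Type header, falling back to ISO-8859-1 when none is given.
    Illustrative sketch only, not Scrapemark's actual code."""
    m = re.search(r'charset=([\w-]+)', content_type_header, re.I)
    charset = m.group(1) if m else 'iso-8859-1'
    # Decode from the declared charset, then re-encode as UTF-8
    # so downstream functions always see UTF-8 bytes.
    return body.decode(charset).encode('utf-8')

# "Hjälp" served as Latin-1 now round-trips with its special characters intact:
swedish = 'Hjälp'.encode('iso-8859-1')
assert to_utf8(swedish, 'text/html; charset=ISO-8859-1') == 'Hjälp'.encode('utf-8')
```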
diff --git a/scrapemark.py b/scrapemark.py
index 7b4cf72..be0327c 100644
--- a/scrapemark.py
+++ b/scrapemark.py
@@ -530,7 +530,11 @@ def _decode_entities(s):
     def _substitute_entity(m):
         ent = m.group(2)
         if m.group(1) == "#":
-            return unichr(int(ent))
+            # Hex value
+            if ent[0] == 'x':
+                return unichr(int(ent[1:], 16))
+            else:
+                return unichr(int(ent))
         else:
             cp = name2codepoint.get(ent)
             if cp:
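Ported to Python 3 (`chr` in place of Python 2's `unichr`), the entity decoder with the patched hex handling behaves like this sketch (the regex is a simplified stand-in for Scrapemark's actual one):

```python
import re
from html.entities import name2codepoint

def substitute_entity(m):
    # Mirrors the patched logic: numeric entities may be decimal
    # (&#228;) or hexadecimal (&#xE4;); named entities use the
    # standard name-to-codepoint table.
    ent = m.group(2)
    if m.group(1) == '#':
        if ent[0] == 'x':
            return chr(int(ent[1:], 16))
        return chr(int(ent))
    cp = name2codepoint.get(ent)
    return chr(cp) if cp else m.group()

def decode_entities(s):
    # Simplified entity pattern for illustration.
    return re.sub(r'&(#?)(\w+);', substitute_entity, s)

# Decimal, hexadecimal, and named entities all resolve to "Hjälp":
assert decode_entities('Hj&#228;lp') == 'Hjälp'
assert decode_entities('Hj&#xE4;lp') == 'Hjälp'
assert decode_entities('Hj&auml;lp') == 'Hjälp'
```

Before the patch, the hex form `&#xE4;` would raise a `ValueError` from `int('xE4')`; the added branch parses it base-16 instead.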