scrapemark
Scrapemark fails to decode hex-encoded HTML entities
Reported by [email protected], Aug 11, 2010
Scrape something that has an HTML entity encoded in hex (e.g. the title of http://www.youtube.com/videos).
The entity should be decoded; instead, a ValueError is raised.
At the time of writing, the title of the above-mentioned YouTube page is (some whitespace removed for clarity):
<title>YouTube - &#x202a;Most viewed videos&#x202c;&lrm</title>
Test code below:
#!/usr/bin/env python
import scrapemark
url = "http://www.youtube.com/videos"
data = scrapemark.scrape("<title>{{title}}</title>", url = url)
print data['title']
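The ValueError most likely comes from the entity body being parsed as a base-10 integer. The following is a minimal sketch that reproduces this without network access or scrapemark itself. It assumes that, for `&#x202a;`, the text handed to `int()` is `x202a`; the variable names are illustrative only.

```python
# Assumption: for "&#x202a;" the entity regex captures the body "x202a".
ent = "x202a"

try:
    int(ent)              # base-10 parse, as the unpatched code does
    failed = False
except ValueError:
    failed = True         # "x202a" is not a valid decimal literal

# Parsing the digits after the leading "x" as base 16 succeeds.
codepoint = int(ent[1:], 16)
```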
I've attached a patch:
diff --git a/scrapemark.py b/scrapemark.py
index 7b4cf72..be0327c 100644
--- a/scrapemark.py
+++ b/scrapemark.py
@@ -530,7 +530,11 @@ def _decode_entities(s):
 def _substitute_entity(m):
     ent = m.group(2)
     if m.group(1) == "#":
-        return unichr(int(ent))
+        # Hex value
+        if ent[0] == 'x':
+            return unichr(int(ent[1:], 16))
+        else:
+            return unichr(int(ent))
     else:
         cp = name2codepoint.get(ent)
         if cp:
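For anyone who wants to try the idea outside scrapemark, here is a standalone sketch of the same decoding logic. It is written for Python 3 (`chr` and `html.entities` instead of `unichr` and `htmlentitydefs`). The regex `_ENTITY_RE` and the function names are assumptions made for illustration, not scrapemark's actual code. It also accepts an uppercase `X` prefix and leaves malformed references untouched, which the patch above does not attempt.

```python
import re
from html.entities import name2codepoint

# Hypothetical pattern (not scrapemark's own): group 1 is "#" for numeric
# references, group 2 is the entity body.
_ENTITY_RE = re.compile(r"&(#?)([0-9A-Za-z]+);")

def substitute_entity(m):
    ent = m.group(2)
    if m.group(1) == "#":
        try:
            if ent[:1] in ("x", "X"):
                # Hex reference: parse the digits after the prefix as base 16.
                return chr(int(ent[1:], 16))
            # Decimal reference.
            return chr(int(ent))
        except (ValueError, OverflowError):
            # Malformed or out-of-range reference: leave the text as-is.
            return m.group()
    cp = name2codepoint.get(ent)
    return chr(cp) if cp else m.group()

def decode_entities(s):
    return _ENTITY_RE.sub(substitute_entity, s)
```

With this sketch, `decode_entities("&#x202a;Most viewed videos&#x202c;")` yields the title text wrapped in the U+202A and U+202C directional characters rather than raising an error.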