Scrapemark fails to decode hex-encoded HTML entities

Open arshaw opened this issue 14 years ago • 0 comments

Reported by [email protected], Aug 11, 2010

Scrape something that has an HTML entity encoded in hex (e.g. the title of http://www.youtube.com/videos).

The entity should be decoded; instead, a ValueError is thrown.

At the time of writing, the title of the above-mentioned YouTube page is (some whitespace removed for clarity):

<title>YouTube - &#x202a;Most viewed videos&#x202c;&lrm</title>
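The failure can be narrowed down without any network access. The sketch below assumes (based on the patch context further down) that the text after "&#" is passed straight to a base-10 int() call; the variable names are illustrative only, and chr stands in for Python 2's unichr so the snippet runs on Python 3.

```python
# What follows "&#" in the entity "&#x202a;" (illustrative variable name).
ent = "x202a"

# Base-10 parsing, as the unpatched code path appears to do, rejects the "x" prefix.
try:
    int(ent)
    failed = False
except ValueError:
    failed = True

# Dropping the "x" and parsing as base 16 yields the intended code point.
# (On Python 2 this would be unichr instead of chr.)
decoded = chr(int(ent[1:], 16))

print(failed, repr(decoded))
```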

Test code below:

#!/usr/bin/env python
import scrapemark

url = "http://www.youtube.com/videos"
data = scrapemark.scrape("<title>{{title}}</title>", url = url)
print data['title']

I've attached a patch:

diff --git a/scrapemark.py b/scrapemark.py
index 7b4cf72..be0327c 100644
--- a/scrapemark.py
+++ b/scrapemark.py
@@ -530,7 +530,11 @@ def _decode_entities(s):
 def _substitute_entity(m):
    ent = m.group(2)
    if m.group(1) == "#":
-       return unichr(int(ent))
+       # Hex value
+       if ent[0] == 'x':
+           return unichr(int(ent[1:], 16))
+       else:
+           return unichr(int(ent))
    else:
        cp = name2codepoint.get(ent)
        if cp:
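For anyone who wants to try the idea outside of scrapemark, here is a minimal standalone sketch of the same approach. It is not scrapemark's actual code: the regex, the function names decode_entities and substitute_entity, and the fallback of leaving malformed entities untouched are all assumptions made for illustration. It is written for Python 3 (chr and html.entities; on Python 2 these would be unichr and htmlentitydefs) and, unlike the patch above, it also accepts an uppercase "X" prefix.

```python
import re
from html.entities import name2codepoint  # htmlentitydefs on Python 2

# Hypothetical pattern: "&", an optional "#", an alphanumeric body, then ";".
_ENTITY_RE = re.compile(r"&(#?)([a-zA-Z0-9]+);")


def substitute_entity(m):
    """Return the character for one matched entity, or the match unchanged."""
    ent = m.group(2)
    if m.group(1) == "#":
        try:
            if ent[0] in "xX":
                # Hex value, e.g. "&#x202a;"
                return chr(int(ent[1:], 16))
            # Decimal value, e.g. "&#8234;"
            return chr(int(ent))
        except (ValueError, OverflowError):
            # Malformed or out-of-range reference: leave it as-is.
            return m.group()
    cp = name2codepoint.get(ent)
    return chr(cp) if cp else m.group()


def decode_entities(s):
    """Decode named, decimal, and hex entities in a text string."""
    return _ENTITY_RE.sub(substitute_entity, s)


title = "YouTube - &#x202a;Most viewed videos&#x202c;&lrm"
print(repr(decode_entities(title)))
```

Note that the trailing "&lrm" in the title has no semicolon, so this sketch leaves it undecoded; whether scrapemark should be that strict is a separate question.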

Feb 09 '11 05:02 arshaw