scrapely
Support CJK string annotations; print CJK strings readably in scrapely.tool's output
scrapely.tool crashes when a CJK string is used as an annotation:
$ python -m scrapely.tool blog.json
scrapely> ta http://blog.douban.com/douban/2013/07/04/2630/
[0] http://blog.douban.com/douban/2013/07/04/2630/
scrapely> t 0 算法工程师如何改进豆瓣电影 TOP250
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/Users/xyb/scrapely.xyb/scrapely/tool.py", line 189, in <module>
main()
File "/Users/xyb/scrapely.xyb/scrapely/tool.py", line 186, in main
t.cmdloop()
File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/cmd.py", line 142, in cmdloop
stop = self.onecmd(line)
File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/cmd.py", line 221, in onecmd
return func(arg)
File "/Users/xyb/scrapely.xyb/scrapely/tool.py", line 48, in do_t
selection = apply_criteria(criteria, tm)
File "/Users/xyb/scrapely.xyb/scrapely/tool.py", line 147, in apply_criteria
sel = tm.select(func)
File "scrapely/template.py", line 48, in select
score = score_func(fragment, htmlpage)
File "scrapely/template.py", line 95, in func
if text in fdata:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)
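The last frame suggests that the annotation arrived as a UTF-8 byte string (0xe7 is the first byte of 算 in UTF-8) while the page fragment was unicode, so Python 2 attempted an implicit ASCII decode inside `text in fdata`. Below is a minimal Python 3 sketch of the usual remedy: decode once at the input boundary and compare text with text. `to_text` and `contains` are hypothetical names, not scrapely functions, and UTF-8 is an assumed input encoding.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings explicitly.

    Hypothetical helper for illustration; the encoding is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def contains(needle, haystack, encoding="utf-8"):
    """Substring test that never mixes bytes and text."""
    return to_text(needle, encoding) in to_text(haystack, encoding)


# A byte-string annotation, as a terminal might deliver it, against a text fragment.
annotation = "算法工程师".encode("utf-8")
fragment = "<h1>算法工程师如何改进豆瓣电影 TOP250</h1>"
print(contains(annotation, fragment))
```

Decoding at the boundary keeps the comparison itself free of encoding concerns, whichever side of it the bytes come from.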
I fixed it, and also made scrapely.tool's output more readable when it includes CJK Unicode characters:
$ python -m scrapely.tool blog.json
scrapely> t 0 算法工程师如何改进豆瓣电影 TOP250
[0] u'<h1>算法工程师如何改进豆瓣电影 TOP250</h1>'
[1] u'<title>豆瓣blog » Blog Archive » 算法工程师如何改进豆瓣电影 TOP250</title>'
[2] u'<link rel="alternate" type="application/rss+xml" title="豆瓣blog » 算法工程师如何改进豆瓣电影 TOP250 评论 Feed" href="http://blog.douban.com/douban/2013/07/04/2630/feed/" />'
scrapely>
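The `readable_repr` function from this PR is not reproduced in the thread. As a rough sketch of the idea, assuming Python 3: start from the fully escaped form and turn `\uXXXX` escapes back into characters, while leaving escaped backslashes alone so a literal `\\u535a` stays as it is. `readable_repr_sketch` is a hypothetical name, and this version handles only four-digit `\u` escapes.

```python
import re

# Match an escaped backslash pair first, so that "\\u535a" is not
# mistaken for a real \u escape; otherwise match \uXXXX.
_ESCAPE = re.compile(r"\\\\|\\u([0-9a-fA-F]{4})")


def readable_repr_sketch(obj):
    """Return an ASCII-escaped repr with \\uXXXX escapes made readable.

    Illustrative only: \\xXX and \\UXXXXXXXX escapes are left untouched.
    """
    escaped = ascii(obj)  # e.g. "'cjk \\u4e2d\\u65e5\\u97e9'"

    def _unescape(match):
        if match.group(1) is None:
            return match.group(0)  # escaped backslash pair, keep as is
        return chr(int(match.group(1), 16))

    return _ESCAPE.sub(_unescape, escaped)


print(readable_repr_sketch("cjk 中日韩"))  # prints 'cjk 中日韩'
```

On Python 3 the built-in `repr` already keeps printable non-ASCII characters, so a helper like this only matters where the escaped form is the starting point.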
A doctest is reasonable. Actually, I tried adding a doctest for this, but it failed:
>>> u = u'cjk 中日韩 \\u535a'
>>> u
u'cjk \u4e2d\u65e5\u97e9 \\u535a'
>>> repr(u)
"u'cjk \\u4e2d\\u65e5\\u97e9 \\\\u535a'"
>>> print repr(u)
u'cjk \u4e2d\u65e5\u97e9 \\u535a'
>>> readable_repr(u)
u"u'cjk \u4e2d\u65e5\u97e9 \\\\u535a'"
>>> print readable_repr(u)
u'cjk 中日韩 \\u535a'
This is copied from the Python shell output and could serve as documentation. But if you run it as a doctest, you get this strange result:
**********************************************************************
File "readable_repr.py", line 12, in __main__.readable_repr
Failed example:
u
Expected:
u'cjk \u4e2d\u65e5\u97e9 \u535a'
Got:
u'cjk \xe4\xb8\xad\xe6\x97\xa5\xe9\x9f\xa9 \u535a'
**********************************************************************
File "readable_repr.py", line 14, in __main__.readable_repr
Failed example:
repr(u)
Expected:
"u'cjk \u4e2d\u65e5\u97e9 \\u535a'"
Got:
"u'cjk \\xe4\\xb8\\xad\\xe6\\x97\\xa5\\xe9\\x9f\\xa9 \\u535a'"
**********************************************************************
File "readable_repr.py", line 16, in __main__.readable_repr
Failed example:
print repr(u)
Expected:
u'cjk \u4e2d\u65e5\u97e9 \u535a'
Got:
u'cjk \xe4\xb8\xad\xe6\x97\xa5\xe9\x9f\xa9 \u535a'
**********************************************************************
File "readable_repr.py", line 18, in __main__.readable_repr
Failed example:
readable_repr(u)
Expected:
u"u'cjk \u4e2d\u65e5\u97e9 \\u535a'"
Got:
u"u'cjk \\xe4\\xb8\\xad\\xe6\\x97\\xa5\\xe9\\x9f\\xa9 \u535a'"
/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/doctest.py:1531: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if got == want:
/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/doctest.py:1551: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if got == want:
**********************************************************************
File "readable_repr.py", line 20, in __main__.readable_repr
Failed example:
print readable_repr(u)
Expected:
u'cjk 中日韩 \u535a'
Got:
u'cjk \xe4\xb8\xad\xe6\x97\xa5\xe9\x9f\xa9 博'
**********************************************************************
1 items had failures:
5 of 6 in __main__.readable_repr
***Test Failed*** 5 failures.
In Python 2.x, doctests just can't handle non-ASCII text. There are some bugs about that in the Python bug tracker, but as I recall they were all closed because the issue is fixed in Python 3.x. In 2.x it won't work.
Maybe just add a unittest if doctests don't handle non-ascii text in Python 2.x?
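A unit test along these lines might look like the sketch below. It assumes Python 3 and uses the built-in `repr` as a stand-in, since the PR's `readable_repr` is not shown in this thread; the class and method names are hypothetical. Writing the expected text with `\u` escapes keeps the test source pure ASCII, which avoids source-encoding problems of the kind that trip up the doctest.

```python
import unittest


class ReadableReprTest(unittest.TestCase):
    """Stand-in test: checks that repr keeps CJK characters readable."""

    def test_cjk_characters_are_kept(self):
        text = u"cjk \u4e2d\u65e5\u97e9"
        # On Python 3, repr() leaves printable non-ASCII characters as is.
        self.assertEqual(repr(text), u"'cjk \u4e2d\u65e5\u97e9'")

    def test_literal_backslash_is_escaped(self):
        text = u"\\u535a"  # a backslash followed by "u535a", not a CJK character
        self.assertEqual(repr(text), u"'\\\\u535a'")
```

Saved in a test module, this can be run with `python -m unittest`.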
@pablohoffman, @kmike, sorry for the delayed reply. I have added unit tests for the readable_repr
function and for the best_match text encoding fix (already moved to
scrapely.tool).
Any updates?
@akkatracker if you use the latest scrapely master on Python 3, it should print all characters correctly. Fixing it for Python 2.x could be ugly.
Unicode input issues are fixed by #46, both for Python 2.x and 3.x.
The issue from the PR description should be fixed in scrapely master if you use Python 3.x. This PR provides some nice unit tests, fixes similar to #56, and an unfinished attempt to fix unicode output for Python 2.x; that's why it is not closed.