
lxml.etree raises an error when scraping one particular book (fix included)

Open yyihuan opened this issue 3 years ago • 0 comments

When scraping the book at https://hit-scir.gitbooks.io/neural-networks-and-deep-learning-zh_cn/content/, most pages are processed normally, but one page raises an error and aborts the run.

done :  https://hit-scir.gitbooks.io/neural-networks-and-deep-learning-zh_cn/content/chap3/c3s0.html
Traceback (most recent call last):
  File "gitbook.py", line 5, in <module>
    Gitbook2PDF(url).run()
  File "/Users/cxjh168/Downloads/gitbook2pdf-master/gitbook2pdf/gitbook2pdf.py", line 198, in run
    loop.run_until_complete(self.crawl_main_content(content_urls))
  File "/Users/cxjh168/anaconda3/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
    return future.result()
  File "/Users/cxjh168/Downloads/gitbook2pdf-master/gitbook2pdf/gitbook2pdf.py", line 220, in crawl_main_content
    await asyncio.gather(*tasks)
  File "/Users/cxjh168/Downloads/gitbook2pdf-master/gitbook2pdf/gitbook2pdf.py", line 241, in gettext
    text = ChapterParser(metatext, title, level, ).parser()
  File "/Users/cxjh168/Downloads/gitbook2pdf-master/gitbook2pdf/gitbook2pdf.py", line 105, in parser
    return html.unescape(ET.tostring(context).decode())
  File "src/lxml/etree.pyx", line 3437, in lxml.etree.tostring
  File "src/lxml/serializer.pxi", line 139, in lxml.etree._tostring
  File "src/lxml/serializer.pxi", line 199, in lxml.etree._raiseSerialisationError
lxml.etree.SerialisationError: IO_ENCODER
(base) MacBook-Pro:gitbook2pdf-master

My fix was to change line 105 of gitbook2pdf.py so that an explicit encoding is passed to ET.tostring:

return html.unescape(ET.tostring(context).decode())  # before
return html.unescape(ET.tostring(context, encoding='utf-8').decode())  # after

With this change, the script runs normally.
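
For anyone who wants to try the change outside the project, here is a minimal sketch of the modified line in isolation. The function name `serialize_chapter` and the sample `<p>` element are hypothetical and are not part of gitbook2pdf. The sketch falls back to the standard library's `xml.etree.ElementTree` if lxml is not installed, on the assumption that its `tostring` accepts the same `encoding` argument for this simple case. It does not reproduce the `IO_ENCODER` error itself, whose exact trigger on that page is not established in this report; it only shows what the two serialization calls return.

```python
import html

try:
    from lxml import etree as ET  # the library named in the traceback
except ImportError:
    # Fallback for environments without lxml (an assumption of this sketch).
    import xml.etree.ElementTree as ET


def serialize_chapter(context):
    """Hypothetical helper mirroring the modified line 105.

    Serializes the element as UTF-8 bytes, decodes them explicitly,
    then unescapes any remaining HTML entities.
    """
    raw = ET.tostring(context, encoding="utf-8")
    return html.unescape(raw.decode("utf-8"))


# Sample element with non-ASCII text, standing in for a parsed chapter.
node = ET.Element("p")
node.text = "神经网络"

# Default call: ASCII output, non-ASCII text written as numeric character references.
print(ET.tostring(node))

# Modified call: the characters are written directly as UTF-8.
print(serialize_chapter(node))
```

With the default call the serializer has to emit ASCII, so every Chinese character goes out as a character reference that `html.unescape` later converts back. Passing `encoding='utf-8'` skips that round trip, which may be why the failing page serializes cleanly after the change.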

yyihuan, Jun 13 '21 17:06