
Parser: book content's corrupted or not present: 9781098122836

Open ivanpagac opened this issue 4 years ago • 8 comments

Parser: book content's corrupted or not present: node13-ch5.html (Chapter 5: Top 5 Developer-friendly Node.js API Frameworks)

However, I can browse the page in a browser without any problem:

https://learning.oreilly.com/library/view/nodejs-tools/9781098122836/Text/node13-ch5.html

ivanpagac avatar Apr 29 '20 05:04 ivanpagac

Have you tried with the latest version?

lorenzodifuccia avatar May 20 '20 20:05 lorenzodifuccia

Yes, cloned this morning; same result. Log attached:

[21/May/2020 07:11:01] ** Welcome to SafariBooks! **
[21/May/2020 07:11:01] Logging into Safari Books Online...
[21/May/2020 07:11:07] Successfully authenticated.
[21/May/2020 07:11:07] Retrieving book info...
[21/May/2020 07:11:07] Title: Node.js: Tools & Skills, 2nd Edition
[21/May/2020 07:11:07] Authors: Manjunath M, Jay Raj, Nilson Jacques, Michael Wanyoike, James Hibbard
[21/May/2020 07:11:07] Identifier: 9781098122836
[21/May/2020 07:11:07] ISBN: 9781925836394
[21/May/2020 07:11:07] Publishers: SitePoint
[21/May/2020 07:11:07] Rights: Copyright © SitePoint
[21/May/2020 07:11:07] Description: While there have been quite a few attempts to get JavaScript working as a server-side language, Node.js (frequently just called Node) has been the first environment that's gained any traction. It's now used by companies such as Netflix, Uber and Paypal to power their web apps. Node allows for blazingly fast performance; thanks to its event loop model, common tasks like network connection and database I/O can be executed very quickly indeed.In this book, we'll take a look at a selection of the re...
[21/May/2020 07:11:07] Release Date: 2020-04-24
[21/May/2020 07:11:07] URL: https://learning.oreilly.com/library/view/nodejs-tools/9781098122836/
[21/May/2020 07:11:07] Retrieving book chapters...
[21/May/2020 07:11:08] Output directory: /*************/Books/Node.js Tools _ Skills (9781098122836)
[21/May/2020 07:11:08] Downloading book contents... (9 chapters)
[21/May/2020 07:11:08] Crawler: found a new CSS at https://learning.oreilly.com/library/css/nodejs-tools/9781098122836/Styles/page_styles.css
[21/May/2020 07:11:08] Crawler: found a new CSS at https://learning.oreilly.com/library/css/nodejs-tools/9781098122836/Styles/stylesheet.css
[21/May/2020 07:11:08] Crawler: found a new CSS at https://learning.oreilly.com/static/CACHE/css/output.8054605313ed.css
[21/May/2020 07:11:08] Created: node13-frontmatter.xhtml
[21/May/2020 07:11:09] Created: node13-preface.xhtml
[21/May/2020 07:11:09] Created: node13-ch1.xhtml
[21/May/2020 07:11:09] Created: node13-ch2.xhtml
[21/May/2020 07:11:10] Created: node13-ch3.xhtml
[21/May/2020 07:11:11] Created: node13-ch4.xhtml
[21/May/2020 07:11:11] Parser: book content's corrupted or not present: node13-ch5.html (Chapter 5: Top 5 Developer-friendly Node.js API Frameworks)
[21/May/2020 07:11:11] Last request done:
	URL: https://learning.oreilly.com/api/v1/book/9781098122836/chapter-content/Text/node13-ch5.html
	DATA: None
	OTHERS: {}

ivanpagac avatar May 21 '20 05:05 ivanpagac

Hi @lorenzodifuccia, thanks for the tool.

I have the same problem with another book, Fluent Python.

Here's the log:

[10/Sep/2020 14:06:53] ** Welcome to SafariBooks! **
[10/Sep/2020 14:06:53] Logging into Safari Books Online...
[10/Sep/2020 14:06:58] Successfully authenticated.
[10/Sep/2020 14:06:58] Retrieving book info...
[10/Sep/2020 14:06:58] Title: Fluent Python, 2nd Edition
[10/Sep/2020 14:06:58] Authors: Luciano Ramalho
[10/Sep/2020 14:06:58] Identifier: 9781492056348
[10/Sep/2020 14:06:58] ISBN: 9781492056355
[10/Sep/2020 14:06:58] Publishers: O'Reilly Media, Inc.
[10/Sep/2020 14:06:58] Rights: Copyright © 2021 Luciano Ramalho
[10/Sep/2020 14:06:58] Description: Python’s simplicity lets you become productive quickly, but often this means you aren’t using everything it has to offer. With the updated edition of this hands-on guide, you’ll learn how to write effective, modern Python 3 code by leveraging its best ideas.Don’t waste time bending Python to fit patterns you learned in other languages. Discover and apply idiomatic Python 3 features beyond your past experience. Author Luciano Ramalho guides you through Python’s core language features and librarie...
[10/Sep/2020 14:06:58] Release Date: 2021-07-25
[10/Sep/2020 14:06:58] URL: https://learning.oreilly.com/library/view/fluent-python-2nd/9781492056348/
[10/Sep/2020 14:06:58] Retrieving book chapters...
[10/Sep/2020 14:07:01] Output directory:
    /Users/leninluque/safaribooks/Books/Fluent Python 2nd Edition (9781492056348)
[10/Sep/2020 14:07:01] Downloading book contents... (23 chapters)
[10/Sep/2020 14:07:01] Crawler: found a new CSS at https://learning.oreilly.com/library/css/fluent-python-2nd/9781492056348/epub.css
[10/Sep/2020 14:07:01] Crawler: found a new CSS at https://learning.oreilly.com/static/CACHE/css/output.731fc84c4f9a.css
[10/Sep/2020 14:07:01] Created: cover.xhtml
[10/Sep/2020 14:07:01] Created: toc01.xhtml
[10/Sep/2020 14:07:01] Created: titlepage01.xhtml
[10/Sep/2020 14:07:02] Created: copyright-page01.xhtml
[10/Sep/2020 14:07:02] Created: dedication01.xhtml
[10/Sep/2020 14:07:02] Created: preface01.xhtml
[10/Sep/2020 14:07:02] Created: part01.xhtml
[10/Sep/2020 14:07:03] Created: ch01.xhtml
[10/Sep/2020 14:07:03] Created: part02.xhtml
[10/Sep/2020 14:07:03] Created: ch02.xhtml
[10/Sep/2020 14:07:04] Created: ch03.xhtml
[10/Sep/2020 14:07:04] Parser: book content's corrupted or not present: ch04.html (4. Text versus Bytes)
[10/Sep/2020 14:07:04] Last request done:
	URL: https://learning.oreilly.com/api/v1/book/9781492056348/chapter-content/ch04.html
	DATA: None
	OTHERS: {}

	200
	Connection: keep-alive
	Content-Length: 59142
	Server: openresty/1.17.8.2
	Content-Type: text/html; charset=utf-8
	Allow: GET, HEAD, OPTIONS
	X-Frame-Options: SAMEORIGIN
	ETag: W/"853ff7c0c7c3aa72e3486ea1898ec20e"
	Content-Language: en-US
	strict-transport-security: "max-age=31536000; includeSubDomains"
	x-content-type-options: nosniff
	x-xss-protection: 1; mode=block
	Content-Encoding: gzip
	Cache-Control: s-maxage=31536000
	Accept-Ranges: bytes
	Date: Thu, 10 Sep 2020 17:07:04 GMT
	Via: 1.1 varnish
	X-Client-IP: 190.162.8.22
	X-Served-By: cache-scl19422-SCL
	X-Cache: MISS
	X-Cache-Hits: 0
	X-Timer: S1599757624.246264,VS0,VE311
	Vary: Accept-Encoding

If you have any idea what's happening, I can help you fix it.

I think the page has a lot of images and icons; maybe that's where the problem is.

xleninx avatar Sep 10 '20 17:09 xleninx

https://learning.oreilly.com/library/view/advanced-engineering-mathematics/9781284105971/ also fails to download.

Bomberdash avatar Nov 23 '20 15:11 Bomberdash

Same problem here with Fluent Python 2nd Ed.

abreumatheus avatar Jul 04 '21 23:07 abreumatheus

Please upgrade lxml to the latest version.

In my case, lxml <= 4.4.2 can't parse HTML content containing mathematical Unicode characters (https://stackoverflow.com/questions/69334692/lxml-can-not-parse-html-fragment-contains-certain-unicode-character).
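For context, these mathematical characters sit outside Unicode's Basic Multilingual Plane, which appears to be what the older lxml versions choke on. A stdlib-only sanity check (the specific code point here is just an illustrative example, not one from the book):

```python
import unicodedata

# U+1D400 MATHEMATICAL BOLD CAPITAL A, a representative member of the
# Mathematical Alphanumeric Symbols block mentioned in the SO question.
ch = "\U0001D400"
print(unicodedata.name(ch))  # MATHEMATICAL BOLD CAPITAL A
print(ord(ch) > 0xFFFF)      # True: outside the Basic Multilingual Plane
```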

glasslion avatar Sep 27 '21 07:09 glasslion

A dead-ugly workaround is to download the failing file again and then parse it in a slightly different way.

Funnily enough, the object returned by the parser has the wrong type, Element, and must be converted to an HtmlElement to match the expectations of the code using it later on. For this I apply fromstring and tostring conversions, which is certainly not an efficient approach, but my lxml-fu is simply too weak. In my case this code executes rarely enough and is fast enough that I don't care.

Because the whole thing is so cheesy and I don't even understand the root cause, I don't plan to create an MR. The next best thing is to provide the patch below. To apply it, store the patch in a file and run git apply <patch file> in the safaribooks git repo. If the patch fails to apply, consider checking out version af22b43c1 (or a sufficiently compatible revision) and trying again.

Limitation: because I use the path /tmp, the hack will only work on *nix-based systems (incl. Macs), since I didn't bother to use StringIO or at least Python's tempfile module.
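For what it's worth, a portable variant would let the standard tempfile module pick the location instead of hard-coding /tmp (sketch only; portable_temp_path is a hypothetical helper, not part of the patch below):

```python
import os
import tempfile

def portable_temp_path(prefix="safaribooks-", suffix=".html"):
    # mkstemp creates the file in a platform-appropriate temp directory,
    # so this works on Windows as well as on *nix systems.
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=suffix)
    os.close(fd)  # the caller reopens the file by name
    return path

path = portable_temp_path()
print(os.path.exists(path))  # True
os.remove(path)
```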

diff --git a/safaribooks.py b/safaribooks.py
index 1d23bee..461e2ef 100755
--- a/safaribooks.py
+++ b/safaribooks.py
@@ -605,6 +605,16 @@ class SafariBooks:
 
         return root
 
+    def download_html_to_file(self, url, file_name):
+        response = self.requests_provider(url)
+        if response == 0 or response.status_code != 200:
+            self.display.exit(
+                "Crawler: error trying to retrieve this page: %s (%s)\n    From: %s" %
+                (self.filename, self.chapter_title, url)
+            )
+        with open(file_name, 'w') as file:
+            file.write(response.text)
+
     @staticmethod
     def url_is_absolute(url):
         return bool(urlparse(url).netloc)
@@ -652,17 +662,27 @@ class SafariBooks:
 
         return None
 
-    def parse_html(self, root, first_page=False):
+    def parse_html(self, root, url, first_page=False):
         if random() > 0.8:
             if len(root.xpath("//div[@class='controls']/a/text()")):
                 self.display.exit(self.display.api_error(" "))
 
         book_content = root.xpath("//div[@id='sbo-rt-content']")
         if not len(book_content):
-            self.display.exit(
-                "Parser: book content's corrupted or not present: %s (%s)" %
-                (self.filename, self.chapter_title)
-            )
+            filename = '/tmp/ch.html'
+            self.download_html_to_file(url, filename)
+            parser = etree.HTMLParser()
+            tree = etree.parse(filename, parser)
+            book_content = tree.xpath("//div[@id='sbo-rt-content']")
+            if not len(book_content):
+                self.display.exit(
+                    "Parser: book content's corrupted or not present: %s (%s)" %
+                    (self.filename, self.chapter_title)
+                )
+            # KLUDGE(KNR): When parsing this way the resulting object has type Element
+            # instead of HtmlElement. So perform a crude conversion into the right type.
+            from lxml.html import fromstring, tostring
+            book_content[0] = fromstring(tostring(book_content[0]))
 
         page_css = ""
         if len(self.chapter_stylesheets):
@@ -846,7 +867,10 @@ class SafariBooks:
                     self.display.book_ad_info = 2
 
             else:
-                self.save_page_html(self.parse_html(self.get_html(next_chapter["content"]), first_page))
+                chapter_ = next_chapter["content"]
+                html_ = self.get_html(chapter_)
+                parsed_page_ = self.parse_html(html_, chapter_, first_page)
+                self.save_page_html(parsed_page_)
 
             self.display.state(len_books, len_books - len(self.chapters_queue))

rknuus avatar Feb 04 '22 22:02 rknuus

chapter_ = next_chapter["content"]
html_ = self.get_html(chapter_)
parsed_page_ = self.parse_html(html_, chapter_, first_page)
self.save_page_html(parsed_page_)

This solved it for me.

jvmachadorj avatar Mar 22 '22 02:03 jvmachadorj

(Quoting rknuus's workaround patch from the comment above.)

This works for me. Just one fix to avoid an encoding issue:

parser = etree.HTMLParser(encoding='utf8')

astkaasa avatar Oct 07 '22 21:10 astkaasa

This works for me. Just one fix to avoid an encoding issue:

parser = etree.HTMLParser(encoding='utf8')

You also need to add from_encoding for BeautifulSoup:

tsoup = bs(txt, 'html.parser', from_encoding='utf8')
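For anyone wondering why the explicit encoding is needed: decoding UTF-8 bytes with the wrong codec garbles the text silently instead of raising an error (stdlib illustration; the sample string is made up):

```python
# UTF-8 bytes decoded with the wrong codec turn into mojibake silently,
# which is why both lxml and BeautifulSoup need the encoding spelled out.
raw = "naïve café".encode("utf-8")
print(raw.decode("utf-8"))    # naïve café
print(raw.decode("latin-1"))  # naÃ¯ve cafÃ© -- silently garbled
```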

astkaasa avatar Oct 07 '22 21:10 astkaasa