graby
Fetching more than 20 sections?
Hello,
I am trying to grab a very long text: The birth and death of a bike company: What happened to SpeedX? | CyclingTips
While trying to fetch it on wallabag or f43.me, ~~I get only the first 20 sections of the content.~~
The end of the grabbed content is:
If you ever want to know how it feels when you’re in a start-up that goes bust – when hundreds of people lose their jobs and everything they’ve worked so hard to build – just ask a former Bluegogo or SpeedX employee. They’ll tell you; it feels like total despair.
Or, if you check the page source, it is in the class et_pb_section_19 (starting at 0), so ~~I assume the parser stops after 20 blocks of content (but maybe it is unrelated).~~
EDIT: Not really 20 sections, just that only the 20th is actually grabbed, investigating a bit more
I tried using a custom siteconfig but got the same result
title: //head/title
body: //div[hasclass('et_builder_outer_content')]
Any idea?
I checked the debug log tab on f43.me and it seems the content is truncated during cleanupHtml, but it's weird that the content is really small in that log line ... :thinking:
Thanks for looking into it. I was playing around with the xpath and using the id to grab the text seems to work better
title: //head/title
body: //div[@id='et_builder_outer_content']
It is strange, as the class is only present once, so it should be similar, right?
Should be, yep.
So, comparing both xpath, I get:
Using hasclass:
- Trying `//div[hasclass('et_builder_outer_content')]` for body (content length: 133808)
- Using Readability
- Detected date: 2019-06-15T11:03:39+00:00
- Detecting body
- Pruning content
Using @id=:
- Trying `//div[@id='et_builder_outer_content']` for body (content length: 133808)
- XPath: found "1" with `//div[@id='et_builder_outer_content']`
- Pruning content
So I guess the following code is executed and Readability is used as a fallback (and has a bad parser)
https://github.com/j0k3r/graby/blob/39e9a8b687503fc030d4202f7f04e2e6418cef57/src/Extractor/ContentExtractor.php#L517-L527
But why did the XPath return 1 for the second expression but not for the first one?
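One way to narrow this down is to count, on a saved copy of the page, how many `div` elements carry the name as an `id` and how many carry it as a class token. The following is only a minimal sketch in Python using the standard library; it does not use graby or its XPath handling, and the `DivCounter` class, the `count_matches` helper, and the sample markup are all hypothetical, not taken from the real page.

```python
from html.parser import HTMLParser


class DivCounter(HTMLParser):
    """Count <div> elements that match a name by id and by class token."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.by_id = 0
        self.by_class = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attributes = dict(attrs)
        # Rough equivalent of //div[@id='name']
        if attributes.get("id") == self.name:
            self.by_id += 1
        # Rough equivalent of a class-token test: split the class
        # attribute on whitespace and look for an exact token match.
        if self.name in (attributes.get("class") or "").split():
            self.by_class += 1


def count_matches(html, name):
    """Return (matches by id, matches by class token) for <div> elements."""
    counter = DivCounter(name)
    counter.feed(html)
    return counter.by_id, counter.by_class


# Hypothetical markup for illustration; replace with the saved page source.
sample = """
<div id="et_builder_outer_content" class="et_builder_outer_content">
  <div class="et_pb_section et_pb_section_0">first</div>
  <div class="et_pb_section et_pb_section_19">last</div>
</div>
"""

print(count_matches(sample, "et_builder_outer_content"))
```

If both counts are 1 on the real page source, the two selectors target the same element, and the difference would more likely come from how the class-based expression is evaluated than from the markup itself.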
Are you sure about the quoted code?