Fetching more than 20 sections?

Open mart-e opened this issue 6 years ago • 5 comments

Hello,

I am trying to grab a very long text: The birth and death of a bike company: What happened to SpeedX? | CyclingTips

While trying to fetch it on wallabag or f43.me, ~~I get only the first 20 sections of the content.~~

The end of the grabbed content is:

If you ever want to know how it feels when you’re in a start-up that goes bust – when hundreds of people lose their jobs and everything they’ve worked so hard to build – just ask a former Bluegogo or SpeedX employee. They’ll tell you; it feels like total despair.

Or, if you check the page source, it is in the class et_pb_section_19 (starting at 0), so ~~I assume the parser stops after 20 blocks of content (but maybe it is unrelated).~~

EDIT: It is not really the first 20 sections; only the 20th is actually grabbed. Investigating a bit more.

I tried using a custom siteconfig but got the same result:

title: //head/title
body: //div[hasclass('et_builder_outer_content')]

Screenshot of result

Any idea?

mart-e avatar Jun 14 '19 09:06 mart-e

Checking the debug log tab on f43.me, it seems the content is truncated during cleanupHtml, but it's weird that the content is really small in that log line ... :thinking:

j0k3r avatar Jun 17 '19 11:06 j0k3r

Thanks for looking into it. I was playing around with the XPath, and using the id to grab the text seems to work better:

title: //head/title
body: //div[@id='et_builder_outer_content']

It is strange, as the class is only present once, so it should be similar, right?

mart-e avatar Jun 17 '19 12:06 mart-e

Should be, yep.

j0k3r avatar Jun 17 '19 12:06 j0k3r

So, comparing both XPath expressions, I get:

Using hasclass:

  • Trying //div[hasclass('et_builder_outer_content')] for body (content length: 133808)
  • Using Readability
  • Detected date: 2019-06-15T11:03:39+00:00
  • Detecting body
  • Pruning content

Using @id=:

  • Trying //div[@id='et_builder_outer_content'] for body (content length: 133808)
  • XPath: found "1" with //div[@id='et_builder_outer_content']
  • Pruning content

So I guess the following code is executed and Readability is used as a fallback (and it has a bad parser):

https://github.com/j0k3r/graby/blob/39e9a8b687503fc030d4202f7f04e2e6418cef57/src/Extractor/ContentExtractor.php#L517-L527
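
To make that guess concrete, here is a rough sketch of the flow the two traces suggest. It is not graby's actual code: `extract_body`, `run_xpath` and `generic_fallback` are hypothetical names, Python is used only for illustration, and the log strings simply imitate the ones above.

```python
from typing import Callable, List


def extract_body(html: str,
                 patterns: List[str],
                 run_xpath: Callable[[str, str], List[str]],
                 generic_fallback: Callable[[str], str],
                 log: List[str]) -> str:
    """Try each configured body pattern, then fall back to a generic extractor."""
    for pattern in patterns:
        # Assumption: the logged length is that of the input document,
        # which would explain why it is identical in both traces.
        log.append(f"Trying {pattern} for body (content length: {len(html)})")
        matches = run_xpath(html, pattern)
        if matches:
            log.append(f'XPath: found "{len(matches)}" with {pattern}')
            return "".join(matches)
    # No pattern matched, so a Readability-style heuristic takes over.
    log.append("Using Readability")
    return generic_fallback(html)


def fake_xpath(html: str, pattern: str) -> List[str]:
    # Stand-in evaluator: pretends that only the @id pattern matches.
    return ["<div>full article</div>"] if "@id=" in pattern else []


def fake_readability(html: str) -> str:
    # Stand-in for the generic extractor.
    return "<div>heuristic guess</div>"


doc = "<html>...</html>"
class_log: List[str] = []
id_log: List[str] = []
class_result = extract_body(doc, ["//div[hasclass('x')]"], fake_xpath, fake_readability, class_log)
id_result = extract_body(doc, ["//div[@id='x']"], fake_xpath, fake_readability, id_log)
print("\n".join(class_log + id_log))
```

If the real flow is similar, a pattern that silently matches nothing looks exactly like the first trace: one "Trying" line followed by "Using Readability".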

But why did the XPath return 1 for the second expression but not for the first one?
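
One thing that may be relevant: `hasclass()` is not among the core functions defined by XPath 1.0, so an engine has to provide it itself, whereas `@id='...'` is plain attribute equality. The sketch below uses only the Python standard library to show that the two tests are meant to select the same element. The markup is a toy stand-in (an assumption, not the real page) and `has_class` is a hypothetical helper, not part of graby.

```python
import xml.etree.ElementTree as ET

# Toy, well-formed markup standing in for the real page (assumption).
PAGE = """
<html><body>
  <div id="et_builder_outer_content" class="et_builder_outer_content et_pb_wrap">
    <p>section 0</p>
    <p>section 19</p>
  </div>
</body></html>
"""


def has_class(element, name):
    """Return True if `name` is one of the whitespace-separated class tokens."""
    return name in (element.get("class") or "").split()


root = ET.fromstring(PAGE)

# Attribute equality is plain XPath, so any engine can evaluate it.
by_id = root.findall(".//div[@id='et_builder_outer_content']")

# The class test is done in Python here; this is the work a helper such as
# hasclass() would have to do on the engine's behalf.
by_class = [d for d in root.iter("div") if has_class(d, "et_builder_outer_content")]

print(len(by_id), len(by_class), by_id[0] is by_class[0])
```

The portable XPath 1.0 spelling of the class test is `//div[contains(concat(' ', normalize-space(@class), ' '), ' et_builder_outer_content ')]`, which might be worth trying in the siteconfig.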

mart-e avatar Jun 17 '19 12:06 mart-e

Are you sure about the quoted code?

j0k3r avatar Jun 17 '19 12:06 j0k3r