Site config file not working

Open frankhubrepo opened this issue 3 years ago • 2 comments

I am trying to fetch the content from this article: https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report

However, since that doesn't work, I tried adding a site config file as described here: https://doc.wallabag.org/en/user/errors_during_fetching.html

This is the code within the config file:

title://body//h1[@class="headline"]

body://body//div[contains(@class, "field-type-text-with-summary")]

test_url: https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report

Even with this config I still don't get the content, and I know the query is right because it matches in the browser console:

image

Also here is the log:

[2021-05-06 19:40:53] graby.INFO: Graby is ready to fetch [] []
[2021-05-06 19:40:53] graby.INFO: . looking for site config for {host} in primary folder {"host":"businesstimes.com.sg"} []
[2021-05-06 19:40:53] graby.INFO: ... found site config {host} {"host":"businesstimes.com.sg.txt"} []
[2021-05-06 19:40:53] graby.INFO: Appending site config settings from global.txt [] []
[2021-05-06 19:40:53] graby.INFO: . looking for site config for {host} in primary folder {"host":"global"} []
[2021-05-06 19:40:53] graby.INFO: ... found site config {host} {"host":"global.txt"} []
[2021-05-06 19:40:53] graby.INFO: Cached site config with key: {key} {"key":"businesstimes.com.sg"} []
[2021-05-06 19:40:53] graby.INFO: . looking for site config for {host} in primary folder {"host":"global"} []
[2021-05-06 19:40:53] graby.INFO: ... found site config {host} {"host":"global.txt"} []
[2021-05-06 19:40:53] graby.INFO: Appending site config settings from global.txt [] []
[2021-05-06 19:40:53] graby.INFO: Cached site config with key: {key} {"key":"global"} []
[2021-05-06 19:40:53] graby.INFO: Cached site config with key: {key} {"key":"businesstimes.com.sg.merged"} []
[2021-05-06 19:40:53] graby.INFO: Fetching url: {url} {"url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report"} []
[2021-05-06 19:40:53] graby.INFO: Trying using method "{method}" on url "{url}" {"method":"get","url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report"} []
[2021-05-06 19:40:53] graby.INFO: Use default user-agent "{user-agent}" for url "{url}" {"user-agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2","url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report"} []
[2021-05-06 19:40:53] graby.INFO: Use default referer "{referer}" for url "{url}" {"referer":"http://www.google.co.uk/url?sa=t&source=web&cd=1","url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report"} []
[2021-05-06 19:41:02] graby.INFO: Data fetched: {data} {"data":{"effective_url":"https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report","body":"(only length for debug): 152622","headers":{"alt-svc":"clear","cache-control":"no-cache, no-store, must-revalidate","content-type":"text/html; charset=UTF-8","date":"Thu, 06 May 2021 17:40:53 GMT","expires":"0","istl-response":"1","pragma":"no-cache","referrer-policy":"no-referrer-when-downgrade, no-referrer-when-downgrade","server":"ECD (sgb/C7A3)","via":"1.1 google","x-ion-hop":"true","x-vmg-version":"v2.3.21","content-length":"152622"},"status":200}} []
[2021-05-06 19:41:02] graby.INFO: Treating as UTF-8 {"encoding":"utf-8"} []
[2021-05-06 19:41:03] graby.INFO: Looking for site config files to see if single page link exists [] []
[2021-05-06 19:41:03] graby.INFO: Returning cached and merged site config for {host} {"host":"businesstimes.com.sg"} []
[2021-05-06 19:41:03] graby.INFO: No "single_page_link" config found [] []
[2021-05-06 19:41:03] graby.INFO: Attempting to extract content [] []
[2021-05-06 19:41:03] graby.INFO: Returning cached and merged site config for {host} {"host":"businesstimes.com.sg"} []
[2021-05-06 19:41:03] graby.INFO: Strings replaced: {count} (find_string and/or replace_string) {"count":0} []
[2021-05-06 19:41:03] graby.INFO: Attempting to parse HTML with {parser} {"parser":"libxml"} []
[2021-05-06 19:41:03] graby.INFO: Body size after Readability: {length} {"length":96} []
[2021-05-06 19:41:03] graby.INFO: Opengraph "og:" data: {ogData} {"ogData":[]} []
[2021-05-06 19:41:03] graby.INFO: Opengraph "article:" data: {ogData} {"ogData":[]} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for title {"pattern":"//body//h1[@class=\"headline\"]"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for title {"pattern":"//meta[@property=\"og:title\"]/@content"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for date {"pattern":"//meta[@property=\"article:published_time\"]/@content"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for language {"pattern":"//html[@lang]/@lang"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for language {"pattern":"//meta[@name=\"DC.language\"]/@content"} []
[2021-05-06 19:41:03] graby.INFO: Trying {pattern} for body (content length: {content_length}) {"pattern":"//body//div[contains(@class, \"field-type-text-with-summary\")]","content_length":96} []
[2021-05-06 19:41:03] graby.INFO: Using Readability [] []
[2021-05-06 19:41:03] graby.INFO: Date is bad (strtotime failed): {date} {"date":null} []
[2021-05-06 19:41:03] graby.INFO: Success ? {is_success} {"is_success":false} []
[2021-05-06 19:41:03] graby.INFO: Extract failed [] []

Any insight on what could be happening here or something I'm missing?

frankhubrepo, May 06 '21 17:05

I recently tried writing site configs for a wallabag server and ran into an XPath problem similar to this one. You should check log/html.log. Graby uses php-readability to process the HTML, and it strips and flattens many tags for readability. This means the XPaths that work against the processed HTML are not necessarily the ones that work in a browser, so you can't always copy browser XPaths into the site config directly.

In my case, I wanted to extract the "real" author and the "real" title from an article on a website, but I got nothing after processing, even though the XPaths I used work correctly in Chrome and Firefox. I couldn't use https://siteconfig.fivefilters.org/ because it didn't show the CSS and XPath bar at the bottom when I tested those websites.

Put the debug settings in your test script (for example some-graby-test.php) and run it:

use Graby\Graby;

// Enable debug logging, including the HTML dumped during processing
$graby = new Graby([
    'debug' => true,
    'log_level' => 'debug',
]);

Then you can inspect the log/html.log file.
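
The mismatch described above can be checked outside Graby by evaluating the same XPath against two HTML strings with PHP's built-in DOMDocument and DOMXPath. This is only a minimal sketch: the helper name countXPathMatches and both HTML snippets are made up for illustration, and it does not reproduce the processing Graby or php-readability perform.

```php
<?php
// Hypothetical helper: count the nodes an XPath expression matches in an HTML string.
function countXPathMatches(string $html, string $expression): int
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them for imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($expression);

    // query() returns false when the expression is malformed.
    return $nodes === false ? 0 : $nodes->length;
}

// Made-up markup standing in for the DOM a browser shows.
$browserDom = '<html><body><h1 class="headline">Title</h1>'
    . '<div class="field field-type-text-with-summary"><p>Article text</p></div>'
    . '</body></html>';

// Made-up markup standing in for a response that lacks the article container.
$fetchedHtml = '<html><body><div id="app"></div></body></html>';

// The body pattern from the site config in this issue.
$bodyPattern = '//body//div[contains(@class, "field-type-text-with-summary")]';

echo countXPathMatches($browserDom, $bodyPattern), "\n";  // prints 1
echo countXPathMatches($fetchedHtml, $bodyPattern), "\n"; // prints 0
```

To use it on a real case, replace $fetchedHtml with the HTML Graby actually received. If the count is 0 there, the pattern cannot match no matter how well it works in the browser console.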

hwiorn, May 30 '21 14:05

The problem is that Graby is retrieving this HTML: response.html.txt, which is definitely not the HTML you are querying from your browser console.

Maybe we need to add a cookie to the request. I've tried a few without success.
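
One way to see what a non-browser client receives is to repeat the request with the user-agent and referer shown in the log above and save the response for inspection. This is a rough sketch using PHP's stream wrappers, not Graby's own HTTP client, so the response may still differ from what Graby gets. The helper names grabyLikeContextOptions and fetchLikeGraby are hypothetical.

```php
<?php
// Hypothetical helper: stream-context options mimicking the headers from the log.
function grabyLikeContextOptions(): array
{
    return [
        'http' => [
            'method' => 'GET',
            'header' => [
                'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.92 Safari/535.2',
                'Referer: http://www.google.co.uk/url?sa=t&source=web&cd=1',
            ],
            // Keep the response body even when the status is not 2xx.
            'ignore_errors' => true,
        ],
    ];
}

// Hypothetical helper: fetch a URL with those headers; returns null on failure.
function fetchLikeGraby(string $url): ?string
{
    $context = stream_context_create(grabyLikeContextOptions());
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}

// Usage (performs a network request, so it is left commented out):
// $html = fetchLikeGraby('https://www.businesstimes.com.sg/government-economy/malaysia-likely-to-impose-fresh-mco-in-kl-other-areas-report');
// if ($html !== null) {
//     file_put_contents('response.html', $html);
// }
```

Opening the saved file and searching for the class names used in the site config shows whether the article markup is present in that response at all. To experiment with a cookie, a 'Cookie: name=value' line could be added to the header list; which cookie the site expects, if any, is not known from this thread.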

j0k3r, Oct 05 '21 12:10