constantly getting TargetCloseError ProtocolError on specific page i want to zim

Open clydzik opened this issue 8 months ago • 11 comments

Hello. Using pretty recent zimit version:

ghcr.io/openzim/zimit       latest    24d0e3419bf1   5 weeks ago     3.81GB

I am trying to crawl a site. I have tried changing parameters and adding delays to be polite, but I still cannot figure out whether I'm simply banned. While the errors occur, I can still open the affected pages in a separate browser.

Also, when I open the page I see that requests to some domains time out, for example:

https://hg1.hitbox.com/HG?hc=w153&l=y&hb=WQ500615O5SF28EN0&cd=1&n=DATABASE

I tried to mitigate this by not waiting for pages to load fully: --waitUntil domcontentloaded

This is more or less my list of arguments (truncated to the important ones):

--pageLoadTimeout 15 --behaviorTimeout 31 --waitUntil domcontentloaded --pageExtraDelay 5 --workers 1

These are the logs around the moment I start receiving the strange ProtocolError:

{"timestamp":"2025-05-20T12:20:55.314Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":487,"total":1377,"pen
ding":1,"failed":0,"limit":{"max":10000,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2025-05-20T12:20:55.309Z\",\"extraHops\":0,\"url\":\"ht
tps:\\/\\/gb64.com\\/oldsite\\/gameofweek\\/17\\/util_speech.htm\",\"added\":\"2025-05-20T10:13:06.963Z\",\"depth\":2}"]}}
{"timestamp":"2025-05-20T12:21:10.330Z","logLevel":"error","context":"fetch","message":"Direct fetch of page URL timed out","details":{"seconds":15,"page
":"https://gb64.com/oldsite/gameofweek/17/util_speech.htm","workerid":0}}
{"timestamp":"2025-05-20T12:21:10.366Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://gb64.com/f
orum/search.php?search_id=unanswered&sid=87de961a338d260a59775a7bfaea55cc"}}
{"timestamp":"2025-05-20T12:21:10.371Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":487,"total":1377,"pen
ding":1,"failed":0,"limit":{"max":10000,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2025-05-20T12:21:10.365Z\",\"extraHops\":0,\"url\":\"ht
tps:\\/\\/gb64.com\\/forum\\/search.php?search_id=unanswered&sid=87de961a338d260a59775a7bfaea55cc\",\"added\":\"2025-05-20T10:13:26.133Z\",\"depth\":2}"]
}}
{"timestamp":"2025-05-20T12:21:25.387Z","logLevel":"error","context":"fetch","message":"Direct fetch of page URL timed out","details":{"seconds":15,"page
":"https://gb64.com/forum/search.php?search_id=unanswered&sid=87de961a338d260a59775a7bfaea55cc","workerid":0}}
{"timestamp":"2025-05-20T12:21:25.421Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://gb64.com/f
orum/search.php?search_id=active_topics&sid=87de961a338d260a59775a7bfaea55cc"}}
{"timestamp":"2025-05-20T12:21:25.423Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":487,"total":1377,"pen
ding":1,"failed":0,"limit":{"max":10000,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2025-05-20T12:21:25.420Z\",\"extraHops\":0,\"url\":\"ht
tps:\\/\\/gb64.com\\/forum\\/search.php?search_id=active_topics&sid=87de961a338d260a59775a7bfaea55cc\",\"added\":\"2025-05-20T10:13:26.136Z\",\"depth\":2
}"]}}
{"timestamp":"2025-05-20T12:21:40.440Z","logLevel":"error","context":"fetch","message":"Direct fetch of page URL timed out","details":{"seconds":15,"page
":"https://gb64.com/forum/search.php?search_id=active_topics&sid=87de961a338d260a59775a7bfaea55cc","workerid":0}}
{"timestamp":"2025-05-20T12:21:40.470Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://gb64.com/f
orum/search.php?sid=87de961a338d260a59775a7bfaea55cc"}}
{"timestamp":"2025-05-20T12:21:40.472Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":487,"total":1377,"pen
ding":1,"failed":0,"limit":{"max":10000,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2025-05-20T12:21:40.469Z\",\"extraHops\":0,\"url\":\"ht
tps:\\/\\/gb64.com\\/forum\\/search.php?sid=87de961a338d260a59775a7bfaea55cc\",\"added\":\"2025-05-20T10:13:26.139Z\",\"depth\":2}"]}}
{"timestamp":"2025-05-20T12:21:55.488Z","logLevel":"error","context":"fetch","message":"Direct fetch of page URL timed out","details":{"seconds":15,"page
":"https://gb64.com/forum/search.php?sid=87de961a338d260a59775a7bfaea55cc","workerid":0}}
{"timestamp":"2025-05-20T12:21:55.518Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://gb64.com/f
orum/app.php/help/faq?sid=87de961a338d260a59775a7bfaea55cc"}}
{"timestamp":"2025-05-20T12:21:55.523Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":487,"total":1377,"pen
ding":1,"failed":0,"limit":{"max":10000,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2025-05-20T12:21:55.517Z\",\"extraHops\":0,\"url\":\"ht
tps:\\/\\/gb64.com\\/forum\\/app.php\\/help\\/faq?sid=87de961a338d260a59775a7bfaea55cc\",\"added\":\"2025-05-20T10:13:26.140Z\",\"depth\":2}"]}}
{"timestamp":"2025-05-20T12:22:10.539Z","logLevel":"error","context":"fetch","message":"Direct fetch of page URL timed out","details":{"seconds":15,"page
":"https://gb64.com/forum/app.php/help/faq?sid=87de961a338d260a59775a7bfaea55cc","workerid":0}}
{"timestamp":"2025-05-20T12:22:10.626Z","logLevel":"warn","context":"recorder","message":"Error getting cookies","details":{"page":"https://gb64.com/olds
ite/gameofweek/6/gotw_scoop!.htm","e":{"name":"TargetCloseError","cause":{"name":"ProtocolError"}}}}
{"timestamp":"2025-05-20T12:22:10.627Z","logLevel":"warn","context":"recorder","message":"Error getting cookies","details":{"page":"https://gb64.com/olds
ite/gameofweek/7/gotw_adventureconstrset.htm","e":{"name":"TargetCloseError","cause":{"name":"ProtocolError"}}}}
{"timestamp":"2025-05-20T12:22:10.627Z","logLevel":"warn","context":"recorder","message":"Error getting cookies","details":{"page":"https://gb64.com/olds
ite/gameofweek/7/gotw_robinofsherwood.htm","e":{"name":"TargetCloseError","cause":{"name":"ProtocolError"}}}}
{"timestamp":"2025-05-20T12:22:10.628Z","logLevel":"warn","context":"recorder","message":"Error getting cookies","details":{"page":"https://gb64.com/olds
ite/gameofweek/7/gotw_starcross.htm","e":{"name":"TargetCloseError","cause":{"name":"ProtocolError"}}}}
{"timestamp":"2025-05-20T12:22:10.629Z","logLevel":"warn","context":"recorder","message":"Error getting cookies","details":{"page":"https://gb64.com/olds
ite/gameofweek/7/gotw_wizardandprincess.htm","e":{"name":"TargetCloseError","cause":{"name":"ProtocolError"}}}}
{"timestamp":"2025-05-20T12:22:10.630Z","logLevel":"warn","context":"recorder","message":"Error getting cookies","details":{"page":"https://gb64.com/olds
ite/gameofweek/8/gotw_magiciansball.htm","e":{"name":"TargetCloseError","cause":{"name":"ProtocolError"}}}}

I also got a single, more informative error:

{"timestamp":"2025-05-20T12:22:10.767Z","logLevel":"warn","context":"behavior","message":"Behavior run partially failed","details":{"reason":{"type":"exception","message":"Protocol error (Runtime.evaluate): Target closed","stack":"TargetCloseError: Protocol error (Runtime.evaluate): Target closed\n    at CallbackRegistry.clear (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:77:36)\n    at CdpCDPSession._onClosed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/CDPSession.js:106:25)\n    at Connection.onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Connection.js:130:25)\n    at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/node/NodeWebSocketTransport.js:38:32)\n    at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)\n    at WebSocket.onMessage (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:209:9)\n    at WebSocket.emit (node:events:524:28)\n    at Receiver.receiverOnMessage (/app/node_modules/puppeteer-core/node_modules/ws/lib/websocket.js:1220:20)\n    at Receiver.emit (node:events:524:28)\n    at Immediate.<anonymous> (/app/node_modules/puppeteer-core/node_modules/ws/lib/receiver.js:601:16)"},"page":"https://gb64.com/oldsite/gameofweek/6/gotw_redmoon.htm","workerid":0}}
{"timestamp":"2025-05-20T12:22:15.809Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"http://gb64.com/"}}

It seems that before I get this ProtocolError I get a lot of timeouts. The crawl also seems to be stuck: the statistics stay at 487 crawled pages.

So am I simply banned, or is there an issue in the crawl, maybe caused by this hanging request to hg1.hitbox.com?
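
To make the "timeouts first, then TargetCloseError" pattern easier to check on a long log, here is a minimal Python sketch. It assumes one JSON object per line (the wrapped lines pasted above would have to be rejoined first); the function name `summarize` and the two-line `SAMPLE_LOG` (two entries taken from the log above) are only for illustration.

```python
import json

TIMEOUT_MSG = "Direct fetch of page URL timed out"

# Two entries from the log above, rejoined into single lines.
SAMPLE_LOG = [
    '{"timestamp":"2025-05-20T12:22:10.539Z","logLevel":"error","context":"fetch","message":"Direct fetch of page URL timed out","details":{"seconds":15,"page":"https://gb64.com/forum/app.php/help/faq?sid=87de961a338d260a59775a7bfaea55cc","workerid":0}}',
    '{"timestamp":"2025-05-20T12:22:10.626Z","logLevel":"warn","context":"recorder","message":"Error getting cookies","details":{"page":"https://gb64.com/oldsite/gameofweek/6/gotw_scoop!.htm","e":{"name":"TargetCloseError","cause":{"name":"ProtocolError"}}}}',
]


def summarize(log_lines):
    """Return (fetch timeouts seen before the first TargetCloseError, its timestamp).

    The timestamp is None if no TargetCloseError entry is found.
    """
    timeouts_before = 0
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or non-JSON lines
        if not isinstance(entry, dict):
            continue
        details = entry.get("details")
        err = details.get("e") if isinstance(details, dict) else None
        if isinstance(err, dict) and err.get("name") == "TargetCloseError":
            return timeouts_before, entry.get("timestamp")
        if entry.get("message") == TIMEOUT_MSG:
            timeouts_before += 1
    return timeouts_before, None


if __name__ == "__main__":
    print(summarize(SAMPLE_LOG))
```

A high timeout count right before the first TargetCloseError would be consistent with the browser already being unresponsive at that point, but it does not prove the cause.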

clydzik avatar May 20 '25 13:05 clydzik

To me this looks more like a communication problem between the crawler and the Chromium browser. It might have a variety of causes, but basically the browser seems to have crashed. It could be that the website is using too many resources, exhausting memory due to bad JS, ...

benoit74 avatar May 20 '25 13:05 benoit74

For now I can confirm that the problem is in the subpages. On some page the layout links to the old site, and the crawler follows it.

As written above, an example is https://gb64.com/oldsite/gameofweek/6/gotw_redmoon.htm

These old pages do not look very complex. They are pretty old, but they make a lot of hits that never complete, like: https://hg1.hitbox.com/HG?hc=w153&l=y&hb=WQ500615O5SF28EN0&cd=1&n=DATABASE

So, as you wrote, either

  • the page is complex and crashes the browser at some point, or
  • these dangling requests eventually crash the browser, like https://hg1.hitbox.com/HG?hc=w153&l=y&hb=WQ500615O5SF28EN0&cd=1&n=DATABASE

When I excluded it with: --scopeExcludeRx='.*oldsite.*'

the crawl continued without any issues.

As far as I can tell, there is no way to skip specific URLs from loading when a page is loaded during the crawl?

clydzik avatar May 22 '25 10:05 clydzik

You should be able to block any request from a page to hitbox.com with blockRules:

      --blockRules    Additional rules for blocking certain URLs from being
                      loaded, by URL regex and optionally via text match in
                      an iframe                       [array] [default: []]

Never used them myself so I'm not certain which format it should have, but this is supposed to work.

Some documentation is available at https://crawler.docs.browsertrix.com/user-guide/crawl-scope/#page-resource-block-rules

benoit74 avatar May 22 '25 11:05 benoit74

To be honest, I'm not sure whether I'm using --blockRules correctly or whether it is not working as expected, like in this example: https://github.com/webrecorder/browsertrix-crawler/issues/574

It seems it blocks all resources.

I use it like this:

--blockRules ['https://hg1.hitbox.com','https://cloud.cbm8bit.com']

but I also tried some other variants.

It seems the params are passed:

[zimit::2025-05-24 15:43:32,252] INFO:Running browsertrix-crawler crawl: crawl --title gamebase 64 --description An attempt to document ALL Commodore 64 gameware before its too late --workers 2 --waitUntil domcontentloaded --depth 1 --pageLoadTimeout 91 --blockRules [https://hg1.hitbox.com,https://cloud.cbm8bit.com] --behaviorTimeout 31 --diskUtilization 90 --seeds https://gb64.com --userAgentSuffix +Zimit --cwd /output/.tmp2ljdqp33

Then I get errors:

{"timestamp":"2025-05-24T15:43:40.264Z","logLevel":"warn","context":"blocking","message":"Block rule match for page request ignored, set --exclude to block full pages","details":{"url":"https://gb64.com/","page":"about:blank?_browsertrixh17wu74mqz"}}
{"timestamp":"2025-05-24T15:43:40.631Z","logLevel":"warn","context":"recorder","message":"Skipping URL from unknown frame","details":{"url":"https://gb64.com/","frameId":"695F1629F140DC66E164FE250ABE7749"}}
{"timestamp":"2025-05-24T15:43:42.168Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://gb64.com/images/c64top/cornerfiller.gif","errorText":"net::ERR_BLOCKED_BY_CLIENT.Inspector","type":"Image","status":0,"page":"https://gb64.com/","workerid":0}}

Basically, I get a lot of ERR_BLOCKED_BY_CLIENT errors, and the pages are there but with no resources.
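
One possible explanation, not confirmed anywhere in this thread: if the whole bracketed value reaches the crawler as a single regex, the square brackets turn it into a character class, which matches any URL containing at least one of the listed characters. The Python sketch below only demonstrates that regex behavior (character classes work the same way in JavaScript regexes). That the crawler interprets the value this way is an assumption, although the "Block rule match for page request" warning for https://gb64.com/ above would be consistent with it. The helper name `rule_matches` is made up for illustration.

```python
import re

# The value exactly as it appears in the logged crawler command above.
PASSED_VALUE = "[https://hg1.hitbox.com,https://cloud.cbm8bit.com]"

# A resource that was blocked even though it is on neither listed host.
UNRELATED_URL = "https://gb64.com/images/c64top/cornerfiller.gif"


def rule_matches(pattern, url):
    """Return True if the pattern matches anywhere in the URL (unanchored search)."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    # As a regex, the bracketed value is a character class: one matching
    # character anywhere in the URL (here the leading "h") is enough.
    print(rule_matches(PASSED_VALUE, UNRELATED_URL))
    # A plain URL prefix used as the pattern does not match the unrelated host.
    print(rule_matches("https://hg1.hitbox.com", UNRELATED_URL))
```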

🤔

clydzik avatar May 24 '25 15:05 clydzik

Yeah, maybe I wrote too fast. Passing --blockRules once per URL seems to do the trick:

--blockRules 'https://hg1.hitbox.com' --blockRules 'https://cloud.cbm8bit.com'

I will do a bigger crawl to confirm whether this helps and prevents the browser from crashing on the oldsite problem mentioned earlier...

clydzik avatar May 24 '25 16:05 clydzik

Unfortunately, at some point the browser still seems to crash:

{"timestamp":"2025-05-24T16:49:56.208Z","logLevel":"warn","context":"behavior","message":"Behaviors timed out","details":{"seconds":31,"page":"https://gb64.com/oldsite/gameofweek/4/americanfeature/gotw_stripoker.htm","workerid":1}}
{"timestamp":"2025-05-24T16:49:58.716Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":3,"page":"https://gb64.com/oldsite/gameofweek/4/americanfeature/gotw_stripoker.htm","workerid":1}}
{"timestamp":"2025-05-24T16:49:58.764Z","logLevel":"warn","context":"behavior","message":"Behavior run partially failed","details":{"reason":{"type":"exception","message":"Protocol error (Runtime.evaluate): Target closed","stack":"TargetCloseError: Protocol error (Runtime.evaluate): Target closed\n    at CallbackRegistry.clear (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:77:36)\n    at CdpCDPSession._onClosed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/CDPSession.js:106:25)\n    at Connection.onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Connection.js:130:25)\n    at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/node/NodeWebSocketTransport.js:38:32)\n    at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)\n    at WebSocket.onMessage (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:209:9)\n    at WebSocket.emit (node:events:524:28)\n    at Receiver.receiverOnMessage (/app/node_modules/puppeteer-core/node_modules/ws/lib/websocket.js:1220:20)\n    at Receiver.emit (node:events:524:28)\n    at Immediate.<anonymous> (/app/node_modules/puppeteer-core/node_modules/ws/lib/receiver.js:601:16)"},"page":"https://gb64.com/oldsite/gameofweek/4/americanfeature/gotw_stripoker.htm","workerid":1}}

And later every page looks like this:

{"timestamp":"2025-05-24T16:56:06.356Z","logLevel":"error","context":"fetch","message":"Direct fetch of page URL timed out","details":{"seconds":91,"page":"https://gb64.com/oldsite/gameofweek/5/americanfeature2/gotw_track&field.htm","workerid":1}}
{"timestamp":"2025-05-24T16:56:06.401Z","logLevel":"warn","context":"recorder","message":"Error getting cookies","details":{"page":"https://gb64.com/oldsite/gameofweek/5/americanfeature2/gotw_track&field.htm","e":{"name":"TargetCloseError","cause":{"name":"ProtocolError"}}}}

clydzik avatar May 24 '25 16:05 clydzik

Did you manage to confirm, by looking at the WARC content, that it really blocks the URLs you do not want to load? If yes, then you probably have something else causing a browser crash...

benoit74 avatar May 26 '25 09:05 benoit74

It seems so:

{"timestamp":"2025-05-26T09:29:15.692Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://cloud.cbm8bit.com/zzap/1200-plain_grey4.png","errorText":"net::ERR_BLOCKED_BY_CLIENT.Inspector","type":"Image","status":0,"page":"https://gb64.com/forum/index.php","workerid":0}}

But only for the second argument. When I pass two of them:

--blockRules 'https://hg1.hitbox.com' --blockRules 'https://cloud.cbm8bit.com'

only the second one is passed (the first is somehow overridden).

This is the run command from the logs:

INFO:Running browsertrix-crawler crawl: crawl --title gamebase 64 --description An attempt to document ALL Commodore 64 gameware before its too late --workers 2 --waitUntil domcontentloaded --depth 10 --pageLoadTimeout 91 --scopeExcludeRx .*oldsite.* --blockRules https://cloud.cbm8bit.com --behaviorTimeout 31 --saveState always --diskUtilization 90 --seeds https://gb64.com --userAgentSuffix +Zimit --cwd /output/.tmp23327r06

So I'm not sure how to pass two URLs other than combining them into one regex (if that works).
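
A minimal Python sketch of the single-regex idea. It assumes the crawler applies the rule as a regex search on the request URL, which this thread does not confirm; the function name `combined_block_pattern` is made up for illustration. The syntax used (escaped dots, a non-capturing group with alternation) is valid in both Python and JavaScript regexes.

```python
import re

# The two hosts from the --blockRules attempts above.
BLOCKED_HOSTS = ["hg1.hitbox.com", "cloud.cbm8bit.com"]


def combined_block_pattern(hosts):
    """Build one regex matching http(s) URLs on any of the given hosts."""
    alternatives = "|".join(re.escape(host) for host in hosts)
    return r"^https?://(?:" + alternatives + r")/"


if __name__ == "__main__":
    pattern = combined_block_pattern(BLOCKED_HOSTS)
    print(pattern)
    # URLs taken from the logs in this thread.
    for url in [
        "https://hg1.hitbox.com/HG?hc=w153&l=y&hb=WQ500615O5SF28EN0&cd=1&n=DATABASE",
        "https://cloud.cbm8bit.com/zzap/1200-plain_grey4.png",
        "https://gb64.com/images/c64top/cornerfiller.gif",
    ]:
        print(url, bool(re.search(pattern, url)))
```

If the assumption holds, the printed pattern could be passed as a single --blockRules value (quoted for the shell), which would sidestep the second flag overriding the first.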

Anyway, after crawling 70k pages with the whole oldsite excluded, I still get an error, but it seems a bit different now:

{"timestamp":"2025-05-26T06:51:18.467Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://gb64.com/forum/uc
p.php?mode=login&redirect=viewtopic.php%3Fstart%3D45%26t%3D4718","workerid":0}}
{"timestamp":"2025-05-26T06:51:18.923Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://cloud.cbm8bit.com/zza
p/gb64_forum_background.png","errorText":"net::ERR_BLOCKED_BY_CLIENT.Inspector","type":"Image","status":0,"page":"https://gb64.com/forum/ucp.php?mode=log
in&redirect=viewtopic.php%3Fstart%3D45%26t%3D4718","workerid":0}}
{"timestamp":"2025-05-26T06:51:18.944Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://cloud.cbm8bit.com/zza
p/1200-plain_grey4.png","errorText":"net::ERR_BLOCKED_BY_CLIENT.Inspector","type":"Image","status":0,"page":"https://gb64.com/forum/ucp.php?mode=login&re
direct=viewtopic.php%3Fstart%3D45%26t%3D4718","workerid":0}}
{"timestamp":"2025-05-26T06:51:19.602Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":4,"page":"https://gb64.
com/forum/viewtopic.php?p=19228","workerid":1}}
{"timestamp":"2025-05-26T06:51:19.626Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":1,"page":"https://gb64.com/f
orum/viewtopic.php?t=4718&start=45&view=print"}}
{"timestamp":"2025-05-26T06:51:19.628Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":72806,"total":156432,
"pending":2,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2025-05-26T06:51:19.626Z\",\"extraHops\":0,\"url\":\"ht
tps:\\/\\/gb64.com\\/forum\\/viewtopic.php?t=4718&start=45&view=print\",\"added\":\"2025-05-25T03:15:39.085Z\",\"depth\":5}","{\"seedId\":0,\"started\":\
"2025-05-26T06:51:18.348Z\",\"extraHops\":0,\"url\":\"https:\\/\\/gb64.com\\/forum\\/ucp.php?mode=login&redirect=viewtopic.php%3Fstart%3D45%26t%3D4718\",
\"added\":\"2025-05-25T03:15:39.014Z\",\"depth\":5}"]}}
{"timestamp":"2025-05-26T06:51:19.727Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://gb64.com/forum/vi
ewtopic.php?t=4718&start=45&view=print","workerid":1}}
{"timestamp":"2025-05-26T06:51:20.097Z","logLevel":"warn","context":"general","message":"Invalid Page - URL must start with http:// or https://","details
":{"url":"javascript:void(0);","page":"https://gb64.com/forum/ucp.php?mode=login&redirect=viewtopic.php%3Fstart%3D45%26t%3D4718","workerid":0}}
{"timestamp":"2025-05-26T06:51:27.777Z","logLevel":"error","context":"general","message":"Custom page load check timed out","details":{"seconds":5,"page"
:"https://gb64.com/forum/viewtopic.php?t=4718&start=45&view=print","workerid":1}}
{"timestamp":"2025-05-26T06:51:32.784Z","logLevel":"error","context":"general","message":"Link extraction timed out","details":{"seconds":5,"page":"https
://gb64.com/forum/viewtopic.php?t=4718&start=45&view=print","workerid":1}}
{"timestamp":"2025-05-26T06:51:37.792Z","logLevel":"error","context":"general","message":"Timed out getting page title, something is likely wrong","details":{"seconds":5,"page":"https://gb64.com/forum/viewtopic.php?t=4718&start=45&view=print","workerid":1}}
{"timestamp":"2025-05-26T06:51:51.182Z","logLevel":"warn","context":"behavior","message":"Behaviors timed out","details":{"seconds":31,"page":"https://gb64.com/forum/ucp.php?mode=login&redirect=viewtopic.php%3Fstart%3D45%26t%3D4718","workerid":0}}
{"timestamp":"2025-05-26T06:51:52.185Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":3,"page":"https://gb64.com/forum/ucp.php?mode=login&redirect=viewtopic.php%3Fstart%3D45%26t%3D4718","workerid":0}}
{"timestamp":"2025-05-26T06:51:52.212Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://gb64.com/forum/viewtopic.php?t=4718&start=75"}}
{"timestamp":"2025-05-26T06:51:52.215Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":72807,"total":156433,"pending":2,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2025-05-26T06:51:52.211Z\",\"extraHops\":0,\"url\":\"https:\\/\\/gb64.com\\/forum\\/viewtopic.php?t=4718&start=75\",\"added\":\"2025-05-25T03:15:39.095Z\",\"depth\":5}","{\"seedId\":0,\"started\":\"2025-05-26T06:51:19.626Z\",\"extraHops\":0,\"url\":\"https:\\/\\/gb64.com\\/forum\\/viewtopic.php?t=4718&start=45&view=print\",\"added\":\"2025-05-25T03:15:39.085Z\",\"depth\":5}"]}}
{"timestamp":"2025-05-26T06:53:23.217Z","logLevel":"error","context":"fetch","message":"Direct fetch of page URL timed out","details":{"seconds":91,"page":"https://gb64.com/forum/viewtopic.php?t=4718&start=75","workerid":0}}
{"timestamp":"2025-05-26T06:53:23.243Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://gb64.com/forum/viewtopic.php?p=19229"}}

So, from the above:

{"timestamp":"2025-05-26T06:51:37.792Z","logLevel":"error","context":"general","message":"Timed out getting page title, something is likely wrong","details":{"seconds":5,"page":"https://gb64.com/forum/viewtopic.php?t=4718&start=45&view=print","workerid":1}}

After that there is no progress: the crawl keeps timing out on every page. I also see that the brave processes are doing nothing.
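
To pin down when progress stopped, one option is to track the "crawled" counter in the "Crawl statistics" entries. This is a rough Python sketch under the same one-JSON-object-per-line assumption; `last_progress` is a hypothetical helper, and the sample entries are the first three statistics entries from the 2025-05-20 log with `details` shortened to the two counters.

```python
import json

# First three "Crawl statistics" entries from the 2025-05-20 log, details shortened.
SAMPLE_STATS = [
    '{"timestamp":"2025-05-20T12:20:55.314Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":487,"total":1377}}',
    '{"timestamp":"2025-05-20T12:21:10.371Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":487,"total":1377}}',
    '{"timestamp":"2025-05-20T12:21:25.423Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":487,"total":1377}}',
]


def last_progress(log_lines):
    """Return (crawled, timestamp) for the last entry where 'crawled' increased.

    Returns (None, None) if there are no crawlStatus entries.
    """
    last_value, last_change = None, None
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or non-JSON lines
        if not isinstance(entry, dict) or entry.get("context") != "crawlStatus":
            continue
        crawled = entry["details"]["crawled"]
        if last_value is None or crawled > last_value:
            last_value, last_change = crawled, entry["timestamp"]
    return last_value, last_change


if __name__ == "__main__":
    print(last_progress(SAMPLE_STATS))
```

Comparing the returned timestamp with the time of the last log line gives a rough measure of how long the crawl has been stalled.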

I will explore excluding more URLs and also experiment with

--keep --saveState always

so that I can somehow resume after days of crawling.

I'm not sure why this is so unstable with the site I'm trying to make available offline...

clydzik avatar May 26 '25 11:05 clydzik

Late follow-up. It seems that my problems came from non-headless runs. I first tried disabling the display manager, so that it uses xvfb with a headless buffer when it runs. This still failed occasionally. Later I discovered the obvious --headless flag, and it seems to do the trick. I did a few runs that took a few days each, and they completed successfully.

clydzik avatar Jul 23 '25 10:07 clydzik

Thank you @clydzik for the follow-up. Doesn't running headless have other negative side effects? I.e. does the ZIM work as expected? I don't remember the exact details, but I feel like not running headless by default was an educated choice.

benoit74 avatar Jul 24 '25 07:07 benoit74

Hey. So far I haven't noticed any negative side effects. The ZIM was created and looks right when browsed. I will also do more runs without the exclusions I added earlier in this step (comment): https://github.com/openzim/zimit/issues/500#issuecomment-2900671519

I will post an update on stability.

clydzik avatar Jul 25 '25 14:07 clydzik