browsertrix-crawler
Parameter --failOnFailedSeed exits Docker with ExitCode 0
I made up 4 URLs (none of which exist on the web) and wrote them into a seeds.txt file. I started browsertrix-crawler 1.1.1 once with --failOnFailedSeed and once without. In both cases Docker exited with exit code 0, even though the log file shows the crawl failed after the first seed. The failure itself is correct, but the exit code should have been 1 according to the instructions here: https://crawler.docs.browsertrix.com/user-guide/cli-options/#crawler
Log file with --failOnFailedSeed:
{"timestamp":"2024-05-06T10:43:15.176Z","logLevel":"info","context":"general","message":"Browsertrix-Crawler 1.1.1 (with warcio.js 2.2.1)","details":{}}
{"timestamp":"2024-05-06T10:43:15.178Z","logLevel":"info","context":"general","message":"Seeds","details":[{"url":"https://1234nix.at/","scopeType":"prefix","include":["/^https?:\/\/1234nix\.at\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3},{"url":"https://faili.com/","scopeType":"prefix","include":["/^https?:\/\/faili\.com\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3},{"url":"https://nix.com/","scopeType":"prefix","include":["/^https?:\/\/nix\.com\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3},{"url":"https://failhere.at/","scopeType":"prefix","include":["/^https?:\/\/failhere\.at\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3}]}
{"timestamp":"2024-05-06T10:43:15.232Z","logLevel":"info","context":"healthcheck","message":"Healthcheck server started on 9825","details":{}}
{"timestamp":"2024-05-06T10:43:15.984Z","logLevel":"info","context":"worker","message":"Creating 1 workers","details":{}}
{"timestamp":"2024-05-06T10:43:15.985Z","logLevel":"info","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"timestamp":"2024-05-06T10:43:16.200Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://1234nix.at/"}}
{"timestamp":"2024-05-06T10:43:16.202Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":4,"pending":1,"failed":0,"limit":{"max":5,"hit":false},"pendingPages":["{"seedId":0,"started":"2024-05-06T10:43:15.988Z","extraHops":0,"url":"https://1234nix.at/","added":"2024-05-06T10:43:15.311Z","depth":0}"]}}
{"timestamp":"2024-05-06T10:43:16.318Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://1234nix.at/","workerid":0}}
{"timestamp":"2024-05-06T10:43:16.355Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://1234nix.at/","errorText":"net::ERR_NAME_NOT_RESOLVED","page":"https://1234nix.at/","workerid":0}}
{"timestamp":"2024-05-06T10:43:16.358Z","logLevel":"fatal","context":"general","message":"Page Load Timeout, failing crawl. Quitting","details":{"msg":"net::ERR_NAME_NOT_RESOLVED at https://1234nix.at/","page":"https://1234nix.at/","workerid":0}}
Docker Process: e5208a09e9b0 webrecorder/browsertrix-crawler:1.1.1 "/docker-entrypoint.…" 10 minutes ago Exited (0) 10 minutes ago ONB_Btrix_invalid_urls_20240506124313
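For reference, the recorded exit code can also be read back directly from Docker. This is just a way to verify the behavior, not part of the crawler itself; the container name is the one from the docker output above:

```bash
# Read the exit code Docker recorded for the finished container
# (container name taken from the docker output above).
docker inspect --format '{{.State.ExitCode}}' ONB_Btrix_invalid_urls_20240506124313
# Prints 0 here; the --failOnFailedSeed docs lead you to expect 1.
```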
Hi @gitreich! Oops, I think this is a regression that was introduced when we started allowing 4xx/5xx status codes for pages rather than throwing an error or moving on to the next page in the 1.x release of the crawler. But you're right: if --failOnFailedSeed is enabled, a 4xx/5xx response for a seed should still fail the crawl with an exit code of 1! I'll submit a patch for that this week.
Ah, I see what happened. We added a --failOnInvalidStatus option and were checking it to decide whether the crawl should fail when a page returned 4xx/5xx, but if --failOnFailedSeed is set and a seed returns 4xx/5xx, we should fail the crawl regardless of that setting. Modifying the code and tests now.
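For illustration, the intended decision is roughly the following. This is a minimal sketch based only on the explanation above, not the actual crawler source; the type and function names are made up:

```typescript
// Illustrative sketch only, not the actual crawler source.
// Flag semantics are taken from this thread; names here are hypothetical.
interface CrawlFlags {
  failOnInvalidStatus: boolean; // treat 4xx/5xx responses as failed pages
  failOnFailedSeed: boolean;    // a failed seed should fail the whole crawl
}

// Decide whether a 4xx/5xx response on a seed should abort the crawl
// with exit code 1.
function shouldFailCrawlForSeedStatus(
  status: number,
  isSeed: boolean,
  flags: CrawlFlags,
): boolean {
  const invalidStatus = status >= 400;
  if (!invalidStatus) {
    return false;
  }
  // The regression: only flags.failOnInvalidStatus was consulted here.
  // The fix: a seed with an invalid status fails the crawl whenever
  // --failOnFailedSeed is set, regardless of --failOnInvalidStatus.
  return isSeed && flags.failOnFailedSeed;
}
```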
In the meantime, can you confirm that if you run the crawl with both --failOnFailedSeed and --failOnInvalidStatus, it works as you'd expect (i.e. quits with an exit code of 1)?
We used to just always throw an error on 4xx/5xx responses, but changed that behavior for 1.x
I started:
docker run -d --name ONB_Btrix_invalid_urls_20240507090214 -e NODE_OPTIONS="--max-old-space-size=32768" -p 9397:9397 -p 12157:12157 -v /home/antares/Schreibtisch/Docker/browsertrix/crawls/:/crawls/ webrecorder/browsertrix-crawler:1.1.1 crawl --screencastPort 9397 --seedFile /crawls/config/invalid_urls_seeds.txt --scopeType prefix --depth 3 --extraHops 0 --workers 1 --healthCheckPort 12157 --headless --restartsOnError --failOnFailedSeed --failOnInvalidStatus --delay 1 --waitUntil networkidle0 --saveState always --logging stats,info --warcInfo ONB_CRAWL_invalid_urls_Depth_3_20240507090214 --userAgentSuffix +ONB_Bot_Btrix_1.1.1, [email protected] --crawlId id_ONB_CRAWL_invalid_urls_Depth_3_20240507090214 --collection invalid_urls_20240507090214
but the result was the same as before: the crawl failed on the first seed as expected, yet Docker exited with code 0 (expected: 1).
Log file:
{"timestamp":"2024-05-07T07:02:16.368Z","logLevel":"info","context":"general","message":"Browsertrix-Crawler 1.1.1 (with warcio.js 2.2.1)","details":{}} {"timestamp":"2024-05-07T07:02:16.371Z","logLevel":"info","context":"general","message":"Seeds","details":[{"url":"https://1234nix.at/","scopeType":"prefix","include":["/^https?:\/\/1234nix\.at\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3},{"url":"https://faili.com/","scopeType":"prefix","include":["/^https?:\/\/faili\.com\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3},{"url":"https://nix.com/","scopeType":"prefix","include":["/^https?:\/\/nix\.com\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3},{"url":"https://failhere.at/","scopeType":"prefix","include":["/^https?:\/\/failhere\.at\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3}]} {"timestamp":"2024-05-07T07:02:16.653Z","logLevel":"info","context":"healthcheck","message":"Healthcheck server started on 12157","details":{}} {"timestamp":"2024-05-07T07:02:18.077Z","logLevel":"info","context":"worker","message":"Creating 1 workers","details":{}} {"timestamp":"2024-05-07T07:02:18.078Z","logLevel":"info","context":"worker","message":"Worker starting","details":{"workerid":0}} {"timestamp":"2024-05-07T07:02:18.328Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://1234nix.at/"}} {"timestamp":"2024-05-07T07:02:18.331Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":4,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{"seedId":0,"started":"2024-05-07T07:02:18.081Z","extraHops":0,"url":"https://1234nix.at/","added":"2024-05-07T07:02:16.726Z","depth":0}"]}} {"timestamp":"2024-05-07T07:02:18.469Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://1234nix.at/","workerid":0}} {"timestamp":"2024-05-07T07:02:18.502Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://1234nix.at/","errorText":"net::ERR_NAME_NOT_RESOLVED","page":"https://1234nix.at/","workerid":0}} {"timestamp":"2024-05-07T07:02:18.505Z","logLevel":"fatal","context":"general","message":"Page Load Timeout, failing crawl. Quitting","details":{"msg":"net::ERR_NAME_NOT_RESOLVED at https://1234nix.at/","page":"https://1234nix.at/","workerid":0}}
Docker ps -a 4d0c0864837d webrecorder/browsertrix-crawler:1.1.1 "/docker-entrypoint.…" 18 seconds ago Exited (0) 14 seconds ago ONB_Btrix_invalid_urls_20240507090214
Without --restartsOnError I receive exit code 17:
docker run -d --name ONB_Btrix_invalid_urls_20240507090634 -e NODE_OPTIONS="--max-old-space-size=32768" -p 9961:9961 -p 13181:13181 -v /home/antares/Schreibtisch/Docker/browsertrix/crawls/:/crawls/ webrecorder/browsertrix-crawler:1.1.1 crawl --screencastPort 9961 --seedFile /crawls/config/invalid_urls_seeds.txt --scopeType prefix --depth 3 --extraHops 0 --workers 1 --healthCheckPort 13181 --headless --failOnFailedSeed --failOnInvalidStatus --delay 1 --waitUntil networkidle0 --saveState always --logging stats,info --warcInfo ONB_CRAWL_invalid_urls_Depth_3_20240507090634 --userAgentSuffix +ONB_Bot_Btrix_1.1.1, [email protected] --crawlId id_ONB_CRAWL_invalid_urls_Depth_3_20240507090634 --collection invalid_urls_20240507090634
fbbbc7113739 webrecorder/browsertrix-crawler:1.1.1 "/docker-entrypoint.…" 25 seconds ago Exited (17) 22 seconds ago ONB_Btrix_invalid_urls_20240507090634
Ah, got it! When the URL wasn't reachable due to DNS not resolving, the crawler was falling back to the default fatal status code of 17. Putting in a patch for that shortly with updated tests.
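To spell out the exit-code mapping described above: this is only a sketch with assumed names, not the crawler's actual implementation; the values 1 and 17 are the ones observed in this thread.

```typescript
// Sketch of the intended exit-code mapping; names are hypothetical.
const EXIT_FAILED_SEED = 1; // expected with --failOnFailedSeed when a seed fails
const EXIT_FATAL = 17;      // default fatal exit code observed above

function exitCodeForFatalError(
  isSeedFailure: boolean,
  failOnFailedSeed: boolean,
): number {
  // Before the patch, an unresolvable seed (net::ERR_NAME_NOT_RESOLVED)
  // fell through to the generic fatal code 17 even with --failOnFailedSeed set.
  if (isSeedFailure && failOnFailedSeed) {
    return EXIT_FAILED_SEED;
  }
  return EXIT_FATAL;
}
```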
With 1.1.2 I receive exit code 1 for invalid seeds with the parameter --failOnFailedSeed. Great!
The only thing I am not sure about is --failOnInvalidStatus on its own (not combined with --failOnFailedSeed):
Case URL does not exist: exit code 0, every non-existing URL was visited
Case 500 response: exit code 0, every seed was visited
Case 404 response: exit code 0, every seed was visited
In all 3 cases every given seed ended up as failed, but the Docker exit code was 0. Looking at the docs it seems a combination with another parameter is needed to force an exit code other than 0, so I tried combining it with --failOnFailedLimit 1, which did not work for me. Opened #575.
Hi @gitreich, glad to hear this is working with --failOnFailedSeed now!
For the other behavior you're describing, I think that may be as expected. In the 1.x releases, if --failOnInvalidStatus is not set, the crawler doesn't consider 4xx/5xx responses failures, so they wouldn't trigger --failOnFailedLimit. And without --failOnFailedSeed, we allow seeds to fail or be non-existent as long as at least one of them resolves to something (even if it's a 404 page). This is a change from 0.x, but it allows us to be more flexible and precise about what behavior is expected.
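Based on that explanation, this is roughly how the two flags would be combined so that 4xx/5xx pages count toward the failure limit. It is an illustrative sketch only: the volume path and collection name are placeholders, and whether the combination behaves as expected in practice is what #575 is tracking.

```bash
# Illustrative only: treat 4xx/5xx responses as failed pages
# (--failOnInvalidStatus) and abort the crawl once one page has failed
# (--failOnFailedLimit 1). Volume path and collection name are placeholders.
docker run -v "$PWD/crawls:/crawls" webrecorder/browsertrix-crawler:1.1.2 crawl \
  --seedFile /crawls/config/invalid_urls_seeds.txt \
  --failOnInvalidStatus \
  --failOnFailedLimit 1 \
  --collection failed_limit_test
```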