browsertrix-crawler

Parameter --failOnFailedSeed exits Docker with ExitCode 0

Open · gitreich opened this issue 9 months ago • 5 comments

I made up 4 URLs (which don't exist on the web) and wrote them into a seeds.txt file. I started btrix-crawler 1.1.1 once with --failOnFailedSeed and once without. In both cases Docker exited with exit code 0, but in the log file you can see the crawl failed after the first seed. That part is correct, but the exit code should have been 1 according to the instructions here: https://crawler.docs.browsertrix.com/user-guide/cli-options/#crawler

Log file with --failOnFailedSeed:

{"timestamp":"2024-05-06T10:43:15.176Z","logLevel":"info","context":"general","message":"Browsertrix-Crawler 1.1.1 (with warcio.js 2.2.1)","details":{}}
{"timestamp":"2024-05-06T10:43:15.178Z","logLevel":"info","context":"general","message":"Seeds","details":[{"url":"https://1234nix.at/","scopeType":"prefix","include":["/^https?:\/\/1234nix\.at\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3},{"url":"https://faili.com/","scopeType":"prefix","include":["/^https?:\/\/faili\.com\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3},{"url":"https://nix.com/","scopeType":"prefix","include":["/^https?:\/\/nix\.com\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3},{"url":"https://failhere.at/","scopeType":"prefix","include":["/^https?:\/\/failhere\.at\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3}]}
{"timestamp":"2024-05-06T10:43:15.232Z","logLevel":"info","context":"healthcheck","message":"Healthcheck server started on 9825","details":{}}
{"timestamp":"2024-05-06T10:43:15.984Z","logLevel":"info","context":"worker","message":"Creating 1 workers","details":{}}
{"timestamp":"2024-05-06T10:43:15.985Z","logLevel":"info","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"timestamp":"2024-05-06T10:43:16.200Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://1234nix.at/"}}
{"timestamp":"2024-05-06T10:43:16.202Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":4,"pending":1,"failed":0,"limit":{"max":5,"hit":false},"pendingPages":["{"seedId":0,"started":"2024-05-06T10:43:15.988Z","extraHops":0,"url":"https://1234nix.at/","added":"2024-05-06T10:43:15.311Z","depth":0}"]}}
{"timestamp":"2024-05-06T10:43:16.318Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://1234nix.at/","workerid":0}}
{"timestamp":"2024-05-06T10:43:16.355Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://1234nix.at/","errorText":"net::ERR_NAME_NOT_RESOLVED","page":"https://1234nix.at/","workerid":0}}
{"timestamp":"2024-05-06T10:43:16.358Z","logLevel":"fatal","context":"general","message":"Page Load Timeout, failing crawl. Quitting","details":{"msg":"net::ERR_NAME_NOT_RESOLVED at https://1234nix.at/","page":"https://1234nix.at/","workerid":0}}

Docker Process: e5208a09e9b0 webrecorder/browsertrix-crawler:1.1.1 "/docker-entrypoint.…" 10 minutes ago Exited (0) 10 minutes ago ONB_Btrix_invalid_urls_20240506124313
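
(For reference, a quick way to read the container's exit code programmatically rather than scanning docker ps output; this is just a small sketch assuming the Docker CLI is available, not something the crawler itself provides. The container name is the one from the run above.)

```ts
// Sketch (not part of browsertrix-crawler): read a container's exit code via
// `docker inspect`. Assumes the Docker CLI is installed and the container
// has not yet been removed.
import { execFileSync } from "node:child_process";

function containerExitCode(containerName: string): number {
  const out = execFileSync(
    "docker",
    ["inspect", "--format", "{{.State.ExitCode}}", containerName],
    { encoding: "utf-8" },
  );
  return parseInt(out.trim(), 10);
}

// Container name taken from the run shown above.
console.log(containerExitCode("ONB_Btrix_invalid_urls_20240506124313"));
```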

gitreich commented on May 06 '24

Hi @gitreich! Oops, I think this is a regression that was introduced when we started allowing 4/5xx status codes for pages rather than throwing an error or moving on to the next page in the 1.x release of the crawler. But you're right, if --failOnFailedSeed is enabled, a 4/5xx response for a seed should still fail the crawl with an exit code of 1! I'll submit a patch for that this week.

tw4l commented on May 06 '24

Ah, I see what happened. We added a --failOnInvalidStatus option and were checking only that flag to decide whether the crawl should fail when a page returned a 4xx/5xx, but if --failOnFailedSeed is set and a seed returns a 4xx/5xx, we should fail the crawl regardless of that setting. Modifying the code and tests now.
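
(For readers following along, roughly the check being described, as a minimal sketch. This is not the actual browsertrix-crawler source; the option and function names are illustrative assumptions.)

```ts
// Minimal sketch of the seed-failure check described above; illustrative
// only, not the actual browsertrix-crawler source.
interface FailureOptions {
  failOnFailedSeed: boolean;
  failOnInvalidStatus: boolean;
}

// Should a 4xx/5xx response abort the whole crawl?
function shouldFailCrawlOnStatus(
  opts: FailureOptions,
  isSeed: boolean,
  statusCode: number,
): boolean {
  if (statusCode < 400) {
    return false;
  }
  // The regression: only failOnInvalidStatus was consulted here. A failed
  // seed should abort the crawl whenever --failOnFailedSeed is set,
  // regardless of --failOnInvalidStatus.
  if (isSeed && opts.failOnFailedSeed) {
    return true;
  }
  // Otherwise, whether the page merely counts as failed depends on
  // --failOnInvalidStatus and is handled elsewhere; it does not by itself
  // abort the crawl.
  return false;
}
```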

tw4l commented on May 06 '24

In the meantime, can you confirm whether running the crawl with both --failOnFailedSeed and --failOnInvalidStatus works as you'd expect (i.e. quits with exit code 1)?

We used to always throw an error on 4xx/5xx responses, but changed that behavior for 1.x.

tw4l commented on May 06 '24

I started:

docker run -d --name ONB_Btrix_invalid_urls_20240507090214 -e NODE_OPTIONS="--max-old-space-size=32768" -p 9397:9397 -p 12157:12157 -v /home/antares/Schreibtisch/Docker/browsertrix/crawls/:/crawls/ webrecorder/browsertrix-crawler:1.1.1 crawl --screencastPort 9397 --seedFile /crawls/config/invalid_urls_seeds.txt --scopeType prefix --depth 3 --extraHops 0 --workers 1 --healthCheckPort 12157 --headless --restartsOnError --failOnFailedSeed --failOnInvalidStatus --delay 1 --waitUntil networkidle0 --saveState always --logging stats,info --warcInfo ONB_CRAWL_invalid_urls_Depth_3_20240507090214 --userAgentSuffix +ONB_Bot_Btrix_1.1.1, [email protected] --crawlId id_ONB_CRAWL_invalid_urls_Depth_3_20240507090214 --collection invalid_urls_20240507090214

but the result was the same as before: the crawl failed on the first seed as expected, yet Docker exited with exit code 0 (expected: 1).

Log file:

{"timestamp":"2024-05-07T07:02:16.368Z","logLevel":"info","context":"general","message":"Browsertrix-Crawler 1.1.1 (with warcio.js 2.2.1)","details":{}} {"timestamp":"2024-05-07T07:02:16.371Z","logLevel":"info","context":"general","message":"Seeds","details":[{"url":"https://1234nix.at/","scopeType":"prefix","include":["/^https?:\/\/1234nix\.at\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3},{"url":"https://faili.com/","scopeType":"prefix","include":["/^https?:\/\/faili\.com\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3},{"url":"https://nix.com/","scopeType":"prefix","include":["/^https?:\/\/nix\.com\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3},{"url":"https://failhere.at/","scopeType":"prefix","include":["/^https?:\/\/failhere\.at\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3}]} {"timestamp":"2024-05-07T07:02:16.653Z","logLevel":"info","context":"healthcheck","message":"Healthcheck server started on 12157","details":{}} {"timestamp":"2024-05-07T07:02:18.077Z","logLevel":"info","context":"worker","message":"Creating 1 workers","details":{}} {"timestamp":"2024-05-07T07:02:18.078Z","logLevel":"info","context":"worker","message":"Worker starting","details":{"workerid":0}} {"timestamp":"2024-05-07T07:02:18.328Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://1234nix.at/"}} {"timestamp":"2024-05-07T07:02:18.331Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":4,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{"seedId":0,"started":"2024-05-07T07:02:18.081Z","extraHops":0,"url":"https://1234nix.at/","added":"2024-05-07T07:02:16.726Z","depth":0}"]}} {"timestamp":"2024-05-07T07:02:18.469Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://1234nix.at/","workerid":0}} {"timestamp":"2024-05-07T07:02:18.502Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://1234nix.at/","errorText":"net::ERR_NAME_NOT_RESOLVED","page":"https://1234nix.at/","workerid":0}} {"timestamp":"2024-05-07T07:02:18.505Z","logLevel":"fatal","context":"general","message":"Page Load Timeout, failing crawl. Quitting","details":{"msg":"net::ERR_NAME_NOT_RESOLVED at https://1234nix.at/","page":"https://1234nix.at/","workerid":0}}

docker ps -a:
4d0c0864837d webrecorder/browsertrix-crawler:1.1.1 "/docker-entrypoint.…" 18 seconds ago Exited (0) 14 seconds ago ONB_Btrix_invalid_urls_20240507090214

gitreich commented on May 07 '24

Without --restartsOnError I receive exit code 17:

docker run -d --name ONB_Btrix_invalid_urls_20240507090634 -e NODE_OPTIONS="--max-old-space-size=32768" -p 9961:9961 -p 13181:13181 -v /home/antares/Schreibtisch/Docker/browsertrix/crawls/:/crawls/ webrecorder/browsertrix-crawler:1.1.1 crawl --screencastPort 9961 --seedFile /crawls/config/invalid_urls_seeds.txt --scopeType prefix --depth 3 --extraHops 0 --workers 1 --healthCheckPort 13181 --headless --failOnFailedSeed --failOnInvalidStatus --delay 1 --waitUntil networkidle0 --saveState always --logging stats,info --warcInfo ONB_CRAWL_invalid_urls_Depth_3_20240507090634 --userAgentSuffix +ONB_Bot_Btrix_1.1.1, [email protected] --crawlId id_ONB_CRAWL_invalid_urls_Depth_3_20240507090634 --collection invalid_urls_20240507090634

fbbbc7113739 webrecorder/browsertrix-crawler:1.1.1 "/docker-entrypoint.…" 25 seconds ago Exited (17) 22 seconds ago ONB_Btrix_invalid_urls_20240507090634

gitreich commented on May 07 '24

Ah, got it! When the URL wasn't reachable due to DNS not resolving, the crawler was falling back to the default fatal status code of 17. Putting in a patch for that shortly with updated tests.
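
(In other words, the fatal-error path needs to distinguish a failed seed from other fatal errors. A rough sketch of that mapping follows; exit codes 0, 1, and 17 are the values discussed in this thread, while the enum and function names are assumptions, not the crawler's actual implementation.)

```ts
// Illustrative sketch only, not the actual crawler code.
enum ExitCodes {
  Success = 0,
  FailedSeed = 1, // expected when --failOnFailedSeed aborts the crawl
  Fatal = 17,     // generic fatal fallback observed above
}

// Map a fatal page-load error (e.g. net::ERR_NAME_NOT_RESOLVED) to an exit code.
function exitCodeForFatalError(isSeed: boolean, failOnFailedSeed: boolean): ExitCodes {
  // Before the patch, an unreachable seed fell through to the generic fatal
  // code even when --failOnFailedSeed was set.
  return isSeed && failOnFailedSeed ? ExitCodes.FailedSeed : ExitCodes.Fatal;
}
```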

tw4l commented on May 08 '24

With 1.1.2 I receive exit code 1 for invalid seeds with the parameter --failOnFailedSeed. Great!

The only thing I am not sure about is --failOnInvalidStatus alone (not combined with --failOnFailedSeed):

Case URL does not exist: exit code 0, every non-existing URL was visited
Case 500 response: exit code 0, every seed was visited
Case 404 response: exit code 0, every seed was visited

In all 3 cases the seed result was failed (for all given seeds), but the Docker exit code was 0. Looking at the docs, it seems a combination with another parameter is needed to force an exit code other than 0, so I tried combining it with --failOnFailedLimit 1, which was not working for me. Opened #575

gitreich commented on May 16 '24

Hi @gitreich, glad to hear this is working with --failOnFailedSeed now!

For the other behavior you're describing, I think that may be as expected. In the 1.x releases, if --failOnInvalidStatus is not set, the crawler doesn't consider 4xx/5xx responses failures, so they wouldn't trigger --failOnFailedLimit. And without --failOnFailedSeed, we allow seeds to fail/be non-existent as long as at least one of them resolves to something (even if it's a 404 page).

This is a change from 0.x but allows us to be more flexible and precise with what behavior is expected.
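
(To make the flag interaction described here concrete, a hedged sketch in code form; the names and the assumed default for --failOnFailedLimit are for illustration only and are not the crawler's actual implementation.)

```ts
// Sketch of the 1.x flag semantics as described in this thread;
// illustrative only.
interface CrawlFlags {
  failOnFailedSeed: boolean;     // abort (exit 1) when any seed fails
  failOnInvalidStatus: boolean;  // count 4xx/5xx responses as page failures
  failOnFailedLimit: number;     // abort once this many pages failed (0 = disabled, assumed default)
}

// Does this page count as failed?
function pageFailed(flags: CrawlFlags, statusCode: number, networkError: boolean): boolean {
  if (networkError) return true;                         // e.g. net::ERR_NAME_NOT_RESOLVED
  return flags.failOnInvalidStatus && statusCode >= 400; // 4xx/5xx only count with the flag set
}

// Should the crawl abort with a non-zero exit code?
function crawlShouldAbort(flags: CrawlFlags, seedFailed: boolean, failedCount: number): boolean {
  if (flags.failOnFailedSeed && seedFailed) return true;
  if (flags.failOnFailedLimit > 0 && failedCount >= flags.failOnFailedLimit) return true;
  return false;
}
```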

tw4l commented on May 16 '24