crawlee icon indicating copy to clipboard operation
crawlee copied to clipboard

feat: implement ErrorSnapshotter for error context capture

Open HamzaAlwan opened this issue 1 year ago • 13 comments

This commit introduces the ErrorSnapshotter class to the crawlee package, providing functionality to capture screenshots and HTML snapshots when an error occurs during web crawling.

Closes #2280

HamzaAlwan avatar Feb 12 '24 17:02 HamzaAlwan

Hey @B4nan,

I have completed the pull request that resolves the issue with saving Save screenshot/HTML on first occurrence of error in error statistics.

However, when I was trying things out, I ran into a few hiccups. I used the debugger and set up a quick test file with Cheerio and the browser to check if everything was working fine. The good news is that the object passed the test, but I couldn't get around to testing the file saved in the key-value store properly.

Normal tests ran fine, but seems there are errors but I don't believe they are related to my changes.

Unfortunately, I was unable to run the e2e tests or build it due to some errors in the terminal, and I wasn't able to figure out what caused them.

I'm using node v18.17.1 and yarn v4.0.2

yarn test Error:

 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > browser should launch with rotated custom proxy
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > browser should launch with rotated custom proxy
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > browser should launch with rotated custom proxy
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > browser should launch with rotated custom proxy
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > proxy rotation on error respects maxSessionRotations, calls failedRequestHandler
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > proxy rotation on error respects maxSessionRotations, calls failedRequestHandler
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > proxy rotation on error respects maxSessionRotations, calls failedRequestHandler
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > proxy rotation on error respects maxSessionRotations, calls failedRequestHandler
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > proxy rotation logs the original proxy error
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > proxy rotation logs the original proxy error
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > proxy rotation logs the original proxy error
 FAIL  test/core/crawlers/browser_crawler.test.ts > BrowserCrawler > proxy > proxy rotation logs the original proxy error
TargetCloseError: Protocol error (Runtime.callFunctionOn): Session closed. Most likely the page has been closed.
 ❯ CdpCDPSession.send node_modules/puppeteer-core/src/cdp/CDPSession.ts:106:6
 ❯ ExecutionContext.#evaluate node_modules/puppeteer-core/src/cdp/ExecutionContext.ts:292:6
 ❯ ExecutionContext.evaluate node_modules/puppeteer-core/lib/esm/puppeteer/cdp/ExecutionContext.js:124:36
    125|      * @example
    126|      *
    127|      * ```ts
       |      ^
    128|      * const context = await page.mainFrame().executionContext();
    129|      * const handle: JSHandle<typeof globalThis> = await context.evaluateHandle(
 ❯ IsolatedWorld.evaluate node_modules/puppeteer-core/src/cdp/IsolatedWorld.ts:153:14
 ❯ CdpFrame.evaluate node_modules/puppeteer-core/src/api/Frame.ts:513:30
 ❯ CdpFrame.<anonymous> node_modules/puppeteer-core/src/util/decorators.ts:64:2
 ❯ CdpFrame.content node_modules/puppeteer-core/src/api/Frame.ts:788:10
 ❯ CdpFrame.<anonymous> node_modules/puppeteer-core/src/util/decorators.ts:64:2
 ❯ CdpPage.content node_modules/puppeteer-core/lib/esm/puppeteer/api/Page.js:516:43
 ❯ ErrorSnapshotter.captureSnapshot packages/utils/src/internals/error_snapshotter.ts:38:38

yarn build Error:

@crawlee/utils#build: command (/Users/hamza/Desktop/apify/crawlee/packages/utils) /private/var/folders/4z/qgr81fxx5695006fwrzq6k7r0000gn/T/xfs-504fbf48/yarn run build exited (2)

 Tasks:    3 successful, 5 total
Cached:    3 cached, 5 total
  Time:    3.881s 
Failed:    @crawlee/utils#build

 ERROR  run failed: command  exited (2)

yarn test:e2e Error:

node:internal/errors:496
    ErrorCaptureStackTrace(err);
    ^

Error [ERR_MODULE_NOT_FOUND]: Cannot find module '/Users/hamza/Desktop/apify/crawlee/packages/utils/dist/index.mjs' imported from /Users/hamza/Desktop/apify/crawlee/test/e2e/tools.mjs
    at new NodeError (node:internal/errors:405:5)
    at finalizeResolution (node:internal/modules/esm/resolve:324:11)
    at moduleResolve (node:internal/modules/esm/resolve:943:10)
    at defaultResolve (node:internal/modules/esm/resolve:1129:11)
    at nextResolve (node:internal/modules/esm/loader:163:28)
    at ESMLoader.resolve (node:internal/modules/esm/loader:835:30)
    at ESMLoader.getModuleJob (node:internal/modules/esm/loader:424:18)
    at ModuleWrap.<anonymous> (node:internal/modules/esm/module_job:77:40)
    at link (node:internal/modules/esm/module_job:76:36) {
  code: 'ERR_MODULE_NOT_FOUND'
}

Node.js v18.17.1

HamzaAlwan avatar Feb 12 '24 17:02 HamzaAlwan

Can you send the full build log you have from the build command? There should be more information 👀

vladfrangu avatar Feb 12 '24 18:02 vladfrangu

@vladfrangu

Can you send the full build log you have from the build command? There should be more information 👀

Full build log:

yarn build 
╭───────────────────────────────────────────────────────────────────────╮
│                                                                       │
│                  Update available v1.11.3 ≫ v1.12.3                   │
│    Changelog: https://github.com/vercel/turbo/releases/tag/v1.12.3    │
│           Run "npx @turbo/codemod@latest update" to update            │
│                                                                       │
│        Follow @turborepo for updates: https://x.com/turborepo         │
╰───────────────────────────────────────────────────────────────────────╯
• Packages in scope: @crawlee/basic, @crawlee/browser, @crawlee/browser-pool, @crawlee/cheerio, @crawlee/cli, @crawlee/core, @crawlee/http, @crawlee/jsdom, @crawlee/linkedom, @crawlee/memory-storage, @crawlee/playwright, @crawlee/puppeteer, @crawlee/templates, @crawlee/types, @crawlee/utils, crawlee
• Running build in 16 packages
• Remote caching disabled
@crawlee/types:build: cache hit (outputs already on disk), replaying logs d5696cf6c1be5593
@crawlee/templates:build: cache hit (outputs already on disk), replaying logs 5de49aba4a4af3e3
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Validating 8 templates
@crawlee/templates:build: Validating template getting-started-ts
@crawlee/templates:build: Finished validating getting-started-ts
@crawlee/templates:build: Validating template getting-started-js
@crawlee/templates:build: Finished validating getting-started-js
@crawlee/templates:build: Validating template cheerio-ts
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Debugger attached.
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/types:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Finished validating cheerio-ts
@crawlee/templates:build: Validating template playwright-ts
@crawlee/templates:build: Finished validating playwright-ts
@crawlee/templates:build: Validating template puppeteer-ts
@crawlee/templates:build: Finished validating puppeteer-ts
@crawlee/templates:build: Validating template cheerio-js
@crawlee/templates:build: Finished validating cheerio-js
@crawlee/templates:build: Validating template playwright-js
@crawlee/templates:build: Finished validating playwright-js
@crawlee/templates:build: Validating template puppeteer-js
@crawlee/templates:build: Finished validating puppeteer-js
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Debugger attached.
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/templates:build: Waiting for the debugger to disconnect...
@crawlee/utils:build: cache miss, executing 4ba9529396006646
@crawlee/cli:build: cache hit (outputs already on disk), replaying logs 05dc6b290eb9c6bb
@crawlee/memory-storage:build: cache miss, executing a3934740eaa87094
@crawlee/utils:build: error TS2209: The project root is ambiguous, but is required to resolve export map entry '.' in file '/Users/hamza/Desktop/apify/crawlee/packages/utils/package.json'. Supply the `rootDir` compiler option to disambiguate.
@crawlee/utils:build: 
@crawlee/utils:build: 
@crawlee/utils:build: Found 1 error.
@crawlee/utils:build: 
@crawlee/utils:build: ERROR: command finished with error: command (/Users/hamza/Desktop/apify/crawlee/packages/utils) /private/var/folders/4z/qgr81fxx5695006fwrzq6k7r0000gn/T/xfs-fada239d/yarn run build exited (2)
@crawlee/utils#build: command (/Users/hamza/Desktop/apify/crawlee/packages/utils) /private/var/folders/4z/qgr81fxx5695006fwrzq6k7r0000gn/T/xfs-fada239d/yarn run build exited (2)

 Tasks:    4 successful, 5 total
Cached:    3 cached, 5 total
  Time:    5.055s 
Failed:    @crawlee/utils#build

 ERROR  run failed: command  exited (2)

HamzaAlwan avatar Feb 12 '24 18:02 HamzaAlwan

Just checked CI logs too, and yeah.. probably updating imports with what I sent might fix it

vladfrangu avatar Feb 12 '24 18:02 vladfrangu

Just checked CI logs too, and yeah.. probably updating imports with what I sent might fix it

Yes, it's fixed now.

HamzaAlwan avatar Feb 12 '24 18:02 HamzaAlwan

A couple of things:

1- I need to handle saving the local file path to the screenshot and HTML, currently, I only save the Apify path which will not work locally. 2- Please verify the naming or the path of the screenshot and HTML properties, and the method that generates the file names. 3- Not sure if the way I handled saving the error is the correct way, as we have errorTracker and errorTrackerRetry. 4- Should we save the screenshot and HTML inside the default key-value store? 5- Regarding this:

Some errors happen before any page is created or opened, before navigation happens, or after the page is already closed (maybe then a response object is still available to store HTML?)

It's not possible to save the HTML or take a screenshot if an error happens before navigation, and I have not yet looked at the case for post-navigation

HamzaAlwan avatar Feb 12 '24 23:02 HamzaAlwan

Please add some tests for the new functionality.

Also, note that what you did (making the add method async) is in fact breaking change, but I guess we can afford to do that here as the class is rather internal (but it's not marked as such and is publically documented). Maybe it would be better to add another method, e.g. addAsync.

B4nan avatar Feb 13 '24 07:02 B4nan

Thanks. I will check it a bit later this week

metalwarrior665 avatar Feb 13 '24 08:02 metalwarrior665

@metalwarrior665 ,

Here is an example, clicking on the (path + cmd) opens the file:

"retryErrors": {
    "file:///Users/hamza/Desktop/playwright-test/main.js:18:19": {
      "missing error code": {
        "Error": {
          "This is an error": {
            "count": 3,
            "firstErrorScreenshot": "file:///Users/hamza/Desktop/playwright-test/storage/key_value_stores/default/SNAPSHOT_de2cd33186d663537f2b441be5c372_This-is-an-error.jpeg",
            "firstErrorHtml": "file:///Users/hamza/Desktop/playwright-test/storage/key_value_stores/default/SNAPSHOT_de2cd33186d663537f2b441be5c372_This-is-an-error.html"
          }
        }
      }
    }
  }

HamzaAlwan avatar Feb 20 '24 09:02 HamzaAlwan

@HamzaAlwan When adding tests, let's add a few real-world errors so we can properly trim very long error messages. I wonder if the 30 chars error length could be longer but let's see how the real world errors will look like. Thanks.

metalwarrior665 avatar Feb 21 '24 08:02 metalwarrior665

Everything is done, please look it over and provide any feedback.

What is left is this:

We should be able to capture snapshots of e.g. pages that navigated but then were rejected as blocked by status code. Those should already have a body and perhaps the page object

This code works for browsers because we can access the page in the context. However, it throws an error without returning the response ref1 and ref2. I'm unsure if we want to change the code's behavior, as doing so could cause further issues, and it may also be another breaking change.

HamzaAlwan avatar Feb 23 '24 15:02 HamzaAlwan

@B4nan @vladfrangu @barjin Hey guys, let's get this merged pls. It is really useful for us and we believe for the community as well. We are ready to refactor it if needed. Thanks

metalwarrior665 avatar Apr 11 '24 11:04 metalwarrior665

@HamzaAlwan Please look at @B4nan 's remarks. Let's prioritize this so we can finish it finally. cc @AndreyBykov

metalwarrior665 avatar May 09 '24 10:05 metalwarrior665

Great that you addressed the comments just now @HamzaAlwan, we want to ship v3.10 in the afternoon most probably, so let's try to squeeze this in. Don't forget to rebase and resolve the conflicts.

B4nan avatar May 15 '24 06:05 B4nan

@HamzaAlwan please try to work on this asap if you want to see this in v3.10.0

B4nan avatar May 15 '24 07:05 B4nan

I might be late to the party, but why ErrorSnapshotter and not ErrorScreenshotter or something? "Snapshot" is pretty ambiguous.

janbuchar avatar May 15 '24 09:05 janbuchar

it does not store just screenshots, but also html snapshots, right?

edit: in fact it does not store any screenshots when used with non-browser crawlers

B4nan avatar May 15 '24 09:05 B4nan

We use the term snapshot as "store representation of the page" like screenshot, HTML (potentially HAR , JS etc. if we would want that). I agree the term is confusing but we didn't find better one. We already have https://crawlee.dev/api/3.7/playwright-crawler/namespace/playwrightUtils#saveSnapshot

metalwarrior665 avatar May 15 '24 09:05 metalwarrior665