
[Bug]: crawl breaks when log_console=True

Open tropxy opened this issue 1 year ago • 2 comments

crawl4ai version

0.4.248 and above

Expected Behavior

I expected console logs to be printed to stdout, without an exception being raised.

Current Behavior

This is happening:

$ python web_crawler/crawl_page_mardown.py

Starting authenticated crawl of 1 URLs
Successfully authenticated. Retrieved 2 cookies.
[INIT].... → Crawl4AI 0.4.248

[CONSOLE]. ℹ Console: [NOVA] Initiating Nova countdown...
[CONSOLE]. ℹ Console: Google Maps JavaScript API has been loaded directly without loading=async. This can result in suboptimal performance. For best-practice loading patterns please see https://goo.gle/js-api-loading
[CONSOLE]. ℹ Console: Access to XMLHttpRequest at 'https://maps.googleapis.com/maps/api/mapsjs/gen_204?csp_test=true' from origin 'https://poc18.demo.dev.charge.ampeco.tech' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource.
[CONSOLE]. ℹ Console: Failed to load resource: net::ERR_FAILED
[CONSOLE]. ℹ Console: Element with name "gmp-internal-element-support-verification" already defined.
[CONSOLE]. ℹ Console: You have included the Google Maps JavaScript API multiple times on this page. This may cause unexpected errors.
[CONSOLE]. ℹ Console: Google Maps JavaScript API has been loaded directly without loading=async. This can result in suboptimal performance. For best-practice loading patterns please see https://goo.gle/js-api-loading
[CONSOLE]. ℹ Console: Element with name "gmp-internal-use-place-details" already defined.
[CONSOLE]. ℹ Console: Element with name "gmp-map" already defined.
[CONSOLE]. ℹ Console: Access to XMLHttpRequest at 'https://maps.googleapis.com/maps/api/mapsjs/gen_204?csp_test=true' from origin 'https://poc18.demo.dev.charge.ampeco.tech' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource.
[CONSOLE]. ℹ Console: Failed to load resource: net::ERR_FAILED
[CONSOLE]. ℹ Console: [NOVA] We have lift off!
[CONSOLE]. ℹ Console: [NOVA] All systems go...
[CONSOLE]. ℹ Console: [NOVA] Syncing Inertia props to the store...
Error occurred in event listener
Traceback (most recent call last):
  File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/asyncio.py", line 77, in _emit_run
    coro: Any = f(*args, **kwargs)
                ^^^^^^^^^^^^^^^^^^
  File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/playwright/_impl/_browser_context.py", line 170, in <lambda>
    lambda params: self._on_page_error(
                   ^^^^^^^^^^^^^^^^^^^^
  File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/playwright/_impl/_browser_context.py", line 700, in _on_page_error
    page.emit(Page.Events.PageError, error)
  File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/asyncio.py", line 68, in emit
    return super().emit(event, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/base.py", line 212, in emit
    handled = self._call_handlers(event, args, kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/base.py", line 188, in _call_handlers
    self._emit_run(f, args, kwargs)
  File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/asyncio.py", line 79, in _emit_run
    self.emit("error", exc)
  File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/asyncio.py", line 68, in emit
    return super().emit(event, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/base.py", line 215, in emit
    self._emit_handle_potential_error(event, args[0] if args else None)
  File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/base.py", line 173, in _emit_handle_potential_error
    raise error
  File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/asyncio.py", line 77, in _emit_run
    coro: Any = f(*args, **kwargs)
                ^^^^^^^^^^^^^^^^^^
  File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/playwright/_impl/_impl_to_api_mapping.py", line 123, in wrapper_func
    return handler(
           ^^^^^^^^
  File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py", line 1317, in <lambda>
    page.on("pageerror", lambda e: log_consol(e, "error"))
                                   ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py", line 1307, in log_consol
    params={"msg": msg.text},
                   ^^^^^^^^
AttributeError: 'Error' object has no attribute 'text'
[ERROR]... × https://poc18.demo.dev.charge.ampeco.tech/admin/re... | Error:
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ × Unexpected error in _crawl_web at line 528 in wrap_api_call (../../.virtualenvs/ai_sly_ops/lib/python3.11/site-     │
│ packages/playwright/_impl/_connection.py):                                                                            │
│   Error: Page.wait_for_selector: 'Error' object has no attribute 'text'                                               │
│                                                                                                                       │
│   Code context:                                                                                                       │
│   523           parsed_st = _extract_stack_trace_information_from_stack(st, is_internal)                              │
│   524           self._api_zone.set(parsed_st)                                                                         │
│   525           try:                                                                                                  │
│   526               return await cb()                                                                                 │
│   527           except Exception as error:                                                                            │
│   528 →             raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None                          │
│   529           finally:                                                                                              │
│   530               self._api_zone.set(None)                                                                          │
│   531                                                                                                                 │
│   532       def wrap_api_call_sync(                                                                                   │
│   533           self, cb: Callable[[], Any], is_internal: bool = False                                                │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Failed crawling https://poc18.demo.dev.charge.ampeco.tech/admin/resources/charge-points: Unexpected error in _crawl_web at line 528 in wrap_api_call (../../.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/playwright/_impl/_connection.py):
Error: Page.wait_for_selector: 'Error' object has no attribute 'text'

Code context:
 523           parsed_st = _extract_stack_trace_information_from_stack(st, is_internal)
 524           self._api_zone.set(parsed_st)
 525           try:
 526               return await cb()
 527           except Exception as error:
 528 →             raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
 529           finally:
 530               self._api_zone.set(None)
 531
 532       def wrap_api_call_sync(
 533           self, cb: Callable[[], Any], is_internal: bool = False
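The traceback above points at the likely mechanism: crawl4ai's internal `log_consol` callback is registered for both the `console` and `pageerror` Playwright events, but the two events carry different payloads. `console` delivers a `ConsoleMessage` exposing `.text`, while `pageerror` delivers an `Error` exposing `.message`, hence the `AttributeError`. A defensive version of such a handler might look like the following sketch; `log_console_message` and the two stand-in classes are hypothetical names used purely for illustration, not crawl4ai's actual code:

```python
def log_console_message(msg, level="info"):
    """Log a Playwright console/page-error payload without assuming its type.

    Playwright's "console" event delivers a ConsoleMessage exposing .text,
    while "pageerror" delivers an Error exposing .message -- so probe for
    whichever attribute exists instead of hard-coding .text.
    """
    text = getattr(msg, "text", None)
    if text is None:
        text = getattr(msg, "message", None)
    if text is None:
        text = str(msg)  # last resort: any printable representation
    print(f"[CONSOLE] {level}: {text}")
    return text


# Minimal stand-ins mimicking the two Playwright event payload shapes:
class FakeConsoleMessage:
    text = "hello from the page"

class FakeError:
    message = "ReferenceError: foo is not defined"

log_console_message(FakeConsoleMessage())          # resolves via .text
log_console_message(FakeError(), level="error")    # resolves via .message, no AttributeError
```

With a probe like this, a `pageerror` payload no longer crashes the listener, so the crawl can continue while still logging the page error.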

Is this reproducible?

Yes

Inputs Causing the Bug

`log_console=True` in the `CrawlerRunConfig`

Steps to Reproduce

Use the following code and crawl the page. This only seems to happen on the specific page I am crawling. (I can provide the HTML of the page in confidence if needed.)

Note that crawling succeeds if `log_console=False`.

import os
import json
import asyncio
import logging_config  # project-local logging setup, not part of crawl4ai
import logging
import re
import datetime
from typing import List, Dict, Tuple, Optional
from playwright.async_api import async_playwright

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

## Cookies


async def get_login_cookies() -> List[dict]:
    """Authenticate using Playwright and get full browser state"""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        try:
            await page.goto(LOGIN_URL, wait_until="networkidle", timeout=15000)
            await page.fill('input[name="email"]', LOGIN_USER)
            await page.fill('input[name="password"]', LOGIN_PASS)
            login_btn = page.locator('button.flex:has-text("Login")')
            await login_btn.wait_for(timeout=10000)
            await login_btn.click()
            await page.wait_for_url(DASHBOARD_URL, timeout=15000)

            cookies = await context.cookies()
            logger.info(f"Authenticated. Retrieved cookies {cookies}.")
            await browser.close()
            return cookies

        except Exception as e:
            await browser.close()
            logger.error(f"Login failed: {str(e)}")
            raise


## Crawler


async def crawl_url(url: str, max_concurrent: int = 10):
    """Main crawling function with proper authentication"""
    try:
        cookies = await get_login_cookies()
        logger.info(f"Successfully authenticated. Retrieved {len(cookies)} cookies.")
    except Exception as e:
        logger.error(f"Authentication failed: {str(e)}")
        return

    browser_config = BrowserConfig(
        headless=True,
        verbose=True,
        extra_args=[
            "--disable-gpu",
            "--no-sandbox",
            "--disable-dev-shm-usage",
            "--disable-setuid-sandbox",
            "--disable-images",
            "--disable-fonts",
        ],
        cookies=cookies,
        java_script_enabled=True,
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        try:
            validated_url = url
            logger.info(f"Processing {validated_url}")

            result = await crawler.arun(
                url=validated_url,
                config=CrawlerRunConfig(
                    cache_mode=CacheMode.BYPASS,
                    wait_until="networkidle",
                    page_timeout=60000,
                    scan_full_page=True,
                    verbose=True,
                    log_console=True,
                ),
                timeout=90,
            )

            if result.success:
                logger.info(f"Successfully crawled {validated_url}")
                logger.debug(f"Markdown content: {result.markdown_v2.raw_markdown}")
            else:
                logger.error(
                    f"Failed crawling {validated_url}: {result.error_message}"
                )
        except Exception as e:
            logger.error(f"Fatal error processing {validated_url}: {str(e)}")

async def main():
    URL = "https://..."  # replace with a real URL
    logger.info(f"Starting the crawl for {URL}")
    await crawl_url(url=URL)
  
if __name__ == "__main__":
    asyncio.run(main())

Code snippets

Code in Steps to reproduce

OS

macOS: Darwin MacBook-Pro.local 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:15 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6000 arm64

Python version

3.11

Browser

irrelevant

Browser version

irrelevant

Error logs & Screenshots (if applicable)

tropxy avatar Mar 07 '25 20:03 tropxy

@tropxy I'm not able to reproduce this error. Could you share your full code? (e.g., I'm not able to see which URL you used or how you did the imports.)

Try upgrading to 0.5 and running it once more.

Based on your logs, I don't think the crawl breaks because of log_console=True; it is breaking for some other reason, and the logger is then unable to handle the resulting exception correctly. In other words, the crawl fails for another reason, and that failure triggers a second error inside the exception handling.
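The masking behaviour described here can be illustrated with a small stand-alone sketch. `MiniEmitter` below is a toy stand-in for pyee's `EventEmitter`, not crawl4ai's or pyee's actual code: when a listener itself raises, the emitter re-emits an "error" event, and if nothing handles that, the secondary exception is what surfaces, hiding whatever originally went wrong on the page:

```python
class MiniEmitter:
    """Toy event emitter mirroring pyee-style error propagation."""

    def __init__(self):
        self._handlers = {}

    def on(self, event, fn):
        self._handlers.setdefault(event, []).append(fn)

    def emit(self, event, *args):
        handlers = self._handlers.get(event, [])
        if not handlers and event == "error":
            # No error handler registered: the secondary exception escapes,
            # masking whatever originally triggered the event.
            raise args[0]
        for fn in handlers:
            try:
                fn(*args)
            except Exception as exc:
                self.emit("error", exc)


page = MiniEmitter()

# A broken listener, like log_consol assuming every payload has .text:
page.on("pageerror", lambda err: print(err.text))

class PageError:
    message = "ReferenceError: foo is not defined"

try:
    page.emit("pageerror", PageError())
except AttributeError as exc:
    # The AttributeError from the listener surfaces, not the page error itself.
    print(f"secondary error surfaced instead of the page error: {exc}")
```

This matches the traceback in the report: the `AttributeError` raised inside `log_consol` is what propagates, not the original page error.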

aravindkarnam avatar Mar 10 '25 10:03 aravindkarnam

> @tropxy I'm not able to reproduce this error. Could you share your full code? (e.g., I'm not able to see which URL you used or how you did the imports.)
>
> Try upgrading to 0.5 and running it once more.
>
> Based on your logs, I don't think the crawl breaks because of log_console=True; it is breaking for some other reason, and the logger is then unable to handle the resulting exception correctly. In other words, the crawl fails for another reason, and that failure triggers a second error inside the exception handling.

Hi @aravindkarnam, thanks for the reply. I have extended the shared code to include the imports and the main function.

Regarding the URL: this is not a public page and I can't publicly share details of it, so I see two options:

  1. Would it work if I provide you directly the raw HTML of the relevant page or any other data that I can get from the browser dev tools?
  2. If you give me your email, I could temporarily add you to the backend users so you can test it yourself.

Option 1 would be safer for me, but if that turns out to be complicated, we can go with option 2.

tropxy avatar Mar 15 '25 13:03 tropxy