[Bug]: crawl breaks when log_console=True
crawl4ai version
0.4.248 and above
Expected Behavior
I expected console logs to be printed to stdout, without an exception being raised.
Current Behavior
This is happening:
$ python web_crawler/crawl_page_mardown.py
Starting authenticated crawl of 1 URLs
Successfully authenticated. Retrieved 2 cookies.
[INIT].... → Crawl4AI 0.4.248
[CONSOLE]. ℹ Console: [NOVA] Initiating Nova countdown...
[CONSOLE]. ℹ Console: Google Maps JavaScript API has been loaded directly without loading=async. This can result in suboptimal performance. For best-practice loading patterns please see https://goo.gle/js-api-loading
[CONSOLE]. ℹ Console: Access to XMLHttpRequest at 'https://maps.googleapis.com/maps/api/mapsjs/gen_204?csp_test=true' from origin 'https://poc18.demo.dev.charge.ampeco.tech' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource.
[CONSOLE]. ℹ Console: Failed to load resource: net::ERR_FAILED
[CONSOLE]. ℹ Console: Element with name "gmp-internal-element-support-verification" already defined.
[CONSOLE]. ℹ Console: You have included the Google Maps JavaScript API multiple times on this page. This may cause unexpected errors.
[CONSOLE]. ℹ Console: Google Maps JavaScript API has been loaded directly without loading=async. This can result in suboptimal performance. For best-practice loading patterns please see https://goo.gle/js-api-loading
[CONSOLE]. ℹ Console: Element with name "gmp-internal-use-place-details" already defined.
[CONSOLE]. ℹ Console: Element with name "gmp-map" already defined.
[CONSOLE]. ℹ Console: Access to XMLHttpRequest at 'https://maps.googleapis.com/maps/api/mapsjs/gen_204?csp_test=true' from origin 'https://poc18.demo.dev.charge.ampeco.tech' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource.
[CONSOLE]. ℹ Console: Failed to load resource: net::ERR_FAILED
[CONSOLE]. ℹ Console: [NOVA] We have lift off!
[CONSOLE]. ℹ Console: [NOVA] All systems go...
[CONSOLE]. ℹ Console: [NOVA] Syncing Inertia props to the store...
Error occurred in event listener
Traceback (most recent call last):
File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/asyncio.py", line 77, in _emit_run
coro: Any = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/playwright/_impl/_browser_context.py", line 170, in <lambda>
lambda params: self._on_page_error(
^^^^^^^^^^^^^^^^^^^^
File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/playwright/_impl/_browser_context.py", line 700, in _on_page_error
page.emit(Page.Events.PageError, error)
File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/asyncio.py", line 68, in emit
return super().emit(event, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/base.py", line 212, in emit
handled = self._call_handlers(event, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/base.py", line 188, in _call_handlers
self._emit_run(f, args, kwargs)
File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/asyncio.py", line 79, in _emit_run
self.emit("error", exc)
File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/asyncio.py", line 68, in emit
return super().emit(event, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/base.py", line 215, in emit
self._emit_handle_potential_error(event, args[0] if args else None)
File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/base.py", line 173, in _emit_handle_potential_error
raise error
File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/pyee/asyncio.py", line 77, in _emit_run
coro: Any = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/playwright/_impl/_impl_to_api_mapping.py", line 123, in wrapper_func
return handler(
^^^^^^^^
File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py", line 1317, in <lambda>
page.on("pageerror", lambda e: log_consol(e, "error"))
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/andre/.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py", line 1307, in log_consol
params={"msg": msg.text},
^^^^^^^^
AttributeError: 'Error' object has no attribute 'text'
[ERROR]... × https://poc18.demo.dev.charge.ampeco.tech/admin/re... | Error:
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ × Unexpected error in _crawl_web at line 528 in wrap_api_call (../../.virtualenvs/ai_sly_ops/lib/python3.11/site- │
│ packages/playwright/_impl/_connection.py): │
│ Error: Page.wait_for_selector: 'Error' object has no attribute 'text' │
│ │
│ Code context: │
│ 523 parsed_st = _extract_stack_trace_information_from_stack(st, is_internal) │
│ 524 self._api_zone.set(parsed_st) │
│ 525 try: │
│ 526 return await cb() │
│ 527 except Exception as error: │
│ 528 → raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None │
│ 529 finally: │
│ 530 self._api_zone.set(None) │
│ 531 │
│ 532 def wrap_api_call_sync( │
│ 533 self, cb: Callable[[], Any], is_internal: bool = False │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Failed crawling https://poc18.demo.dev.charge.ampeco.tech/admin/resources/charge-points: Unexpected error in _crawl_web at line 528 in wrap_api_call (../../.virtualenvs/ai_sly_ops/lib/python3.11/site-packages/playwright/_impl/_connection.py):
Error: Page.wait_for_selector: 'Error' object has no attribute 'text'
Code context:
523 parsed_st = _extract_stack_trace_information_from_stack(st, is_internal)
524 self._api_zone.set(parsed_st)
525 try:
526 return await cb()
527 except Exception as error:
528 → raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
529 finally:
530 self._api_zone.set(None)
531
532 def wrap_api_call_sync(
533 self, cb: Callable[[], Any], is_internal: bool = False
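For what it's worth, the traceback suggests the root cause: crawl4ai's `log_consol` handler is registered for both the `console` event (which delivers a `ConsoleMessage` that has a `.text` attribute) and the `pageerror` event (which delivers a Playwright `Error` object that has `.message`/`.stack` instead). A minimal, dependency-free sketch of that mismatch and a defensive fallback (class and function names here are illustrative stand-ins, not the actual crawl4ai internals):

```python
# Stand-in for Playwright's ConsoleMessage: exposes .text
class FakeConsoleMessage:
    text = "hello from the page"

# Stand-in for Playwright's Error (from "pageerror"): exposes .message, not .text
class FakeError:
    message = "ReferenceError: foo is not defined"

def log_consol(msg):
    # Mirrors the failing call site: assumes every event payload has .text
    return msg.text

def log_consol_defensive(msg):
    # Hypothetical fix: fall back to .message, then str(), so both
    # console messages and page errors can be logged
    text = getattr(msg, "text", None)
    if text is None:
        text = getattr(msg, "message", str(msg))
    return text

print(log_consol(FakeConsoleMessage()))        # works for console messages
try:
    log_consol(FakeError())                    # AttributeError, as in the logs above
except AttributeError as exc:
    print(f"AttributeError: {exc}")
print(log_consol_defensive(FakeError()))       # handles the pageerror payload too
```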
Is this reproducible?
Yes
Inputs Causing the Bug
`log_console=True` in the `CrawlerRunConfig`
Steps to Reproduce
Use the following code and crawl the page. This seems to happen only for the specific page I am crawling. (I can provide the HTML of the page in confidence if needed.)
Note that crawling succeeds if log_console=False.
import os
import json
import asyncio
import logging_config  # local module with logging setup
import logging
import re
import datetime
from typing import List, Dict, Tuple, Optional
from playwright.async_api import async_playwright
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

# Environment-specific values (placeholders -- set these for your deployment)
LOGIN_URL = "https://..."
DASHBOARD_URL = "https://..."
LOGIN_USER = "..."
LOGIN_PASS = "..."

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
## Cookies
async def get_login_cookies() -> List[dict]:
"""Authenticate using Playwright and get full browser state"""
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context()
page = await context.new_page()
try:
await page.goto(LOGIN_URL, wait_until="networkidle", timeout=15000)
await page.fill('input[name="email"]', LOGIN_USER)
await page.fill('input[name="password"]', LOGIN_PASS)
login_btn = page.locator('button.flex:has-text("Login")')
await login_btn.wait_for(timeout=10000)
await login_btn.click()
await page.wait_for_url(DASHBOARD_URL, timeout=15000)
cookies = await context.cookies()
logger.info(f"Authenticated. Retrieved cookies {cookies}.")
await browser.close()
return cookies
except Exception as e:
await browser.close()
logger.error(f"Login failed: {str(e)}")
raise
## Crawler
async def crawl_url(url: str, max_concurrent: int = 10):
"""Main crawling function with proper authentication"""
try:
cookies = await get_login_cookies()
logger.info(f"Successfully authenticated. Retrieved {len(cookies)} cookies.")
except Exception as e:
logger.error(f"Authentication failed: {str(e)}")
return
browser_config = BrowserConfig(
headless=True,
verbose=True,
extra_args=[
"--disable-gpu",
"--no-sandbox",
"--disable-dev-shm-usage",
"--disable-setuid-sandbox",
"--disable-images",
"--disable-fonts",
],
cookies=cookies,
java_script_enabled=True,
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
)
async with AsyncWebCrawler(config=browser_config) as crawler:
try:
validated_url = url
logger.info(f"Processing {validated_url}")
result = await crawler.arun(
url=validated_url,
config=CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
wait_until="networkidle",
page_timeout=60000,
scan_full_page=True,
verbose=True,
log_console=True,
),
timeout=90,
)
if result.success:
logger.info(f"Successfully crawled {validated_url}")
logger.debug(f"Markdown content: {result.markdown_v2.raw_markdown}")
else:
logger.error(
f"Failed crawling {validated_url}: {result.error_message}"
)
except Exception as e:
logger.error(f"Fatal error processing {validated_url}: {str(e)}")
async def main():
URL = "https://..." # user must change the URL for a real one
logger.info(f"Starting the crawl for {URL}")
await crawl_url(url=URL)
if __name__ == "__main__":
asyncio.run(main())
Code snippets
Code in Steps to reproduce
OS
macOS: Darwin MacBook-Pro.local 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:15 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6000 arm64
Python version
3.11
Browser
irrelevant
Browser version
irrelevant
Error logs & Screenshots (if applicable)
@tropxy I'm not able to reproduce this error. Could you share your full code? (For example, I'm not able to see which URL you used or how you did your imports.)
Try upgrading to 0.5 and try once.
Based on your logs, I don't think the crawl breaks because of log_console=True; rather, it is breaking for some other reason, and the logger is unable to handle the resulting exception correctly. So the crawl fails for another reason, which then triggers a second error inside the exception handling.
Hi @aravindkarnam, thanks for the reply. I have extended the shared code to include the imports and the main function.
Regarding the URL, this is not a public page and I can't publicly share details of it, so I see two options:
- Would it work if I provided you the raw HTML of the relevant page directly, or any other data I can get from the browser dev tools?
- If you give me your email, I could temporarily add you to the backend users so you can test it yourself.
Option 1 would be safer for me, but if that would be complicated, we can go with option 2.