crawl4ai
[Bug]: Same results returned for different URLs, and "Failed on navigating ACS-GOTO" when crawling multiple URLs
crawl4ai version
0.6.3
Expected Behavior
Each URL should return its own data, and crawling multiple URLs should not raise the "Failed on navigating ACS-GOTO" error.
I tried different cache_mode settings, but the problems persist. With CacheMode.ENABLED the "Failed on navigating ACS-GOTO" error happens much less often, but it can still occur.
Current Behavior
I am getting the same results even though the URLs are different. This does not happen every time or for every URL, but in the last two outputs below the URLs differ while the extracted data is identical. If I crawl a single URL instead of multiple URLs, the problem disappears and the data comes back correctly.
I am also getting the "Failed on navigating ACS-GOTO" error when crawling multiple URLs. It is random: sometimes it appears, sometimes it does not. It never occurs with a single-URL crawl.
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
import os
import asyncio
import re
import time
from typing import List
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, CrawlerMonitor, DisplayMode, \
GeolocationConfig, RegexExtractionStrategy
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, RateLimiter
from typing import Optional, Dict
async def crawl_parallel(urls: List[str]):
"""
Parallel crawling while reusing browser instance - best for large workloads
"""
print("\n=== Parallel Crawling with Browser Reuse ===")
browser_config = BrowserConfig(
browser_mode="builtin",
headless=True,
# browser_type="chromium",
use_managed_browser=True, # Enable advanced browser management features
use_persistent_context=True, # Persist the browser session across runs
# viewport={
# "width": 1920,
# "height": 1080,
# },
# user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
extra_args=["--disable-dev-shm-usage", "--disable-extensions"],
user_agent_mode="random",
user_agent_generator_config={
"platforms": ["mobile"],
"os": ["Android"], #Android, iOS
"browsers": ["Chrome Mobile"], #Chrome Mobile, Chrome Mobile iOS
},
text_mode=True, # Disable images and heavy content for faster page loads
light_mode=True, # Enable a minimalistic browser setup for performance gains
ignore_https_errors=True, # Ignore HTTPS certificate errors if True
user_data_dir="/home/Artur/.crawl4ai/profiles/my-login-profile", # Directory to store persistent browser data
# chrome_channel="chrome",
# channel="chrome",
)
crawl_config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator(
# content_filter=PruningContentFilter(), In case you need fit_markdown
),
locale="ca-AD", # Accept-Language & UI locale set to Spanish (Spain)
timezone_id="Europe/Andorra", # JS Date()/Intl timezone set to Madrid, Spain
geolocation=GeolocationConfig( # override GPS coords with coordinates for Madrid
latitude=42.5077902,
longitude=1.5210902,
accuracy=10.0, # accuracy in meters
),
js_code=[
# accept_button_script
],
js_only=False,
word_count_threshold=3, #
excluded_tags=["nav", "footer"],
cache_mode=CacheMode.BYPASS, #
screenshot=False,
stream=False, #
wait_for_images=False, # Wait for images to load before capturing the final HTML snapshot
wait_for=None, # CSS selector or JavaScript condition to wait for before capturing the HTML
delay_before_return_html=0.1, # Additional delay (in seconds) before returning the final HTML content
# page_timeout=60000,
mean_delay = 0.5, #
max_range = 0.9, #
semaphore_count = 3, #
scroll_delay = 0.4, # Delay between scroll actions during full-page scanning
scan_full_page=False, # Automatically scroll the entire page to load dynamic content
process_iframes=False, #
remove_overlay_elements=True, # Remove overlay elements (e.g., popups, cookie banners) after page load
magic=True, # Enable advanced overlay handling
simulate_user=True, # Simulate user interactions to avoid bot detection
override_navigator=True, # Override navigator properties for stealth purposes
capture_mhtml=False, #
image_score_threshold=1, #
exclude_internal_links=True,
exclude_external_images= True, #
exclude_external_links=True,
check_robots_txt=True, #
session_id="persistent_session",
keep_data_attributes=False,
only_text=True,
# css_selector=".main-content",
remove_forms=True,
)
dispatcher = MemoryAdaptiveDispatcher(
rate_limiter=None,
max_session_permit=len(urls), # allow all URLs concurrently
memory_threshold_percent = 95.0,
# monitor=CrawlerMonitor(),
)
start_time = time.perf_counter()
async with AsyncWebCrawler(config=browser_config) as crawler:
result_container = await crawler.arun_many(urls=urls, config=crawl_config, dispatcher=dispatcher, extraction_strategy=None)
results = []
if isinstance(result_container, list):
results = result_container
else:
async for res in result_container:
results.append(res)
for result in results:
if result.success:
info = extract_product_info(result)
print(info)
# print(result.markdown)
# print(result.cleaned_html)
# print("Extraction JSON:", result.extracted_content)
# print(f"Successfully crawled: {result.url}")
# print(f"Title: {result.metadata.get('title', 'N/A')}")
# print(f"Description: {result.metadata.get('description', 'N/A')}")
# print(result.markdown)
# print(f"Word count: {len(result.markdown.split())}")
# print(f"Number of internal links: {len(result.links.get('internal', []))}")
# print(f"Number of external links: {len(result.links.get('external', []))}")
# print(f"Number of images: {len(result.media.get('images', []))}")
# print(f"Image links: {result.media.get('images', [])}")
# print(f"Number of videos: {len(result.metadata.get('videos', []))}")
print("---")
# if result.screenshot:
# from base64 import b64decode
# with open(os.path.join(__location__, "screenshot2.png"), "wb") as f:
# f.write(b64decode(result.screenshot))
else:
print(f"Failed to crawl: {result.url}")
print(f"Error: {result.error_message}")
print("---")
total_time = time.perf_counter() - start_time
return total_time, results
async def main():
urls = [
"https://www.pyrenees.ad/alimentacio/es/bebe/alimentacion-infantil/leches/hipp-leche-continuacion-combiotik-2-800g",
"https://www.pyrenees.ad/alimentacio/es/dietetica/productos-para-celiacos/cerveza/ambar-cerveza-00-sgluten-botella-33cl"
"https://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/huevos/grandes-l-63-a-73gr/el-meu-ou-huevos-l-retractil-12u",
"https://www.pyrenees.ad/alimentacio/es/frutas-y-verduras/frutas/fruta/manzana-golden-kg",
"https://www.pyrenees.ad/alimentacio/es/desayunos-dulces-y-pan/bombones/con-licor/maxims-paris-bombones-20u-200g",
"https://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/quesos/tierno-semicurado-y-curado/1605-herencia-4m-manchego-dop-1kg",
"https://www.pyrenees.ad/alimentacio/es/platos-preparados/sushi/sushi/913-tartar-salmon-200g",
]
await crawl_parallel(urls)
if __name__ == "__main__":
asyncio.run(main())
OS
Linux
Python version
3.12
Browser
Chrome
Browser version
No response
Error logs & Screenshots (if applicable)
=== Parallel Crawling with Browser Reuse ===
[BROWSER]. ℹ pre-launch cleanup failed: Command '[['lsof', '-t', '-i:9222']]' returned non-zero exit status 1.
[INIT].... → Crawl4AI 0.6.3
[ERROR]... × https://www.pyrenees.ad...paris-bombones-20u-200g | Error: Unexpected error in _crawl_web at line 744 in _crawl_web
(../../../../anaconda3/envs/crawl/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: net::ERR_ABORTED at https://www.pyrenees.ad/alimentacio/es/desayunos-dulces-y-pan/bombones/con-licor/maxims-paris-bombones-20u-200g
Call log:
- navigating to "https://www.pyrenees.ad/alimentacio/es/desayunos-dulces-y-pan/bombones/con-licor/maxims-paris-bombones-20u-200g", waiting until "domcontentloaded"
Code context:
739 response = await page.goto(
740 url, wait_until=config.wait_until, timeout=config.page_timeout
741 )
742 redirected_url = page.url
743 except Error as e:
744 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
745
746 await self.execute_hook(
747 "after_goto", page, context=context, url=url, response=response, config=config
748 )
749
[ERROR]... × https://www.pyrenees.ad...-huevos-l-retractil-12u | Error: Unexpected error in _crawl_web at line 744 in _crawl_web
(../../../../anaconda3/envs/crawl/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: net::ERR_ABORTED at
https://www.pyrenees.ad/alimentacio/es/dietetica/productos-para-celiacos/cerveza/ambar-cerveza-00-sgluten-botella-33clhttps://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/huevos/grandes-l-63-a-73gr/el-meu-ou-huevos-l
-retractil-12u
Call log:
- navigating to
"https://www.pyrenees.ad/alimentacio/es/dietetica/productos-para-celiacos/cerveza/ambar-cerveza-00-sgluten-botella-33clhttps://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/huevos/grandes-l-63-a-73gr/el-meu-ou-huevos-
l-retractil-12u", waiting until "domcontentloaded"
Code context:
739 response = await page.goto(
740 url, wait_until=config.wait_until, timeout=config.page_timeout
741 )
742 redirected_url = page.url
743 except Error as e:
744 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
745
746 await self.execute_hook(
747 "after_goto", page, context=context, url=url, response=response, config=config
748 )
749
[ERROR]... × https://www.pyrenees.ad...fruta/manzana-golden-kg | Error: Unexpected error in _crawl_web at line 744 in _crawl_web
(../../../../anaconda3/envs/crawl/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: net::ERR_ABORTED at https://www.pyrenees.ad/alimentacio/es/frutas-y-verduras/frutas/fruta/manzana-golden-kg
Call log:
- navigating to "https://www.pyrenees.ad/alimentacio/es/frutas-y-verduras/frutas/fruta/manzana-golden-kg", waiting until "domcontentloaded"
Code context:
739 response = await page.goto(
740 url, wait_until=config.wait_until, timeout=config.page_timeout
741 )
742 redirected_url = page.url
743 except Error as e:
744 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
745
746 await self.execute_hook(
747 "after_goto", page, context=context, url=url, response=response, config=config
748 )
749
[ERROR]... × https://www.pyrenees.ad...cia-4m-manchego-dop-1kg | Error: Unexpected error in _crawl_web at line 744 in _crawl_web
(../../../../anaconda3/envs/crawl/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: net::ERR_ABORTED at https://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/quesos/tierno-semicurado-y-curado/1605-herencia-4m-manchego-dop-1kg
Call log:
- navigating to "https://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/quesos/tierno-semicurado-y-curado/1605-herencia-4m-manchego-dop-1kg", waiting until "domcontentloaded"
Code context:
739 response = await page.goto(
740 url, wait_until=config.wait_until, timeout=config.page_timeout
741 )
742 redirected_url = page.url
743 except Error as e:
744 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
745
746 await self.execute_hook(
747 "after_goto", page, context=context, url=url, response=response, config=config
748 )
749
[FETCH]... ↓ https://www.pyrenees.ad/alimentacio/es/platos-preparados/sushi/sushi/913-tartar-salmon-200g | ✓ | ⏱: 1.19s
[SCRAPE].. ◆ https://www.pyrenees.ad/alimentacio/es/platos-preparados/sushi/sushi/913-tartar-salmon-200g | ✓ | ⏱: 0.15s
[COMPLETE] ● https://www.pyrenees.ad/alimentacio/es/platos-preparados/sushi/sushi/913-tartar-salmon-200g | ✓ | ⏱: 1.34s
[FETCH]... ↓ https://www.pyrenees.ad/alimentacio/es/bebe/alim.../leches/hipp-leche-continuacion-combiotik-2-800g | ✓ | ⏱: 1.81s
[SCRAPE].. ◆ https://www.pyrenees.ad/alimentacio/es/bebe/alim.../leches/hipp-leche-continuacion-combiotik-2-800g | ✓ | ⏱: 0.17s
[COMPLETE] ● https://www.pyrenees.ad/alimentacio/es/bebe/alim.../leches/hipp-leche-continuacion-combiotik-2-800g | ✓ | ⏱: 1.98s
Failed to crawl: https://www.pyrenees.ad/alimentacio/es/desayunos-dulces-y-pan/bombones/con-licor/maxims-paris-bombones-20u-200g
Error: Unexpected error in _crawl_web at line 744 in _crawl_web (../../../../anaconda3/envs/crawl/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: net::ERR_ABORTED at https://www.pyrenees.ad/alimentacio/es/desayunos-dulces-y-pan/bombones/con-licor/maxims-paris-bombones-20u-200g
Call log:
- navigating to "https://www.pyrenees.ad/alimentacio/es/desayunos-dulces-y-pan/bombones/con-licor/maxims-paris-bombones-20u-200g", waiting until "domcontentloaded"
Code context:
739 response = await page.goto(
740 url, wait_until=config.wait_until, timeout=config.page_timeout
741 )
742 redirected_url = page.url
743 except Error as e:
744 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
745
746 await self.execute_hook(
747 "after_goto", page, context=context, url=url, response=response, config=config
748 )
749
---
Failed to crawl: https://www.pyrenees.ad/alimentacio/es/dietetica/productos-para-celiacos/cerveza/ambar-cerveza-00-sgluten-botella-33clhttps://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/huevos/grandes-l-63-a-73gr/el-meu-ou-huevos-l-retractil-12u
Error: Unexpected error in _crawl_web at line 744 in _crawl_web (../../../../anaconda3/envs/crawl/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: net::ERR_ABORTED at https://www.pyrenees.ad/alimentacio/es/dietetica/productos-para-celiacos/cerveza/ambar-cerveza-00-sgluten-botella-33clhttps://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/huevos/grandes-l-63-a-73gr/el-meu-ou-huevos-l-retractil-12u
Call log:
- navigating to "https://www.pyrenees.ad/alimentacio/es/dietetica/productos-para-celiacos/cerveza/ambar-cerveza-00-sgluten-botella-33clhttps://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/huevos/grandes-l-63-a-73gr/el-meu-ou-huevos-l-retractil-12u", waiting until "domcontentloaded"
Code context:
739 response = await page.goto(
740 url, wait_until=config.wait_until, timeout=config.page_timeout
741 )
742 redirected_url = page.url
743 except Error as e:
744 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
745
746 await self.execute_hook(
747 "after_goto", page, context=context, url=url, response=response, config=config
748 )
749
---
Failed to crawl: https://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/quesos/tierno-semicurado-y-curado/1605-herencia-4m-manchego-dop-1kg
Error: Unexpected error in _crawl_web at line 744 in _crawl_web (../../../../anaconda3/envs/crawl/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: net::ERR_ABORTED at https://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/quesos/tierno-semicurado-y-curado/1605-herencia-4m-manchego-dop-1kg
Call log:
- navigating to "https://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/quesos/tierno-semicurado-y-curado/1605-herencia-4m-manchego-dop-1kg", waiting until "domcontentloaded"
Code context:
739 response = await page.goto(
740 url, wait_until=config.wait_until, timeout=config.page_timeout
741 )
742 redirected_url = page.url
743 except Error as e:
744 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
745
746 await self.execute_hook(
747 "after_goto", page, context=context, url=url, response=response, config=config
748 )
749
---
Failed to crawl: https://www.pyrenees.ad/alimentacio/es/frutas-y-verduras/frutas/fruta/manzana-golden-kg
Error: Unexpected error in _crawl_web at line 744 in _crawl_web (../../../../anaconda3/envs/crawl/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: net::ERR_ABORTED at https://www.pyrenees.ad/alimentacio/es/frutas-y-verduras/frutas/fruta/manzana-golden-kg
Call log:
- navigating to "https://www.pyrenees.ad/alimentacio/es/frutas-y-verduras/frutas/fruta/manzana-golden-kg", waiting until "domcontentloaded"
Code context:
739 response = await page.goto(
740 url, wait_until=config.wait_until, timeout=config.page_timeout
741 )
742 redirected_url = page.url
743 except Error as e:
744 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
745
746 await self.execute_hook(
747 "after_goto", page, context=context, url=url, response=response, config=config
748 )
749
---
{'product_name': '913-TARTAR SALMON 200G', 'ref_code': None, 'price': '9.98', 'url': 'https://www.pyrenees.ad/alimentacio/es/platos-preparados/sushi/sushi/913-tartar-salmon-200g'}
---
{'product_name': '913-TARTAR SALMON 200G', 'ref_code': None, 'price': '9.98', 'url': 'https://www.pyrenees.ad/alimentacio/es/bebe/alimentacion-infantil/leches/hipp-leche-continuacion-combiotik-2-800g'}
---
Hi @Auth0rM0rgan, thanks for using C4ai. It looks like there's a bit of confusion here, so let me clear things up:
You’re mixing up two ideas:
- session_id: this keeps one browser tab/page alive across multiple calls to .arun(), so you can reuse cookies or login state, but it only works for calls that run one after another (sequentially).
- arun_many: this is all about running lots of URLs at the same time (in parallel).
If you pass a single CrawlerRunConfig (with, say, session_id="abc") into arun_many, what’s really happening is all your parallel crawls are fighting over the same page/tab. Playwright really doesn’t like it when multiple things try to use the exact same browser page at once—you’ll get weird errors or unexpected results.
So, unfortunately, you can’t just give a whole batch of URLs and, say, a list of different configs (each with their own session_id) to arun_many—at least not yet! I’ve designed arun_many so it takes a list of URLs and just ONE CrawlerRunConfig, and that single config gets reused for all URLs.
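For example, the simplest adjustment to the reproduction code above is to keep a single CrawlerRunConfig but drop the shared session_id, so arun_many can give every URL its own isolated page. A minimal sketch, reusing browser_config, dispatcher, and urls from the snippet above (all other options omitted for brevity):

crawl_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    # session_id="persistent_session",  # omit this when using arun_many
)
async with AsyncWebCrawler(config=browser_config) as crawler:
    results = await crawler.arun_many(urls=urls, config=crawl_config, dispatcher=dispatcher)
    for result in results:
        print(result.url, result.success)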
I totally see why you’d want to give different configs for different URLs (so each run is totally isolated and can have its own session/tab/cookies/etc). That’s a good idea, and I do have it in my backlog to make arun_many accept both a list of URLs and a list of configs, matching them up one-to-one. But as of now, it’s not possible.
So for the time being, if you really need each crawl to have its own session (and you need them to run in parallel), you’ll have to spin up your own little task loop and call .arun() yourself for each URL, giving each its own config with a unique session_id.
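A minimal sketch of that workaround (the helper name crawl_isolated and the session naming are illustrative, not part of the library): each URL gets its own CrawlerRunConfig with a unique session_id, and the .arun() calls run concurrently via asyncio.gather.

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def crawl_isolated(urls):
    browser_config = BrowserConfig(headless=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        async def crawl_one(i, url):
            # one config per URL, each with its own session_id,
            # so no two parallel crawls ever share a page/tab
            config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                session_id=f"session-{i}",
            )
            return await crawler.arun(url=url, config=config)
        return await asyncio.gather(*(crawl_one(i, u) for i, u in enumerate(urls)))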
Hope this clears things up a bit! If you have more questions, just let me know, I'm happy to help or clarify.
We have since implemented support for different configurations for different URLs in arun_many. Here is an example of the code: https://github.com/unclecode/crawl4ai/blob/main/docs/examples/demo_multi_config_clean.py
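Roughly, the pattern is to pass a list of configs, each with a url_matcher, and let the crawler pick the matching config per URL. The sketch below is simplified and illustrative only; treat the linked demo as the authoritative example, since parameter details may differ:

product_config = CrawlerRunConfig(
    url_matcher="*/lacteos-y-huevos/*",  # illustrative pattern
    cache_mode=CacheMode.BYPASS,
)
default_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)  # fallback for all other URLs
async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(urls=urls, config=[product_config, default_config])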
I'll close this issue, but feel free to continue the conversation and tag me if the issue persists with our latest version: 0.7.7.