
[Bug]: Same results returned for different URLs, and "Failed on navigating ACS-GOTO" when crawling multiple URLs

Auth0rM0rgan opened this issue 6 months ago

crawl4ai version

0.6.3

Expected Behavior

Each URL's data is extracted correctly, and crawling multiple URLs does not fail with the Failed on navigating ACS-GOTO error.

I tried different cache_mode settings, but the problems still exist. With CacheMode.ENABLED the Failed on navigating ACS-GOTO error happens much less often, but it can still occur.

Current Behavior

I am getting the same results even though the URLs are different! This doesn't always happen, and it doesn't affect all URLs. If you look at the last two outputs below, the URLs are different but the data is the same. However, if I crawl a single URL instead of multiple URLs, this problem disappears and the data comes back correctly.

Also, I get the Failed on navigating ACS-GOTO error when I crawl multiple URLs (it never happens with a single crawl). It's random: sometimes I get the error, sometimes everything is fine.

Is this reproducible?

Yes

Inputs Causing the Bug


Steps to Reproduce


Code snippets

import asyncio
import os
import re
import time
from typing import Dict, List, Optional

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, CrawlerMonitor, DisplayMode, \
    GeolocationConfig, RegexExtractionStrategy
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, RateLimiter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def crawl_parallel(urls: List[str]):
    """
    Parallel crawling while reusing browser instance - best for large workloads
    """
    print("\n=== Parallel Crawling with Browser Reuse ===")


    browser_config = BrowserConfig(
        browser_mode="builtin",
        headless=True,
        # browser_type="chromium",
        use_managed_browser=True, # Enable advanced browser management features
        use_persistent_context=True, # Persist the browser session across runs
        # viewport={
        #     "width": 1920,
        #     "height": 1080,
        # },
        # user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        extra_args=["--disable-dev-shm-usage", "--disable-extensions"],
        user_agent_mode="random",
        user_agent_generator_config={
            "platforms": ["mobile"],
            "os": ["Android"], #Android, iOS
            "browsers": ["Chrome Mobile"], #Chrome Mobile, Chrome Mobile iOS
        },

        text_mode=True, # Disable images and heavy content for faster page loads
        light_mode=True, # Enable a minimalistic browser setup for performance gains
        ignore_https_errors=True, # Ignore HTTPS certificate errors if True
        user_data_dir="/home/Artur/.crawl4ai/profiles/my-login-profile", # Directory to store persistent browser data
        # chrome_channel="chrome",
        # channel="chrome",
    )

    crawl_config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            # content_filter=PruningContentFilter(),  # in case you need fit_markdown
        ),
        locale="ca-AD",  # Accept-Language & UI locale set to Catalan (Andorra)
        timezone_id="Europe/Andorra",  # JS Date()/Intl timezone set to Andorra
        geolocation=GeolocationConfig(  # override GPS coords with coordinates for Andorra la Vella
            latitude=42.5077902,
            longitude=1.5210902,
            accuracy=10.0,  # accuracy in meters
        ),
        js_code=[
            # accept_button_script
        ],
        js_only=False,
        word_count_threshold=3,
        excluded_tags=["nav", "footer"],
        cache_mode=CacheMode.BYPASS,
        screenshot=False,
        stream=False,
        wait_for_images=False,  # wait for images to load before capturing the final HTML snapshot
        wait_for=None,  # CSS selector or JavaScript condition to wait for before capturing the HTML
        delay_before_return_html=0.1,  # additional delay (in seconds) before returning the final HTML content
        # page_timeout=60000,
        mean_delay=0.5,
        max_range=0.9,
        semaphore_count=3,
        scroll_delay=0.4,  # delay between scroll actions during full-page scanning
        scan_full_page=False,  # automatically scroll the entire page to load dynamic content
        process_iframes=False,
        remove_overlay_elements=True,  # remove overlay elements (e.g., popups, cookie banners) after page load
        magic=True,  # enable advanced overlay handling
        simulate_user=True,  # simulate user interactions to avoid bot detection
        override_navigator=True,  # override navigator properties for stealth purposes
        capture_mhtml=False,
        image_score_threshold=1,
        exclude_internal_links=True,
        exclude_external_images=True,
        exclude_external_links=True,
        check_robots_txt=True,
        session_id="persistent_session",
        keep_data_attributes=False,
        only_text=True,
        # css_selector=".main-content",
        remove_forms=True,
    )

    dispatcher = MemoryAdaptiveDispatcher(
        rate_limiter=None,
        max_session_permit=len(urls),  # allow all URLs concurrently
        memory_threshold_percent=95.0,
        # monitor=CrawlerMonitor(),
    )

    start_time = time.perf_counter()
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result_container = await crawler.arun_many(urls=urls, config=crawl_config, dispatcher=dispatcher, extraction_strategy=None)
        results = []
        if isinstance(result_container, list):
            results = result_container
        else:
            async for res in result_container:
                results.append(res)
        for result in results:
            if result.success:
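                # extract_product_info is a user-defined helper, not included in this snippet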
                info = extract_product_info(result)
                print(info)
                # print(result.markdown)
                # print(result.cleaned_html)
                # print("Extraction JSON:", result.extracted_content)
                # print(f"Successfully crawled: {result.url}")
                # print(f"Title: {result.metadata.get('title', 'N/A')}")
                # print(f"Description: {result.metadata.get('description', 'N/A')}")

                # print(result.markdown)


                # print(f"Word count: {len(result.markdown.split())}")
                # print(f"Number of internal links: {len(result.links.get('internal', []))}")
                # print(f"Number of external links: {len(result.links.get('external', []))}")
                # print(f"Number of images: {len(result.media.get('images', []))}")
                # print(f"Image links: {result.media.get('images', [])}")
                # print(f"Number of videos: {len(result.metadata.get('videos', []))}")
                print("---")
                # if result.screenshot:
                #     from base64 import b64decode
                #     with open(os.path.join(__location__, "screenshot2.png"), "wb") as f:
                #         f.write(b64decode(result.screenshot))

            else:
                print(f"Failed to crawl: {result.url}")
                print(f"Error: {result.error_message}")
                print("---")

    total_time = time.perf_counter() - start_time
    return total_time, results



async def main():
    urls = [
        "https://www.pyrenees.ad/alimentacio/es/bebe/alimentacion-infantil/leches/hipp-leche-continuacion-combiotik-2-800g",
        "https://www.pyrenees.ad/alimentacio/es/dietetica/productos-para-celiacos/cerveza/ambar-cerveza-00-sgluten-botella-33cl"
        "https://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/huevos/grandes-l-63-a-73gr/el-meu-ou-huevos-l-retractil-12u",
        "https://www.pyrenees.ad/alimentacio/es/frutas-y-verduras/frutas/fruta/manzana-golden-kg",
        "https://www.pyrenees.ad/alimentacio/es/desayunos-dulces-y-pan/bombones/con-licor/maxims-paris-bombones-20u-200g",
        "https://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/quesos/tierno-semicurado-y-curado/1605-herencia-4m-manchego-dop-1kg",
        "https://www.pyrenees.ad/alimentacio/es/platos-preparados/sushi/sushi/913-tartar-salmon-200g",

    ]

    await crawl_parallel(urls)


if __name__ == "__main__":
    asyncio.run(main())

OS

Linux

Python version

3.12

Browser

Chrome

Browser version

No response

Error logs & Screenshots (if applicable)

=== Parallel Crawling with Browser Reuse ===
[BROWSER]. ℹ pre-launch cleanup failed: Command '[['lsof', '-t', '-i:9222']]' returned non-zero exit status 1. 
[INIT].... → Crawl4AI 0.6.3 
[ERROR]... × https://www.pyrenees.ad...paris-bombones-20u-200g  | Error: Unexpected error in _crawl_web at line 744 in _crawl_web 
(../../../../anaconda3/envs/crawl/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: net::ERR_ABORTED at https://www.pyrenees.ad/alimentacio/es/desayunos-dulces-y-pan/bombones/con-licor/maxims-paris-bombones-20u-200g
Call log:
  - navigating to "https://www.pyrenees.ad/alimentacio/es/desayunos-dulces-y-pan/bombones/con-licor/maxims-paris-bombones-20u-200g", waiting until "domcontentloaded"


Code context:
 739                       response = await page.goto(
 740                           url, wait_until=config.wait_until, timeout=config.page_timeout
 741                       )
 742                       redirected_url = page.url
 743                   except Error as e:
 744 →                     raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
 745   
 746                   await self.execute_hook(
 747                       "after_goto", page, context=context, url=url, response=response, config=config
 748                   )
 749    
[ERROR]... × https://www.pyrenees.ad...-huevos-l-retractil-12u  | Error: Unexpected error in _crawl_web at line 744 in _crawl_web 
(../../../../anaconda3/envs/crawl/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: net::ERR_ABORTED at 
https://www.pyrenees.ad/alimentacio/es/dietetica/productos-para-celiacos/cerveza/ambar-cerveza-00-sgluten-botella-33clhttps://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/huevos/grandes-l-63-a-73gr/el-meu-ou-huevos-l
-retractil-12u
Call log:
  - navigating to 
"https://www.pyrenees.ad/alimentacio/es/dietetica/productos-para-celiacos/cerveza/ambar-cerveza-00-sgluten-botella-33clhttps://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/huevos/grandes-l-63-a-73gr/el-meu-ou-huevos-
l-retractil-12u", waiting until "domcontentloaded"


Code context:
 739                       response = await page.goto(
 740                           url, wait_until=config.wait_until, timeout=config.page_timeout
 741                       )
 742                       redirected_url = page.url
 743                   except Error as e:
 744 →                     raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
 745   
 746                   await self.execute_hook(
 747                       "after_goto", page, context=context, url=url, response=response, config=config
 748                   )
 749    
[ERROR]... × https://www.pyrenees.ad...fruta/manzana-golden-kg  | Error: Unexpected error in _crawl_web at line 744 in _crawl_web 
(../../../../anaconda3/envs/crawl/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: net::ERR_ABORTED at https://www.pyrenees.ad/alimentacio/es/frutas-y-verduras/frutas/fruta/manzana-golden-kg
Call log:
  - navigating to "https://www.pyrenees.ad/alimentacio/es/frutas-y-verduras/frutas/fruta/manzana-golden-kg", waiting until "domcontentloaded"


Code context:
 739                       response = await page.goto(
 740                           url, wait_until=config.wait_until, timeout=config.page_timeout
 741                       )
 742                       redirected_url = page.url
 743                   except Error as e:
 744 →                     raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
 745   
 746                   await self.execute_hook(
 747                       "after_goto", page, context=context, url=url, response=response, config=config
 748                   )
 749    
[ERROR]... × https://www.pyrenees.ad...cia-4m-manchego-dop-1kg  | Error: Unexpected error in _crawl_web at line 744 in _crawl_web 
(../../../../anaconda3/envs/crawl/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: net::ERR_ABORTED at https://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/quesos/tierno-semicurado-y-curado/1605-herencia-4m-manchego-dop-1kg
Call log:
  - navigating to "https://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/quesos/tierno-semicurado-y-curado/1605-herencia-4m-manchego-dop-1kg", waiting until "domcontentloaded"


Code context:
 739                       response = await page.goto(
 740                           url, wait_until=config.wait_until, timeout=config.page_timeout
 741                       )
 742                       redirected_url = page.url
 743                   except Error as e:
 744 →                     raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
 745   
 746                   await self.execute_hook(
 747                       "after_goto", page, context=context, url=url, response=response, config=config
 748                   )
 749    
[FETCH]... ↓ https://www.pyrenees.ad/alimentacio/es/platos-preparados/sushi/sushi/913-tartar-salmon-200g          | ✓ | ⏱: 1.19s 
[SCRAPE].. ◆ https://www.pyrenees.ad/alimentacio/es/platos-preparados/sushi/sushi/913-tartar-salmon-200g          | ✓ | ⏱: 0.15s 
[COMPLETE] ● https://www.pyrenees.ad/alimentacio/es/platos-preparados/sushi/sushi/913-tartar-salmon-200g          | ✓ | ⏱: 1.34s 
[FETCH]... ↓ https://www.pyrenees.ad/alimentacio/es/bebe/alim.../leches/hipp-leche-continuacion-combiotik-2-800g  | ✓ | ⏱: 1.81s 
[SCRAPE].. ◆ https://www.pyrenees.ad/alimentacio/es/bebe/alim.../leches/hipp-leche-continuacion-combiotik-2-800g  | ✓ | ⏱: 0.17s 
[COMPLETE] ● https://www.pyrenees.ad/alimentacio/es/bebe/alim.../leches/hipp-leche-continuacion-combiotik-2-800g  | ✓ | ⏱: 1.98s 
Failed to crawl: https://www.pyrenees.ad/alimentacio/es/desayunos-dulces-y-pan/bombones/con-licor/maxims-paris-bombones-20u-200g
Error: Unexpected error in _crawl_web at line 744 in _crawl_web (../../../../anaconda3/envs/crawl/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: net::ERR_ABORTED at https://www.pyrenees.ad/alimentacio/es/desayunos-dulces-y-pan/bombones/con-licor/maxims-paris-bombones-20u-200g
Call log:
  - navigating to "https://www.pyrenees.ad/alimentacio/es/desayunos-dulces-y-pan/bombones/con-licor/maxims-paris-bombones-20u-200g", waiting until "domcontentloaded"


Code context:
 739                       response = await page.goto(
 740                           url, wait_until=config.wait_until, timeout=config.page_timeout
 741                       )
 742                       redirected_url = page.url
 743                   except Error as e:
 744 →                     raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
 745   
 746                   await self.execute_hook(
 747                       "after_goto", page, context=context, url=url, response=response, config=config
 748                   )
 749   
---
Failed to crawl: https://www.pyrenees.ad/alimentacio/es/dietetica/productos-para-celiacos/cerveza/ambar-cerveza-00-sgluten-botella-33clhttps://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/huevos/grandes-l-63-a-73gr/el-meu-ou-huevos-l-retractil-12u
Error: Unexpected error in _crawl_web at line 744 in _crawl_web (../../../../anaconda3/envs/crawl/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: net::ERR_ABORTED at https://www.pyrenees.ad/alimentacio/es/dietetica/productos-para-celiacos/cerveza/ambar-cerveza-00-sgluten-botella-33clhttps://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/huevos/grandes-l-63-a-73gr/el-meu-ou-huevos-l-retractil-12u
Call log:
  - navigating to "https://www.pyrenees.ad/alimentacio/es/dietetica/productos-para-celiacos/cerveza/ambar-cerveza-00-sgluten-botella-33clhttps://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/huevos/grandes-l-63-a-73gr/el-meu-ou-huevos-l-retractil-12u", waiting until "domcontentloaded"


Code context:
 739                       response = await page.goto(
 740                           url, wait_until=config.wait_until, timeout=config.page_timeout
 741                       )
 742                       redirected_url = page.url
 743                   except Error as e:
 744 →                     raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
 745   
 746                   await self.execute_hook(
 747                       "after_goto", page, context=context, url=url, response=response, config=config
 748                   )
 749   
---
Failed to crawl: https://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/quesos/tierno-semicurado-y-curado/1605-herencia-4m-manchego-dop-1kg
Error: Unexpected error in _crawl_web at line 744 in _crawl_web (../../../../anaconda3/envs/crawl/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: net::ERR_ABORTED at https://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/quesos/tierno-semicurado-y-curado/1605-herencia-4m-manchego-dop-1kg
Call log:
  - navigating to "https://www.pyrenees.ad/alimentacio/es/lacteos-y-huevos/quesos/tierno-semicurado-y-curado/1605-herencia-4m-manchego-dop-1kg", waiting until "domcontentloaded"


Code context:
 739                       response = await page.goto(
 740                           url, wait_until=config.wait_until, timeout=config.page_timeout
 741                       )
 742                       redirected_url = page.url
 743                   except Error as e:
 744 →                     raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
 745   
 746                   await self.execute_hook(
 747                       "after_goto", page, context=context, url=url, response=response, config=config
 748                   )
 749   
---
Failed to crawl: https://www.pyrenees.ad/alimentacio/es/frutas-y-verduras/frutas/fruta/manzana-golden-kg
Error: Unexpected error in _crawl_web at line 744 in _crawl_web (../../../../anaconda3/envs/crawl/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: net::ERR_ABORTED at https://www.pyrenees.ad/alimentacio/es/frutas-y-verduras/frutas/fruta/manzana-golden-kg
Call log:
  - navigating to "https://www.pyrenees.ad/alimentacio/es/frutas-y-verduras/frutas/fruta/manzana-golden-kg", waiting until "domcontentloaded"


Code context:
 739                       response = await page.goto(
 740                           url, wait_until=config.wait_until, timeout=config.page_timeout
 741                       )
 742                       redirected_url = page.url
 743                   except Error as e:
 744 →                     raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
 745   
 746                   await self.execute_hook(
 747                       "after_goto", page, context=context, url=url, response=response, config=config
 748                   )
 749   
---
{'product_name': '913-TARTAR SALMON 200G', 'ref_code': None, 'price': '9.98', 'url': 'https://www.pyrenees.ad/alimentacio/es/platos-preparados/sushi/sushi/913-tartar-salmon-200g'}
---
{'product_name': '913-TARTAR SALMON 200G', 'ref_code': None, 'price': '9.98', 'url': 'https://www.pyrenees.ad/alimentacio/es/bebe/alimentacion-infantil/leches/hipp-leche-continuacion-combiotik-2-800g'}
---

Auth0rM0rgan · May 20, 2025

Hi @Auth0rM0rgan, thanks for using C4ai! It looks like there's a bit of confusion here, so let me clear things up:

You’re mixing up two ideas:

  • session_id: This is meant to keep one browser tab/page alive across multiple calls to .arun(), so you can reuse cookies or login state—but that only works for calls that run one after another (sequentially); see the sketch right after this list.
  • arun_many: This is all about running lots of URLs at the same time (in parallel).
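
To make the session_id side concrete, here's a minimal sketch of the sequential pattern it's designed for (the example.com URLs are just placeholders):

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def sequential_with_session():
    config = CrawlerRunConfig(session_id="login_session")
    async with AsyncWebCrawler() as crawler:
        # Both calls reuse the same tab, so cookies and login state carry over.
        await crawler.arun(url="https://example.com/login", config=config)
        await crawler.arun(url="https://example.com/account", config=config)

asyncio.run(sequential_with_session())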

If you pass a single CrawlerRunConfig (with, say, session_id="abc") into arun_many, what’s really happening is all your parallel crawls are fighting over the same page/tab. Playwright really doesn’t like it when multiple things try to use the exact same browser page at once—you’ll get weird errors or unexpected results.

So, unfortunately, you can’t just give a whole batch of URLs and, say, a list of different configs (each with their own session_id) to arun_many—at least not yet! I’ve designed arun_many so it takes a list of URLs and just ONE CrawlerRunConfig, and that single config gets reused for all URLs.

I totally see why you’d want to give different configs for different URLs (so each run is totally isolated and can have its own session/tab/cookies/etc). That’s a good idea, and I do have it in my backlog to make arun_many accept both a list of URLs and a list of configs, matching them up one-to-one. But as of now, it’s not possible.

So for the time being, if you really need each crawl to have its own session (and you need them to run in parallel), you’ll have to spin up your own little task loop and call .arun() yourself for each URL, giving each its own config with a unique session_id.
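
A minimal sketch of that workaround (the helper name crawl_one and the session naming scheme are illustrative, not part of the API; clone() copies a CrawlerRunConfig with the given overrides):

import asyncio
from typing import List

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def crawl_isolated(urls: List[str]):
    base_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        async def crawl_one(i: int, url: str):
            # A unique session_id gives this crawl its own tab instead of
            # sharing one page with the other parallel crawls.
            cfg = base_config.clone(session_id=f"session_{i}")
            return await crawler.arun(url=url, config=cfg)
        return await asyncio.gather(*(crawl_one(i, u) for i, u in enumerate(urls)))

Because every crawl owns its tab, they no longer contend for the same page, at the cost of more open tabs.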

Hope this clears things up a bit! If you have more questions, just let me know, I'm happy to help or clarify.

unclecode · May 21, 2025

We have now implemented support for different configurations per URL in arun_many. Here is an example: https://github.com/unclecode/crawl4ai/blob/main/docs/examples/demo_multi_config_clean.py
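
Roughly, the pattern is to pass a list of configs, each carrying a url_matcher, and arun_many applies the first matching config to each URL. A sketch (the matcher patterns here are illustrative; the linked demo_multi_config_clean.py is the authoritative reference):

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def multi_config_crawl(urls):
    configs = [
        # First match wins: product pages bypass the cache...
        CrawlerRunConfig(url_matcher="*/platos-preparados/*", cache_mode=CacheMode.BYPASS),
        # ...and everything else falls through to this default.
        CrawlerRunConfig(url_matcher="*"),
    ]
    async with AsyncWebCrawler() as crawler:
        return await crawler.arun_many(urls=urls, config=configs)

asyncio.run(multi_config_crawl(["https://example.com/a", "https://example.com/b"]))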

ntohidi · Nov 14, 2025

I'll close this issue, but feel free to continue the conversation and tag me if the issue persists with our latest version, 0.7.7.

ntohidi · Nov 14, 2025