
[Bug]: `arun_many` doesn't parallelize tasks when using `raw://`

Open · mohahf19 opened this issue 11 months ago · 1 comment

crawl4ai version

0.4.3b2 (also on 0.4.3b3)

Expected Behavior

Using `arun_many` with raw URLs should parallelize when `max_session_permit > 1`.

This is based on the features demo in docs/examples/v0_4_3b2_features_demo.py
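The expectation can be illustrated with a plain asyncio sketch that does not involve crawl4ai at all: a semaphore of 5 stands in for `max_session_permit=5`, and the peak number of concurrently active tasks should reach that limit. (This is only an analogy for the expected dispatcher behavior, not crawl4ai's actual implementation.)

```python
import asyncio


async def fake_crawl(sem: asyncio.Semaphore, counters: dict) -> None:
    # Stand-in for one crawl task; the semaphore caps concurrency
    # the way max_session_permit is expected to.
    async with sem:
        counters["active"] += 1
        counters["peak"] = max(counters["peak"], counters["active"])
        await asyncio.sleep(0.01)  # simulate crawl work
        counters["active"] -= 1


async def run_demo() -> int:
    sem = asyncio.Semaphore(5)  # analogous to max_session_permit=5
    counters = {"active": 0, "peak": 0}
    await asyncio.gather(*(fake_crawl(sem, counters) for _ in range(50)))
    return counters["peak"]


if __name__ == "__main__":
    print(asyncio.run(run_demo()))  # peak concurrency reaches 5
```

With regular HTTP URLs the monitor shows exactly this pattern (5 active tasks); with `raw://` URLs it does not.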

Current Behavior

Using `arun_many` doesn't parallelize tasks when using `raw://` URLs, but it works correctly with regular HTTP URLs. When using raw HTML content, only one task is active at a time, ignoring the `max_session_permit` setting.

Using raw HTMLs: (screenshot)

Using regular HTTP URLs: (screenshot)
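For reference, a `raw://` URL simply embeds the page HTML directly after the scheme prefix, so no network fetch should be needed for these tasks, which makes the serialization even more surprising. A minimal sketch of how such URLs are built, mirroring the repro script below:

```python
# Build raw:// URLs the same way the repro script does:
# the HTML payload is inlined after the "raw://" prefix.
dummy_html = "<html><body><p>hello</p></body></html>"
urls = [f"raw://{dummy_html}"] * 4

# Every entry carries the full HTML inline.
assert all(u.startswith("raw://") for u in urls)
print(len(urls))  # 4
```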

Is this reproducible?

Yes

Inputs Causing the Bug

No response

Steps to Reproduce

This is based on the demo_memory_dispatcher method in the demo in https://github.com/unclecode/crawl4ai/blob/d0586f09a946e8e70e34e7e3b670ca165c7d71ec/docs/examples/v0_4_3b2_features_demo.py

Using pixi, set up the directory as follows:

āÆ tree .
.
ā”œā”€ā”€ main.py
└── pixi.toml
# pixi.toml
[project]
channels = ["conda-forge"]
description = "Add a short description here"
name = "issue-crawl4ai"
platforms = ["osx-arm64"]
version = "0.1.0"

[tasks]
postinstall = "pip install Crawl4AI==0.4.3b2 && crawl4ai-setup && crawl4ai-doctor"

[dependencies]
python = "3.11.0"
pip = "*"

and the script:

# main.py
import asyncio

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CacheMode,
    CrawlerMonitor,
    CrawlerRunConfig,
    DefaultMarkdownGenerator,
    DisplayMode,
    MemoryAdaptiveDispatcher,
)


async def demo_memory_dispatcher(use_raw: bool) -> None:
    print("\n=== Memory Dispatcher Demo ===")

    try:
        # Configuration
        browser_config = BrowserConfig(headless=True, verbose=False)
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS, markdown_generator=DefaultMarkdownGenerator()
        )

        # Test URLs
        if not use_raw:
            urls = [
                "http://example.com",
                "http://example.org",
                "http://example.net",
            ] * 50
        else:
            dummy_html = """
            <html>
            <body>
                <div class='crypto-row'>
                <h2 class='coin-name'>Bitcoin</h2>
                <span class='coin-price'>$28,000</span>
                </div>
                <div class='crypto-row'>
                <h2 class='coin-name'>Ethereum</h2>
                <span class='coin-price'>$1,800</span>
                </div>
            </body>
            </html>
            """
            urls = [f"raw://{dummy_html}"] * 1000

        print("\n📈 Initializing crawler with memory monitoring...")
        async with AsyncWebCrawler(config=browser_config) as crawler:
            monitor = CrawlerMonitor(
                max_visible_rows=10, display_mode=DisplayMode.DETAILED
            )

            dispatcher = MemoryAdaptiveDispatcher(
                memory_threshold_percent=80.0,
                check_interval=0.5,
                max_session_permit=5,
                monitor=monitor,
            )

            print("\n🚀 Starting batch crawl...")
            results = await crawler.arun_many(
                urls=urls, config=crawler_config, dispatcher=dispatcher
            )
            print(f"\n✅ Completed {len(results)} URLs successfully")

    except Exception as e:
        print(f"\nāŒ Error in memory dispatcher demo: {str(e)}")


async def main():
    """Run all feature demonstrations."""
    print("\n📊 Running Crawl4ai v0.4.3 Feature Demos\n")

    # Efficiency & Speed Demos
    print("This shows that there are 5 active tasks at the same time")
    await demo_memory_dispatcher(use_raw=False)

    print("This is not working; it shows that there is only one active task at a time")
    await demo_memory_dispatcher(use_raw=True)


if __name__ == "__main__":
    asyncio.run(main())

Code snippets

Basically run the main.py script above:

pixi install
pixi run postinstall
pixi run python main.py

OS

macOS

Python version

3.11.0

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

mohahf19 avatar Jan 25 '25 16:01 mohahf19

@aravindkarnam This is an odd case; I'll have to check this myself.

unclecode avatar Jan 28 '25 15:01 unclecode

I am noticing the same behavior for crawls when using file://. Has a fix been implemented for this?

prachipatil-ds avatar Jul 15 '25 18:07 prachipatil-ds

@prachipatil-ds Not yet, but we picked it up in the current sprint. Hopefully a fix will land in the next couple of weeks!

aravindkarnam avatar Aug 05 '25 09:08 aravindkarnam

This issue has been resolved in the develop branch. I would appreciate it if you all could help test and check it out.

ntohidi avatar Aug 12 '25 08:08 ntohidi

> This issue has been resolved in the develop branch. I would appreciate it if you all could help test and check it out.

Just tested it on version 0.7.4 and it seems to parallelize properly. Thanks!

mohahf19 avatar Aug 17 '25 13:08 mohahf19

Already merged into the main branch; the fix is in the latest version (0.7.4).

ntohidi avatar Aug 18 '25 03:08 ntohidi