[Bug]: `arun_many` doesn't parallelize tasks when using `raw://`
crawl4ai version
0.4.3b2 (also on 0.4.3b3)
Expected Behavior
Using arun_many with raw:// URLs should parallelize tasks when max_session_permit > 1, just as it does with regular HTTP URLs.
This is based on the features demo in docs/examples/v0_4_3b2_features_demo.py
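For context, here is a minimal sketch of the call pattern in question (the raw:// payload and the max_session_permit value are only illustrative, condensed from the full script below):

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig, MemoryAdaptiveDispatcher

async def expected_parallel_raw() -> None:
    # Many copies of the same in-memory HTML document, passed via the raw:// scheme
    urls = ["raw://<html><body><p>hello</p></body></html>"] * 20
    dispatcher = MemoryAdaptiveDispatcher(max_session_permit=5)  # expect up to 5 concurrent tasks
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
            dispatcher=dispatcher,
        )
    print(f"Crawled {len(results)} raw documents")

asyncio.run(expected_parallel_raw())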
Current Behavior
arun_many does not parallelize tasks when given raw:// URLs, although it works correctly with regular HTTP URLs. With raw HTML content, only one task is active at a time and the max_session_permit setting is effectively ignored.
Using raw HTML (only one active task at a time):
Using regular HTTP URLs (tasks run in parallel as expected):
Is this reproducible?
Yes
Inputs Causing the Bug
No response
Steps to Reproduce
This is based on the demo_memory_dispatcher function in https://github.com/unclecode/crawl4ai/blob/d0586f09a946e8e70e34e7e3b670ca165c7d71ec/docs/examples/v0_4_3b2_features_demo.py
Using pixi, set up the directory as follows:
❯ tree .
.
├── main.py
└── pixi.toml
# pixi.toml
[project]
channels = ["conda-forge"]
description = "Add a short description here"
name = "issue-crawl4ai"
platforms = ["osx-arm64"]
version = "0.1.0"

[tasks]
postinstall = "pip install Crawl4AI==0.4.3b2 && crawl4ai-setup && crawl4ai-doctor"

[dependencies]
python = "3.11.0"
pip = "*"
and the script:
# main.py
import asyncio

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CacheMode,
    CrawlerMonitor,
    CrawlerRunConfig,
    DefaultMarkdownGenerator,
    DisplayMode,
    MemoryAdaptiveDispatcher,
)


async def demo_memory_dispatcher(use_raw: bool) -> None:
    print("\n=== Memory Dispatcher Demo ===")
    try:
        # Configuration
        browser_config = BrowserConfig(headless=True, verbose=False)
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS, markdown_generator=DefaultMarkdownGenerator()
        )

        # Test URLs
        if not use_raw:
            urls = [
                "http://example.com",
                "http://example.org",
                "http://example.net",
            ] * 50
        else:
            dummy_html = """
            <html>
                <body>
                    <div class='crypto-row'>
                        <h2 class='coin-name'>Bitcoin</h2>
                        <span class='coin-price'>$28,000</span>
                    </div>
                    <div class='crypto-row'>
                        <h2 class='coin-name'>Ethereum</h2>
                        <span class='coin-price'>$1,800</span>
                    </div>
                </body>
            </html>
            """
            urls = [f"raw://{dummy_html}"] * 1000

        print("\nInitializing crawler with memory monitoring...")
        async with AsyncWebCrawler(config=browser_config) as crawler:
            monitor = CrawlerMonitor(
                max_visible_rows=10, display_mode=DisplayMode.DETAILED
            )
            dispatcher = MemoryAdaptiveDispatcher(
                memory_threshold_percent=80.0,
                check_interval=0.5,
                max_session_permit=5,
                monitor=monitor,
            )

            print("\nStarting batch crawl...")
            results = await crawler.arun_many(
                urls=urls, config=crawler_config, dispatcher=dispatcher
            )
            print(f"\nCompleted {len(results)} URLs successfully")
    except Exception as e:
        print(f"\nError in memory dispatcher demo: {str(e)}")


async def main():
    """Run all feature demonstrations."""
    print("\nRunning Crawl4ai v0.4.3 Feature Demos\n")

    # Efficiency & Speed Demos
    print("This shows that there are 5 active tasks at the same time")
    await demo_memory_dispatcher(use_raw=False)

    print("This is not working: it shows only 1 active task at a time")
    await demo_memory_dispatcher(use_raw=True)


if __name__ == "__main__":
    asyncio.run(main())
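As a rough cross-check that does not depend on watching the live monitor, a sketch like the one below (my own addition, with arbitrary URL counts and permit values) times the same raw:// batch with max_session_permit=1 versus max_session_permit=5; if the dispatcher really parallelizes raw:// tasks, the second run should finish noticeably faster:

# timing_check.py (sketch; assumes the same crawl4ai version as above)
import asyncio
import time

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig, MemoryAdaptiveDispatcher


async def timed_run(permits: int) -> float:
    # Same raw:// document repeated; only the concurrency limit changes between runs
    urls = ["raw://<html><body><p>hello</p></body></html>"] * 50
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    dispatcher = MemoryAdaptiveDispatcher(max_session_permit=permits)
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        start = time.perf_counter()
        await crawler.arun_many(urls=urls, config=config, dispatcher=dispatcher)
        return time.perf_counter() - start


async def main():
    serial = await timed_run(1)
    parallel = await timed_run(5)
    print(f"max_session_permit=1: {serial:.1f}s, max_session_permit=5: {parallel:.1f}s")


if __name__ == "__main__":
    asyncio.run(main())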
Code snippets
Basically run the main.py script above:
pixi install
pixi run postinstall
pixi run python main.py
OS
macOS
Python version
3.11.0
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response
@aravindkarnam This is an odd case; I have to check this myself.
I am noticing the same behavior for crawls when using file://. Has a fix been implemented for this?
@prachipatil-ds Not yet, but we picked it up in the current sprint. Hopefully a fix will land in the next couple of weeks!
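Until the fix lands, one possible workaround (a sketch of my own, not the library's dispatcher path) is to bypass arun_many and fan out individual crawler.arun calls with asyncio.gather plus a semaphore; this applies to raw:// and file:// inputs alike:

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig


async def crawl_raw_in_parallel(urls, max_concurrency: int = 5):
    # Manual fan-out: a semaphore caps concurrency instead of MemoryAdaptiveDispatcher
    semaphore = asyncio.Semaphore(max_concurrency)
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler() as crawler:
        async def crawl_one(url):
            async with semaphore:
                return await crawler.arun(url=url, config=config)

        return await asyncio.gather(*(crawl_one(u) for u in urls))


# Example usage with raw:// inputs:
# results = asyncio.run(crawl_raw_in_parallel(["raw://<html><body>hi</body></html>"] * 20))

The trade-off is that this loses the memory-adaptive throttling and monitoring that MemoryAdaptiveDispatcher provides.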
This issue has been resolved in the develop branch. I would appreciate it if you all could help test and check it out.
Just tested it on version 0.7.4 and it seems to parallelize properly. Thanks!
Already merged into the main branch and the latest release (0.7.4).