[Bug]: Cannot Perform Deep Crawling When Crawl4AI is Running via Docker
crawl4ai version
0.6.1
Expected Behavior
Crawl4AI should perform deep crawling when using CrawlerRunConfig with a BFSDeepCrawlStrategy, even if the Crawl4AI server is running in a Docker container. I expected it to crawl the root page, discover internal links, and then crawl those internal pages up to the specified depth.
Current Behavior
When running Crawl4AI via Docker and using the Python SDK externally, no data is crawled beyond the root URL; `result.success` is always False and no markdown content is returned.
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
```python
import asyncio
import os

from crawl4ai import (
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
    DefaultMarkdownGenerator,
    PruningContentFilter,
)
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.docker_client import Crawl4aiDockerClient


async def main():
    # Load custom JS if needed
    browser_script_path = os.path.abspath(
        os.path.join(os.path.dirname(__file__), "crawl4ai_browser_script.js")
    )
    js_code = None
    if os.path.exists(browser_script_path):
        with open(browser_script_path, "r", encoding="utf-8") as f:
            js_code = f.read()

    async with Crawl4aiDockerClient(
        base_url="http://localhost:11235",  # Change from localhost if calling from another container
        verbose=True,
    ) as client:
        # If authentication is truly disabled in config.yml, you can skip this.
        # Otherwise, uncomment below.
        await client.authenticate("[email protected]")

        print("--- Running Deep Crawl (Streaming Mode) ---")
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            excluded_tags=["header", "footer", "nav"],
            remove_overlay_elements=True,
            exclude_external_links=True,
            exclude_social_media_links=True,
            scroll_delay=0.5,
            scan_full_page=True,
            page_timeout=18000000,  # milliseconds
            js_code=js_code,
            stream=True,
            deep_crawl_strategy=BFSDeepCrawlStrategy(
                max_depth=2,
                include_external=False,
                max_pages=20,
            ),
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter(
                    threshold=0.48,
                    threshold_type="fixed",
                    min_word_threshold=0,
                ),
                options={"ignore_links": True},
            ),
        )

        async for result in await client.crawl(
            ["https://www.dienmayxanh.com/"],
            browser_config=BrowserConfig(headless=True),
            crawler_config=crawler_config,
        ):
            print(f"\nURL: {result.url}")
            print(f"Success: {result.success}")
            if result.markdown:
                print(result.markdown.raw_markdown[:300])  # Print first 300 chars
                internal_links: list[dict] = result.links.get("internal", [])
                print(f"Found {len(internal_links)} internal links")
            else:
                print("No markdown content found.")


if __name__ == "__main__":
    asyncio.run(main())
```
OS
macOS
Python version
3.10.12
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response
I was able to reproduce the issue; the error message is `'async_generator' object has no attribute 'status_code'`. This happens with `stream=True`, and the problem is in the docker_client.
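That error suggests the client is calling `.status_code` on an async generator instead of on an actual HTTP response. For comparison, here is a minimal sketch of how a streaming crawl can be consumed, assuming httpx and the `/crawl/stream` endpoint; the function name and payload shape are illustrative, not the actual `docker_client` internals:

```python
# Hypothetical sketch (not the actual docker_client code): when stream=True,
# the request should go to /crawl/stream and be read as an NDJSON stream,
# not handled like a regular single-response request.
import json

import httpx


async def stream_crawl(base_url: str, payload: dict):
    """Yield one result dict per NDJSON line from /crawl/stream."""
    async with httpx.AsyncClient(base_url=base_url) as client:
        async with client.stream("POST", "/crawl/stream", json=payload) as response:
            # response here is a real httpx.Response, so status checks work;
            # the reported error implies the client skipped a path like this
            # and probed .status_code on an async generator instead.
            response.raise_for_status()
            async for line in response.aiter_lines():
                if line.strip():
                    yield json.loads(line)
```

A caller would then consume it with `async for item in stream_crawl("http://localhost:11235", payload): ...`.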
This issue is related to https://github.com/unclecode/crawl4ai/issues/1066
Fix: request /crawl with stream: true issue #1074
In this PR, I'd like to propose automatically redirecting requests with `stream: true` to the `/crawl/stream` endpoint using a 307 temporary redirect, which preserves the request method and body during redirection (a sketch follows below). I'm open to feedback and willing to contribute any necessary code improvements. @aravindkarnam
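For illustration, a minimal sketch of the proposed redirect, assuming a FastAPI server; the payload key carrying the stream flag is an assumption, and the real `/crawl` handler does much more than this:

```python
# Minimal sketch of the proposed 307 redirect, assuming a FastAPI server.
# The payload key carrying the stream flag ("crawler_config" -> "stream")
# is an assumption about the request shape, not confirmed internals.
from fastapi import FastAPI, Request
from fastapi.responses import RedirectResponse

app = FastAPI()


@app.post("/crawl")
async def crawl(request: Request):
    body = await request.json()
    if body.get("crawler_config", {}).get("stream") is True:
        # 307 (unlike 302) requires clients to preserve the request
        # method and body, so the POST payload survives the redirect.
        return RedirectResponse(url="/crawl/stream", status_code=307)
    return {"detail": "non-streaming handling elided in this sketch"}
```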
Related to https://github.com/unclecode/crawl4ai/issues/1205
Fixed in the newest release, 0.7.6. Please pull the latest image.