
[Bug]: Cannot Perform Deep Crawling When Crawl4AI is Running via Docker

Open • QuangTQV opened this issue 7 months ago • 3 comments

crawl4ai version

0.6.1

Expected Behavior

Crawl4AI should perform deep crawling when using CrawlerRunConfig with a BFSDeepCrawlStrategy, even if the Crawl4AI server is running in a Docker container. I expected it to crawl the root page, discover internal links, and then crawl those internal pages up to the specified depth.
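
For reference, the same deep-crawl configuration works when Crawl4AI runs in-process rather than behind the Docker server. A minimal sketch, assuming the documented AsyncWebCrawler API (this snippet is not part of the original report):

import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy


async def local_deep_crawl():
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        # Crawl the root page, then follow internal links breadth-first
        # up to depth 2, stopping after 20 pages.
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2, max_pages=20),
    )
    async with AsyncWebCrawler() as crawler:
        # With stream=False (the default), arun returns a list of results,
        # one per crawled page.
        results = await crawler.arun("https://www.dienmayxanh.com/", config=config)
        for result in results:
            print(result.url, result.success)


asyncio.run(local_deep_crawl())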

Current Behavior

When running Crawl4AI via Docker and using the Python SDK externally, nothing beyond the root URL is crawled: result.success is always False and no markdown content is returned.

Is this reproducible?

Yes

Inputs Causing the Bug

No response

Steps to Reproduce

No response

Code snippets

import asyncio
import os

from crawl4ai import (
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
    DefaultMarkdownGenerator,
    PruningContentFilter,
)
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.docker_client import Crawl4aiDockerClient


async def main():
    # Load custom JS if needed
    browser_script_path = os.path.abspath(
        os.path.join(os.path.dirname(__file__), "crawl4ai_browser_script.js")
    )
    js_code = None
    if os.path.exists(browser_script_path):
        with open(browser_script_path, "r", encoding="utf-8") as f:
            js_code = f.read()

    async with Crawl4aiDockerClient(
        base_url="http://localhost:11235",  # change from localhost if calling from another container
        verbose=True,
    ) as client:
        # If authentication is disabled in config.yml, this call can be skipped
        await client.authenticate("[email protected]")

        print("--- Running Deep Crawl (Streaming Mode) ---")

        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            excluded_tags=["header", "footer", "nav"],
            remove_overlay_elements=True,
            exclude_external_links=True,
            exclude_social_media_links=True,
            scroll_delay=0.5,
            scan_full_page=True,
            page_timeout=18000000,  # in milliseconds
            js_code=js_code,
            stream=True,
            deep_crawl_strategy=BFSDeepCrawlStrategy(
                max_depth=2,
                include_external=False,
                max_pages=20,
            ),
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter(
                    threshold=0.48,
                    threshold_type="fixed",
                    min_word_threshold=0,
                ),
                options={"ignore_links": True},
            ),
        )

        async for result in await client.crawl(
            ["https://www.dienmayxanh.com/"],
            browser_config=BrowserConfig(headless=True),
            crawler_config=crawler_config,
        ):
            print(f"\nāœ… URL: {result.url}")
            print(f"Success: {result.success}")
            if result.markdown:
                print(result.markdown.raw_markdown[:300])  # Print first 300 chars
                internal_links: list[dict] = result.links.get("internal", [])
                print(f"Found {len(internal_links)} internal links")
            else:
                print("No markdown content found.")


if __name__ == "__main__":
    asyncio.run(main())

OS

macOS

Python version

3.10.12

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

QuangTQV • Apr 29 '25

I was able to reproduce the issue; the error message is "'async_generator' object has no attribute 'status_code'". This happens with stream=True, and the problem is in the docker_client.

aravindkarnam • May 14 '25
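
For illustration, the error above is what Python raises when an async generator is treated as an HTTP response object. A minimal, self-contained sketch of the failure mode (hypothetical code, not the actual docker_client source):

import asyncio


async def fake_stream():
    # Stands in for the streaming crawl response; purely illustrative.
    for chunk in ("result-1", "result-2"):
        yield chunk


async def demo():
    response = fake_stream()  # an async generator, not a response object
    try:
        # This line reproduces the reported error:
        # AttributeError: 'async_generator' object has no attribute 'status_code'
        print(response.status_code)
    except AttributeError as exc:
        print(exc)
    # The generator has to be consumed with `async for` instead:
    async for item in response:
        print(item)


asyncio.run(demo())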

This issue is related to https://github.com/unclecode/crawl4ai/issues/1066

aravindkarnam • May 14 '25

This issue is related to #1066.

Fix: request /crawl with stream: true issue #1074. In this PR, I propose automatically redirecting requests with stream: true to the /crawl/stream endpoint using a 307 temporary redirect, which ensures the request method and body are preserved during redirection. I'm open to feedback and willing to contribute any necessary code improvements. @aravindkarnam

zinodynn • May 14 '25
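
A 307 is the right status code for this because, unlike 301/302, it obliges the client to repeat the request to the new location with the same method and body. A minimal sketch of the proposed redirect, assuming a FastAPI server and assuming the stream flag arrives inside the serialized crawler_config (the real Crawl4AI server schema may differ):

from fastapi import FastAPI, Request
from fastapi.responses import RedirectResponse

app = FastAPI()


@app.post("/crawl")
async def crawl(request: Request):
    payload = await request.json()
    # Hypothetical lookup; the real payload layout may nest this differently.
    wants_stream = payload.get("crawler_config", {}).get("stream", False)
    if wants_stream:
        # 307 Temporary Redirect preserves the POST method and JSON body,
        # so the client re-issues the identical request to /crawl/stream.
        return RedirectResponse(url="/crawl/stream", status_code=307)
    return {"results": []}  # placeholder for the non-streaming path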

Related to https://github.com/unclecode/crawl4ai/issues/1205

peterkostadinov • Jul 08 '25

Fixed in the newest release, 0.7.6. Please pull the latest image.

ntohidi • Oct 22 '25