
[Bug]: A potential fix for issue #1071, but found a new bug with arun_many() in DFSCrawl

Open YJHJACK opened this issue 4 months ago • 4 comments

crawl4ai version

0.7.1

Expected Behavior

Crawl all internal links in DFS order up to the specified depth and maximum number of pages, regardless of whether arun() or arun_many() is used. The behavior should be consistent whether crawling a single URL or multiple URLs concurrently.

Current Behavior

The current dfs_strategy.py implementation always returns only one crawled page, regardless of the specified depth or max_pages. I proposed a potential fix by modifying how the strategy handles link discovery, which allows DFS crawling to work correctly when using arun() on a single URL. Further details are available in the "Error Logs & Screenshots" section.

However, when using arun_many() with multiple URLs, the same issue reappears. The crawler only returns the exact number of URLs initially provided, and does not go deeper into internal links. DFS crawling is effectively limited to just the input URLs in this case.

Is this reproducible?

Yes

Inputs Causing the Bug


Steps to Reproduce


Code snippets

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import DFSDeepCrawlStrategy

async def main():
    run_cfg = CrawlerRunConfig(
        deep_crawl_strategy=DFSDeepCrawlStrategy(
            max_depth=2,
            include_external=True,
            max_pages=10,
        ),
        verbose=True,
        wait_until="load",
        wait_for_images=False
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(
            url="https://docs.crawl4ai.com/",
            config=run_cfg
        )

    # async with AsyncWebCrawler() as crawler:
    #     results = await crawler.arun_many(
    #         urls=["https://docs.crawl4ai.com/", "https://github.com/unclecode/crawl4ai"],
    #         config=run_cfg
    #     )

        print(f"\nāœ… succussed crawling {len(results)} pages\n")

        for result in results:
            url = result.url
            depth = result.metadata.get("depth", 0)
            print("—" * 60)
            print(f"URL   : {url}")
            print(f"Depth : {depth}")

        print("—" * 60)

if __name__ == "__main__":
    asyncio.run(main())

OS

macOS

Python version

3.10.18

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

[INIT].... → Crawl4AI 0.7.1
[FETCH]... ↓ https://docs.crawl4ai.com/ | ✓ | ⏱: 10.28s
[SCRAPE].. ◆ https://docs.crawl4ai.com/ | ✓ | ⏱: 0.03s
[COMPLETE] ● https://docs.crawl4ai.com/ | ✓ | ⏱: 10.32s

✅ Successfully crawled 1 pages

————————————————————————————————————————————————————————————
URL   : https://docs.crawl4ai.com/
Depth : 0
————————————————————————————————————————————————————————————

Above is the execution result. The DFS crawl always reports only 1 crawled page instead of the expected 10.

However, I found a potential way to resolve this bug. In dfs_strategy.py, the reason DFSDeepCrawlStrategy always returns only one page is this line in the original source code:

await self.link_discovery(result, url, depth, visited, new_links, depths)

Here, the global visited set is passed directly into link_discovery. This prematurely marks URLs discovered (but not yet visited) as "visited," causing most links to be filtered out immediately. As a result, the crawler stack quickly becomes empty after the first page.
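To make the failure mode concrete, here is a small toy loop (plain Python, not crawl4ai code) that mimics the pattern described above: sharing one set between "already crawled" and "merely discovered" leaves nothing left to crawl after the first page.

# Toy illustration of the bug: one global set used for both purposes.
visited = set()
stack = [("A", None, 0)]
links = {"A": ["B", "C"], "B": ["D"], "C": []}

while stack:
    url, parent, depth = stack.pop()
    if url in visited:              # skip pages we have already crawled
        continue
    visited.add(url)
    print("crawled", url)
    for link in links.get(url, []):
        if link not in visited:     # discovery-time filter, as in link_discovery
            visited.add(link)       # the bug: marks a merely discovered link as visited
            stack.append((link, url, depth + 1))

# Prints only "crawled A": B and C already look "visited" when popped and are skipped.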

I introduced a separate, local discovered set each time links are discovered from a page, leaving the global visited untouched. The correct approach looks like this:

new_links: List[Tuple[str, Optional[str]]] = []
discovered: Set[str] = set()  # local set for per-page deduplication
await self.link_discovery(result, url, depth, discovered, new_links, depths)

for new_url, new_parent in reversed(new_links):
    new_depth = depths.get(new_url, depth + 1)
    stack.append((new_url, new_parent, new_depth))

This change ensures only duplicate links within the same page are filtered out, while still allowing the crawler to visit all new discovered links properly.

In addition, to make the total number of crawled pages respect the limit, max_pages has to be checked at the top of the while loop; this check is also missing from dfs_strategy.py. Adding the if statement below at the start of the loop in the same file enforces the crawl limit correctly. With both changes in place, re-running the script produces the expected result (see the log after the snippet).

            while stack and not self._cancel_event.is_set():
                # check max pages here
                if self._pages_crawled >= self.max_pages:
                    self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping DFS crawl")
                    break

[INIT].... → Crawl4AI 0.7.1
[FETCH]... ↓ https://docs.crawl4ai.com/ | ✓ | ⏱: 9.10s
[SCRAPE].. ◆ https://docs.crawl4ai.com/ | ✓ | ⏱: 0.02s
[COMPLETE] ● https://docs.crawl4ai.com/ | ✓ | ⏱: 9.12s
[FETCH]... ↓ https://docs.crawl4ai.com | ✓ | ⏱: 1.69s
[SCRAPE].. ◆ https://docs.crawl4ai.com | ✓ | ⏱: 0.03s
[COMPLETE] ● https://docs.crawl4ai.com | ✓ | ⏱: 1.72s
[FETCH]... ↓ https://docs.crawl4ai.com/core/ask-ai | ✓ | ⏱: 1.59s
[SCRAPE].. ◆ https://docs.crawl4ai.com/core/ask-ai | ✓ | ⏱: 0.01s
[COMPLETE] ● https://docs.crawl4ai.com/core/ask-ai | ✓ | ⏱: 1.60s
[FETCH]... ↓ https://docs.crawl4ai.com/core/quickstart | ✓ | ⏱: 0.98s
[SCRAPE].. ◆ https://docs.crawl4ai.com/core/quickstart | ✓ | ⏱: 0.02s
[COMPLETE] ● https://docs.crawl4ai.com/core/quickstart | ✓ | ⏱: 1.01s
[FETCH]... ↓ https://docs.crawl4ai.com/core/examples | ✓ | ⏱: 0.90s
[SCRAPE].. ◆ https://docs.crawl4ai.com/core/examples | ✓ | ⏱: 0.02s
[COMPLETE] ● https://docs.crawl4ai.com/core/examples | ✓ | ⏱: 0.92s
[FETCH]... ↓ https://docs.crawl4ai.com/apps | ✓ | ⏱: 0.87s
[SCRAPE].. ◆ https://docs.crawl4ai.com/apps | ✓ | ⏱: 0.01s
[COMPLETE] ● https://docs.crawl4ai.com/apps | ✓ | ⏱: 0.88s
[FETCH]... ↓ https://docs.crawl4ai.com/apps/c4a-script | ✓ | ⏱: 1.93s
[SCRAPE].. ◆ https://docs.crawl4ai.com/apps/c4a-script | ✓ | ⏱: 0.01s
[COMPLETE] ● https://docs.crawl4ai.com/apps/c4a-script | ✓ | ⏱: 1.94s
[FETCH]... ↓ https://docs.crawl4ai.com/apps/llmtxt | ✓ | ⏱: 1.48s
[SCRAPE].. ◆ https://docs.crawl4ai.com/apps/llmtxt | ✓ | ⏱: 0.02s
[COMPLETE] ● https://docs.crawl4ai.com/apps/llmtxt | ✓ | ⏱: 1.50s
[FETCH]... ↓ https://docs.crawl4ai.com/core/installation | ✓ | ⏱: 0.96s
[SCRAPE].. ◆ https://docs.crawl4ai.com/core/installation | ✓ | ⏱: 0.01s
[COMPLETE] ● https://docs.crawl4ai.com/core/installation | ✓ | ⏱: 0.98s
[FETCH]... ↓ https://docs.crawl4ai.com/core/docker-deployment | ✓ | ⏱: 1.25s
[SCRAPE].. ◆ https://docs.crawl4ai.com/core/docker-deployment | ✓ | ⏱: 0.04s
[COMPLETE] ● https://docs.crawl4ai.com/core/docker-deployment | ✓ | ⏱: 1.29s

✅ Successfully crawled 10 pages

————————————————————————————————————————————————————————————
URL   : https://docs.crawl4ai.com/
Depth : 0
————————————————————————————————————————————————————————————
URL   : https://docs.crawl4ai.com
Depth : 1
————————————————————————————————————————————————————————————
URL   : https://docs.crawl4ai.com/core/ask-ai
Depth : 2
————————————————————————————————————————————————————————————
URL   : https://docs.crawl4ai.com/core/quickstart
Depth : 2
————————————————————————————————————————————————————————————
URL   : https://docs.crawl4ai.com/core/examples
Depth : 2
————————————————————————————————————————————————————————————
URL   : https://docs.crawl4ai.com/apps
Depth : 2
————————————————————————————————————————————————————————————
URL   : https://docs.crawl4ai.com/apps/c4a-script
Depth : 2
————————————————————————————————————————————————————————————
URL   : https://docs.crawl4ai.com/apps/llmtxt
Depth : 2
————————————————————————————————————————————————————————————
URL   : https://docs.crawl4ai.com/core/installation
Depth : 2
————————————————————————————————————————————————————————————
URL   : https://docs.crawl4ai.com/core/docker-deployment
Depth : 1
————————————————————————————————————————————————————————————
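For readers who want the whole picture, below is a minimal, self-contained sketch of the control flow with both changes applied. It is not the real dfs_strategy.py: crawl_page and discover_links are hypothetical stand-ins for the crawler's fetch step and for link_discovery, and the cancel-event handling is omitted.

from typing import Dict, List, Optional, Set, Tuple

async def dfs_loop_sketch(
    start_url: str,
    crawl_page,          # hypothetical: async callable that fetches one page
    discover_links,      # hypothetical: async callable mirroring link_discovery
    max_pages: int,
) -> List[object]:
    """Hedged sketch of the patched DFS control flow, not the library code."""
    visited: Set[str] = set()
    depths: Dict[str, int] = {start_url: 0}
    stack: List[Tuple[str, Optional[str], int]] = [(start_url, None, 0)]
    results: List[object] = []

    while stack:
        # Change 2: enforce the page budget at the top of the loop.
        if len(results) >= max_pages:
            break

        url, parent, depth = stack.pop()
        if url in visited:
            continue
        visited.add(url)

        result = await crawl_page(url)
        results.append(result)

        # Change 1: collect links into a per-page `discovered` set so the
        # global `visited` only ever contains URLs that were actually crawled.
        new_links: List[Tuple[str, Optional[str]]] = []
        discovered: Set[str] = set()
        await discover_links(result, url, depth, discovered, new_links, depths)

        for new_url, new_parent in reversed(new_links):
            new_depth = depths.get(new_url, depth + 1)
            if new_url not in visited:
                stack.append((new_url, new_parent, new_depth))

    return results

The key invariant is that visited only ever holds URLs that were actually crawled, while per-page duplicates are handled by the throwaway discovered set.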

I think this is a potential way to resolve the bug in DFSCrawler, but I found another issue during further testing.

After updating dfs_strategy.py as described above, I tried using arun_many() to crawl multiple URLs concurrently. However, it only crawls the URLs I passed in and does not go deeper.

Even though the logs show deeper URLs actually being fetched (especially visible when I reuse the same test URL), the number of results returned is always exactly the number of URLs I passed in.

I believe the issue lies in the arun_many() method, because I also tested it with BFS and observed the same behavior.

Below is my test using arun_many().

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=["https://docs.crawl4ai.com/", "https://github.com/unclecode/crawl4ai"],
            config=run_cfg
        )

[INIT].... → Crawl4AI 0.7.1
[FETCH]... ↓ https://github.com/unclecode/crawl4ai | ✓ | ⏱: 6.98s
[SCRAPE].. ◆ https://github.com/unclecode/crawl4ai | ✓ | ⏱: 0.13s
[COMPLETE] ● https://github.com/unclecode/crawl4ai | ✓ | ⏱: 7.12s
[FETCH]... ↓ https://github.com | ✓ | ⏱: 3.17s
[SCRAPE].. ◆ https://github.com | ✓ | ⏱: 0.05s
[COMPLETE] ● https://github.com | ✓ | ⏱: 3.22s
[FETCH]... ↓ https://docs.crawl4ai.com/ | ✓ | ⏱: 11.34s
[SCRAPE].. ◆ https://docs.crawl4ai.com/ | ✓ | ⏱: 0.01s
[COMPLETE] ● https://docs.crawl4ai.com/ | ✓ | ⏱: 11.36s
[FETCH]... ↓ https://github.com/login | ✓ | ⏱: 1.15s
[SCRAPE].. ◆ https://github.com/login | ✓ | ⏱: 0.01s
[COMPLETE] ● https://github.com/login | ✓ | ⏱: 1.16s
[FETCH]... ↓ https://docs.crawl4ai.com | ✓ | ⏱: 2.22s
[SCRAPE].. ◆ https://docs.crawl4ai.com | ✓ | ⏱: 0.03s
[COMPLETE] ● https://docs.crawl4ai.com | ✓ | ⏱: 2.25s
[FETCH]... ↓ https://github.com/features/copilot | ✓ | ⏱: 1.94s
[SCRAPE].. ◆ https://github.com/features/copilot | ✓ | ⏱: 0.09s
[COMPLETE] ● https://github.com/features/copilot | ✓ | ⏱: 2.03s
[FETCH]... ↓ https://github.com/features/spark | ✓ | ⏱: 1.73s
[SCRAPE].. ◆ https://github.com/features/spark | ✓ | ⏱: 0.05s
[COMPLETE] ● https://github.com/features/spark | ✓ | ⏱: 1.78s
[FETCH]... ↓ https://docs.crawl4ai.com/core/ask-ai | ✓ | ⏱: 2.05s
[SCRAPE].. ◆ https://docs.crawl4ai.com/core/ask-ai | ✓ | ⏱: 0.01s
[COMPLETE] ● https://docs.crawl4ai.com/core/ask-ai | ✓ | ⏱: 2.06s
[FETCH]... ↓ https://github.com/features/models | ✓ | ⏱: 1.30s
[SCRAPE].. ◆ https://github.com/features/models | ✓ | ⏱: 0.04s
[COMPLETE] ● https://github.com/features/models | ✓ | ⏱: 1.34s
[FETCH]... ↓ https://docs.crawl4ai.com/core/quickstart | ✓ | ⏱: 1.47s
[SCRAPE].. ◆ https://docs.crawl4ai.com/core/quickstart | ✓ | ⏱: 0.03s
[COMPLETE] ● https://docs.crawl4ai.com/core/quickstart | ✓ | ⏱: 1.49s
[FETCH]... ↓ https://github.com/security/advanced-security | ✓ | ⏱: 1.69s
[SCRAPE].. ◆ https://github.com/security/advanced-security | ✓ | ⏱: 0.04s
[COMPLETE] ● https://github.com/security/advanced-security | ✓ | ⏱: 1.73s

✅ Successfully crawled 2 pages

————————————————————————————————————————————————————————————
URL   : https://docs.crawl4ai.com/
Depth : 0
————————————————————————————————————————————————————————————
URL   : https://github.com/unclecode/crawl4ai
Depth : 0
————————————————————————————————————————————————————————————
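Until arun_many() respects deep-crawl strategies, a possible interim workaround (a sketch only, not a fix; it assumes the per-URL arun() path behaves as shown above after the patch, and that a single crawler instance tolerates concurrent arun() calls) is to run one deep crawl per seed URL and merge the results:

import asyncio
from crawl4ai import AsyncWebCrawler

async def deep_crawl_many(urls, run_cfg):
    # Workaround sketch: one deep crawl per seed URL via arun(); each call
    # returns a list of results when a deep_crawl_strategy is configured.
    async with AsyncWebCrawler() as crawler:
        per_seed = await asyncio.gather(
            *(crawler.arun(url=u, config=run_cfg) for u in urls)
        )
    # Flatten the per-seed result lists into one combined list.
    return [r for seed_results in per_seed for r in seed_results]

Note that max_depth and max_pages then apply per seed URL rather than across the whole batch, so this is not a drop-in replacement for arun_many(); if concurrent arun() calls on one crawler are a concern, the seeds can be crawled sequentially or with one crawler instance each.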

YJHJACK · Jul 29 '25 08:07

I encountered the same problem

765144989 · Aug 08 '25 09:08

Have you solved it?

765144989 · Aug 08 '25 09:08

Same issue here with the Docker setup. I'm going to lose my patience with crawl4ai...

mhobbec · Aug 18 '25 18:08

Still facing this issue; with DFS it only crawls the single provided link.

kubre · Sep 19 '25 12:09