[Bug]: A potential fix for issue #1071, plus a new bug found with arun_many() in DFS deep crawling
crawl4ai version
0.7.1
Expected Behavior
Crawl all internal links in DFS order up to the specified depth and maximum number of pages, regardless of whether arun() or arun_many() is used. The behavior should be consistent whether crawling a single URL or multiple URLs concurrently.
Current Behavior
The current dfs_strategy.py implementation always returns only one crawled page, regardless of the specified depth or max_pages. I proposed a potential fix by modifying how the strategy handles link discovery, which allows DFS crawling to work correctly when using arun() on a single URL. Further details are available in the "Error Logs & Screenshots" section.
However, when using arun_many() with multiple URLs, the same issue reappears. The crawler only returns the exact number of URLs initially provided, and does not go deeper into internal links. DFS crawling is effectively limited to just the input URLs in this case.
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import DFSDeepCrawlStrategy

async def main():
    run_cfg = CrawlerRunConfig(
        deep_crawl_strategy=DFSDeepCrawlStrategy(
            max_depth=2,
            include_external=True,
            max_pages=10,
        ),
        verbose=True,
        wait_until="load",
        wait_for_images=False
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(
            url="https://docs.crawl4ai.com/",
            config=run_cfg
        )

    # async with AsyncWebCrawler() as crawler:
    #     results = await crawler.arun_many(
    #         urls=["https://docs.crawl4ai.com/", "https://github.com/unclecode/crawl4ai"],
    #         config=run_cfg
    #     )

    print(f"\n✅ Successfully crawled {len(results)} pages\n")
    for result in results:
        url = result.url
        depth = result.metadata.get("depth", 0)
        print("─" * 60)
        print(f"URL : {url}")
        print(f"Depth : {depth}")
        print("─" * 60)

if __name__ == "__main__":
    asyncio.run(main())
OS
macOS
Python version
3.10.18
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
[INIT].... → Crawl4AI 0.7.1
[FETCH]... ↓ https://docs.crawl4ai.com/ | ✓ | ⏱: 10.28s
[SCRAPE].. ◆ https://docs.crawl4ai.com/ | ✓ | ⏱: 0.03s
[COMPLETE] ● https://docs.crawl4ai.com/ | ✓ | ⏱: 10.32s
✅ Successfully crawled 1 pages
────────────────────────────────────────────────────────────
URL : https://docs.crawl4ai.com/
Depth : 0
────────────────────────────────────────────────────────────
Above is the execution result. The DFS strategy always reports that only 1 page was crawled, not 10.
However, I found a potential way to resolve this bug. In dfs_strategy.py, the reason DFSDeepCrawlStrategy always returns only one page is this line in the original source code:
await self.link_discovery(result, url, depth, visited, new_links, depths)
Here, the global visited set is passed directly into link_discovery. This prematurely marks URLs discovered (but not yet visited) as "visited," causing most links to be filtered out immediately. As a result, the crawler stack quickly becomes empty after the first page.
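To make the effect concrete, here is a minimal, self-contained toy sketch of the pattern (made-up function names and graph, not the actual dfs_strategy.py code): because the discovery step records every child link in the same set that the pop-time check consults, every child is skipped and the crawl ends after the first page.

# Toy illustration only -- not the real crawl4ai code.
def discover_links(page_links, seen, new_links):
    # Stand-in for link_discovery: records each link in the set it is given.
    for link in page_links:
        if link not in seen:
            seen.add(link)
            new_links.append(link)

def toy_dfs(start, graph, max_pages=10):
    visited, stack, crawled = set(), [start], []
    while stack and len(crawled) < max_pages:
        url = stack.pop()
        if url in visited:   # children were already added to visited below,
            continue         # so they are dropped here and never crawled
        visited.add(url)
        crawled.append(url)
        new_links = []
        discover_links(graph.get(url, []), visited, new_links)  # bug: global set
        stack.extend(reversed(new_links))
    return crawled

print(toy_dfs("a", {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}))  # prints ['a']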
I introduced a separate, local discovered set each time links are discovered from a page, leaving the global visited untouched. The correct approach looks like this:
new_links: List[Tuple[str, Optional[str]]] = []
discovered: Set[str] = set()  # local set for per-page deduplication
await self.link_discovery(result, url, depth, discovered, new_links, depths)

for new_url, new_parent in reversed(new_links):
    new_depth = depths.get(new_url, depth + 1)
    stack.append((new_url, new_parent, new_depth))
This change ensures that only duplicate links within the same page are filtered out, while still allowing the crawler to visit all newly discovered links properly.
By the way, to ensure the total number of crawled pages is counted correctly, we also need to check whether max_pages has been reached at the top of the while loop; this check is missing from dfs_strategy.py as well. Adding the if check below at the top of the loop enforces the crawl limit correctly. With both changes in place, running the script again produces the expected result.
while stack and not self._cancel_event.is_set():
    # check max pages here
    if self._pages_crawled >= self.max_pages:
        self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping DFS crawl")
        break
[INIT].... → Crawl4AI 0.7.1
[FETCH]... ↓ https://docs.crawl4ai.com/ | ✓ | ⏱: 9.10s
[SCRAPE].. ◆ https://docs.crawl4ai.com/ | ✓ | ⏱: 0.02s
[COMPLETE] ● https://docs.crawl4ai.com/ | ✓ | ⏱: 9.12s
[FETCH]... ↓ https://docs.crawl4ai.com | ✓ | ⏱: 1.69s
[SCRAPE].. ◆ https://docs.crawl4ai.com | ✓ | ⏱: 0.03s
[COMPLETE] ● https://docs.crawl4ai.com | ✓ | ⏱: 1.72s
[FETCH]... ↓ https://docs.crawl4ai.com/core/ask-ai | ✓ | ⏱: 1.59s
[SCRAPE].. ◆ https://docs.crawl4ai.com/core/ask-ai | ✓ | ⏱: 0.01s
[COMPLETE] ● https://docs.crawl4ai.com/core/ask-ai | ✓ | ⏱: 1.60s
[FETCH]... ↓ https://docs.crawl4ai.com/core/quickstart | ✓ | ⏱: 0.98s
[SCRAPE].. ◆ https://docs.crawl4ai.com/core/quickstart | ✓ | ⏱: 0.02s
[COMPLETE] ● https://docs.crawl4ai.com/core/quickstart | ✓ | ⏱: 1.01s
[FETCH]... ↓ https://docs.crawl4ai.com/core/examples | ✓ | ⏱: 0.90s
[SCRAPE].. ◆ https://docs.crawl4ai.com/core/examples | ✓ | ⏱: 0.02s
[COMPLETE] ● https://docs.crawl4ai.com/core/examples | ✓ | ⏱: 0.92s
[FETCH]... ↓ https://docs.crawl4ai.com/apps | ✓ | ⏱: 0.87s
[SCRAPE].. ◆ https://docs.crawl4ai.com/apps | ✓ | ⏱: 0.01s
[COMPLETE] ● https://docs.crawl4ai.com/apps | ✓ | ⏱: 0.88s
[FETCH]... ↓ https://docs.crawl4ai.com/apps/c4a-script | ✓ | ⏱: 1.93s
[SCRAPE].. ◆ https://docs.crawl4ai.com/apps/c4a-script | ✓ | ⏱: 0.01s
[COMPLETE] ● https://docs.crawl4ai.com/apps/c4a-script | ✓ | ⏱: 1.94s
[FETCH]... ↓ https://docs.crawl4ai.com/apps/llmtxt | ✓ | ⏱: 1.48s
[SCRAPE].. ◆ https://docs.crawl4ai.com/apps/llmtxt | ✓ | ⏱: 0.02s
[COMPLETE] ● https://docs.crawl4ai.com/apps/llmtxt | ✓ | ⏱: 1.50s
[FETCH]... ↓ https://docs.crawl4ai.com/core/installation | ✓ | ⏱: 0.96s
[SCRAPE].. ◆ https://docs.crawl4ai.com/core/installation | ✓ | ⏱: 0.01s
[COMPLETE] ● https://docs.crawl4ai.com/core/installation | ✓ | ⏱: 0.98s
[FETCH]... ↓ https://docs.crawl4ai.com/core/docker-deployment | ✓ | ⏱: 1.25s
[SCRAPE].. ◆ https://docs.crawl4ai.com/core/docker-deployment | ✓ | ⏱: 0.04s
[COMPLETE] ● https://docs.crawl4ai.com/core/docker-deployment | ✓ | ⏱: 1.29s
✅ Successfully crawled 10 pages
────────────────────────────────────────────────────────────
URL : https://docs.crawl4ai.com/
Depth : 0
────────────────────────────────────────────────────────────
URL : https://docs.crawl4ai.com
Depth : 1
────────────────────────────────────────────────────────────
URL : https://docs.crawl4ai.com/core/ask-ai
Depth : 2
────────────────────────────────────────────────────────────
URL : https://docs.crawl4ai.com/core/quickstart
Depth : 2
────────────────────────────────────────────────────────────
URL : https://docs.crawl4ai.com/core/examples
Depth : 2
────────────────────────────────────────────────────────────
URL : https://docs.crawl4ai.com/apps
Depth : 2
────────────────────────────────────────────────────────────
URL : https://docs.crawl4ai.com/apps/c4a-script
Depth : 2
────────────────────────────────────────────────────────────
URL : https://docs.crawl4ai.com/apps/llmtxt
Depth : 2
────────────────────────────────────────────────────────────
URL : https://docs.crawl4ai.com/core/installation
Depth : 2
────────────────────────────────────────────────────────────
URL : https://docs.crawl4ai.com/core/docker-deployment
Depth : 1
────────────────────────────────────────────────────────────
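For reference, here is the same toy model from the earlier sketch (reusing its discover_links helper; again made-up names, not the real dfs_strategy.py) with both changes applied, so the page-budget check and the local discovered set can be seen working together:

def toy_dfs_fixed(start, graph, max_pages=10):
    visited, stack, crawled = set(), [start], []
    while stack:
        # check the page budget at the top of the loop
        if len(crawled) >= max_pages:
            break
        url = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        crawled.append(url)
        new_links = []
        discovered = set()  # local, per-page deduplication only
        discover_links(graph.get(url, []), discovered, new_links)
        stack.extend(reversed(new_links))  # global visited is checked at pop time
    return crawled

print(toy_dfs_fixed("a", {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}))  # ['a', 'b', 'd', 'c']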
I think this is a potential way to resolve the bug in DFSCrawler, but I found another issue during further testing.
After updating dfs_strategy.py as described above, I tried using arun_many() to crawl multiple URLs concurrently. However, it only crawls the URLs I passed in and does not go deeper.
Even though the logs show that deeper internal URLs are indeed being fetched (especially easy to confirm when I reuse the same test URL), the number of results returned is always exactly the number of URLs I provided.
I believe the issue lies in the arun_many() method, because I also tested it with BFS and observed the same behavior.
Below is my test using arun_many().
async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(
        urls=["https://docs.crawl4ai.com/", "https://github.com/unclecode/crawl4ai"],
        config=run_cfg
    )
[INIT].... → Crawl4AI 0.7.1
[FETCH]... ↓ https://github.com/unclecode/crawl4ai | ✓ | ⏱: 6.98s
[SCRAPE].. ◆ https://github.com/unclecode/crawl4ai | ✓ | ⏱: 0.13s
[COMPLETE] ● https://github.com/unclecode/crawl4ai | ✓ | ⏱: 7.12s
[FETCH]... ↓ https://github.com | ✓ | ⏱: 3.17s
[SCRAPE].. ◆ https://github.com | ✓ | ⏱: 0.05s
[COMPLETE] ● https://github.com | ✓ | ⏱: 3.22s
[FETCH]... ↓ https://docs.crawl4ai.com/ | ✓ | ⏱: 11.34s
[SCRAPE].. ◆ https://docs.crawl4ai.com/ | ✓ | ⏱: 0.01s
[COMPLETE] ● https://docs.crawl4ai.com/ | ✓ | ⏱: 11.36s
[FETCH]... ↓ https://github.com/login | ✓ | ⏱: 1.15s
[SCRAPE].. ◆ https://github.com/login | ✓ | ⏱: 0.01s
[COMPLETE] ● https://github.com/login | ✓ | ⏱: 1.16s
[FETCH]... ↓ https://docs.crawl4ai.com | ✓ | ⏱: 2.22s
[SCRAPE].. ◆ https://docs.crawl4ai.com | ✓ | ⏱: 0.03s
[COMPLETE] ● https://docs.crawl4ai.com | ✓ | ⏱: 2.25s
[FETCH]... ↓ https://github.com/features/copilot | ✓ | ⏱: 1.94s
[SCRAPE].. ◆ https://github.com/features/copilot | ✓ | ⏱: 0.09s
[COMPLETE] ● https://github.com/features/copilot | ✓ | ⏱: 2.03s
[FETCH]... ↓ https://github.com/features/spark | ✓ | ⏱: 1.73s
[SCRAPE].. ◆ https://github.com/features/spark | ✓ | ⏱: 0.05s
[COMPLETE] ● https://github.com/features/spark | ✓ | ⏱: 1.78s
[FETCH]... ↓ https://docs.crawl4ai.com/core/ask-ai | ✓ | ⏱: 2.05s
[SCRAPE].. ◆ https://docs.crawl4ai.com/core/ask-ai | ✓ | ⏱: 0.01s
[COMPLETE] ● https://docs.crawl4ai.com/core/ask-ai | ✓ | ⏱: 2.06s
[FETCH]... ↓ https://github.com/features/models | ✓ | ⏱: 1.30s
[SCRAPE].. ◆ https://github.com/features/models | ✓ | ⏱: 0.04s
[COMPLETE] ● https://github.com/features/models | ✓ | ⏱: 1.34s
[FETCH]... ↓ https://docs.crawl4ai.com/core/quickstart | ✓ | ⏱: 1.47s
[SCRAPE].. ◆ https://docs.crawl4ai.com/core/quickstart | ✓ | ⏱: 0.03s
[COMPLETE] ● https://docs.crawl4ai.com/core/quickstart | ✓ | ⏱: 1.49s
[FETCH]... ↓ https://github.com/security/advanced-security | ✓ | ⏱: 1.69s
[SCRAPE].. ◆ https://github.com/security/advanced-security | ✓ | ⏱: 0.04s
[COMPLETE] ● https://github.com/security/advanced-security | ✓ | ⏱: 1.73s
✅ Successfully crawled 2 pages
────────────────────────────────────────────────────────────
URL : https://docs.crawl4ai.com/
Depth : 0
────────────────────────────────────────────────────────────
URL : https://github.com/unclecode/crawl4ai
Depth : 0
────────────────────────────────────────────────────────────
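Until arun_many() is fixed, a possible stopgap (only a sketch, not a fix for arun_many() itself) is to launch one arun() deep crawl per seed URL and gather them concurrently. Note that max_pages then applies per seed rather than globally, and if concurrent arun() calls on a single crawler instance cause problems, one crawler per seed can be used instead:

import asyncio
from crawl4ai import AsyncWebCrawler

async def deep_crawl_each(urls, run_cfg):
    # Workaround sketch: one DFS deep crawl per seed URL via arun(),
    # instead of a single arun_many() call.
    async with AsyncWebCrawler() as crawler:
        per_seed = await asyncio.gather(
            *(crawler.arun(url=u, config=run_cfg) for u in urls)
        )
    # Flatten the per-seed result lists into one list of crawled pages.
    return [result for results in per_seed for result in results]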
I encountered the same problem
Have you solved it?
Same issue here with the Docker setup. I'm gonna lose my patience with crawl4ai...
Still facing this issue; with DFS it only crawls the single provided link.