[Bug]: Deep crawling is exceeding the `max_pages` parameter and continuing beyond the set limit.
crawl4ai version
0.5.0.post4
Expected Behavior
The crawler should stop after crawling 10 pages, as specified by `max_pages=10`. `len(results)` should report a maximum of 10 pages.
Current Behavior
When using `AsyncWebCrawler` with `BestFirstCrawlingStrategy` and setting `max_pages=10`, the crawler unexpectedly crawls more pages than specified. In my case, it crawled 17 pages instead of stopping at 10.
Is this reproducible?
Yes
Inputs Causing the Bug
I think adding the filter chain is causing this bug.
Steps to Reproduce
Code snippets
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter,
)


async def main():
    filter_chain = FilterChain([
        DomainFilter(allowed_domains=["kore.ai"]),
        URLPatternFilter(patterns=["*use-cases*", "*blog*", "*research*"]),
        ContentTypeFilter(allowed_types=["text/html"]),
    ])

    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=3,
            include_external=False,
            max_pages=10,
            filter_chain=filter_chain,
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True,
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://kore.ai/use-cases", config=config)
        print(f"Crawled {len(results)} pages in total")


if __name__ == "__main__":
    asyncio.run(main())
```
OS
Linux
Python version
3.9.7
Browser
Chrome
Browser version
131.0.6778.139
Error logs & Screenshots (if applicable)
@aravindkarnam
Thanks for reporting this, @Harinib-Kore
After reviewing the code based on your report, we can confirm this is indeed a bug in how `max_pages` is handled within the deep crawling strategies when processing URLs in batches. Contrary to your suspicion, the `FilterChain` is not the direct cause.
@ntohidi @aravindkarnam
The core issue lies in the timing of the `max_pages` check relative to processing results from `crawler.arun_many`.
- **Current Behavior:** The check `if self._pages_crawled >= self.max_pages:` typically occurs before processing a batch of URLs. The counter `self._pages_crawled` is then incremented within the loop handling the results of that batch.
- **Problem:** This allows the counter to exceed the `max_pages` limit while a batch is being processed, but the crawl only stops after that batch is fully processed and before the next batch starts. This leads to the observed overshoot (see the sketch below).
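To make the timing problem concrete, here is a toy sketch. The `ToyBatchCrawler` class, its batch size, and the URLs are all invented for illustration; this is not crawl4ai's actual internals, only the same check-before-batch pattern described above:

```python
import asyncio


class ToyBatchCrawler:
    """Invented example class -- not part of crawl4ai."""

    def __init__(self, max_pages: int, batch_size: int = 7):
        self.max_pages = max_pages
        self.batch_size = batch_size
        self._pages_crawled = 0

    async def _crawl_batch(self, urls):
        await asyncio.sleep(0)  # stand-in for a real batch crawl
        return urls

    async def run(self, urls):
        results = []
        while urls:
            # The limit is checked only here, before each batch...
            if self._pages_crawled >= self.max_pages:
                break
            batch, urls = urls[:self.batch_size], urls[self.batch_size:]
            for result in await self._crawl_batch(batch):
                # ...but the counter grows inside the loop, so one batch can
                # push it well past max_pages before the next check runs.
                self._pages_crawled += 1
                results.append(result)
        return results


async def main():
    crawler = ToyBatchCrawler(max_pages=10)
    pages = await crawler.run([f"https://example.com/{i}" for i in range(30)])
    print(f"Crawled {len(pages)} pages, limit was 10")  # prints 14 here


asyncio.run(main())
```

With a batch size of 7, the second batch finishes at 14 pages before the between-batch check ever sees the limit, mirroring the 17-instead-of-10 overshoot reported above.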
Required Fix Hint:
We need to add an additional check for the `max_pages` limit immediately after `self._pages_crawled` is incremented inside the inner result-processing loops (`async for result in ...` or similar) within all relevant deep crawling strategies (such as `BFSDeepCrawlStrategy`, `BestFirstCrawlingStrategy`, etc.).
Implementation Steps:
- Locate the `self._pages_crawled += 1` line within the result loops in each deep crawl strategy's run methods (e.g., `_arun_batch`, `_arun_stream`, `_arun_best_first`).
- Immediately after incrementing the counter, add a check:
  ```python
  if self._pages_crawled >= self.max_pages:
      self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping processing.")
      break  # Exit the inner loop handling the current batch/stream
  ```
- Ensure `link_discovery` is only called if the limit hasn't been reached by that specific result. The `break` handles subsequent results in the batch.
- Apply this fix consistently across all deep crawling strategies that implement `max_pages`.
This change will ensure the strategies stop processing and yielding results much closer to the specified `max_pages` limit.
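For reference, here is the same toy crawler from the earlier sketch with the suggested in-loop check added. Again, this is only an illustrative sketch of the fix's shape, not the actual strategy code:

```python
import asyncio


class ToyBatchCrawlerFixed:
    """Invented example class -- not part of crawl4ai."""

    def __init__(self, max_pages: int, batch_size: int = 7):
        self.max_pages = max_pages
        self.batch_size = batch_size
        self._pages_crawled = 0

    async def _crawl_batch(self, urls):
        await asyncio.sleep(0)  # stand-in for a real batch crawl
        return urls

    async def run(self, urls):
        results = []
        while urls and self._pages_crawled < self.max_pages:
            batch, urls = urls[:self.batch_size], urls[self.batch_size:]
            for result in await self._crawl_batch(batch):
                self._pages_crawled += 1
                results.append(result)
                # Re-check right after incrementing, so the limit is enforced
                # mid-batch rather than only between batches.
                if self._pages_crawled >= self.max_pages:
                    break
        return results


async def main():
    crawler = ToyBatchCrawlerFixed(max_pages=10)
    pages = await crawler.run([f"https://example.com/{i}" for i in range(30)])
    print(f"Crawled {len(pages)} pages")  # now stops at exactly 10


asyncio.run(main())
```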
Hi @unclecode and team. Thanks for crawl4ai, really useful library!
Just confirming the issue reported in #927 also affects `max_depth` in `BFSDeepCrawlStrategy`. I'm seeing the crawler exceed the specified depth limit.
Using version 0.5.0.post8.
My discovery strategy config looks like this:
```python
discovery_deep_strategy = BFSDeepCrawlStrategy(
    max_depth=args.page_limit,  # args.page_limit set via --page-limit CLI arg
    # max_pages=args.page_limit + 1,  # Commented out when using max_depth
    include_external=False,
    filter_chain=FilterChain(
        [
            DomainFilter(allowed_domains=[domain]),  # domain var set earlier
            URLPatternFilter(patterns=[PAGINATION_URL_PATTERN]),  # pattern var set earlier
        ]
    ),
)
```
Even with `max_depth` set low (e.g., `--page-limit 1` or `--page-limit 5`), Stage 1 crawls hundreds of pages, ignoring the limit and making discovery very slow.
Glad you found the cause related to batch processing checks. Just wanted to confirm max_depth seems affected too. Looking forward to the fix! Let me know if more info helps.
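For what it's worth, here is a generic depth-limited BFS sketch showing where a `max_depth` check is conventionally enforced, namely at link-discovery time, before a child URL is ever queued. This is not crawl4ai's implementation; `bfs_depth_limited`, `get_links`, and the toy link graph are all made up for illustration:

```python
from collections import deque


def bfs_depth_limited(start_url, get_links, max_depth):
    """Visit pages breadth-first, never discovering links past max_depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not discover links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Toy link graph: each page links to children one level deeper.
links = {
    "a": ["a/1", "a/2"],
    "a/1": ["a/1/x"],
    "a/2": ["a/2/y"],
}
print(bfs_depth_limited("a", lambda u: links.get(u, []), max_depth=1))
# ['a', 'a/1', 'a/2'] -- the depth-2 pages are never queued
```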
Fixed! It'll be included in a future release.
@ntohidi any updates on when this will be released?
@Dev4011 can you check whether it's working on the `2025-APR-1` branch?
And sorry for my late reply, I've been busy with the community and maintenance as well.
Sure, I can test this out.
Hi @ntohidi - I tested this out on https://github.com and https://docs.crawl4ai.com/. It works, but every time it stops at 1 URL less than `max_pages` (including the start/base URL). Example: if I set `max_pages=10`, it stops at 9; if I set `max_pages=14`, it stops at 13.
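In case it helps anyone re-check this after the fix lands, here is a minimal verification sketch based on the repro at the top of the issue; the target URL, `max_depth` value, and the `count_pages` helper are my own choices, not anything prescribed by the library:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy


async def count_pages(start_url: str, max_pages: int) -> int:
    # Same configuration shape as the original repro, minus the filters.
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=3,
            include_external=False,
            max_pages=max_pages,
        ),
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(start_url, config=config)
        return len(results)


async def main():
    for limit in (10, 14):
        crawled = await count_pages("https://docs.crawl4ai.com/", limit)
        # As reported above, this currently prints limit - 1 (9 and 13).
        print(f"max_pages={limit} -> crawled {crawled} pages")


asyncio.run(main())
```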
@Dev4011 Closing this issue, since the main problem of `max_pages` not acting as a fail-safe against the crawler running out of bounds is now fixed.
As for the crawler stopping one page short of `max_pages`, we'll track that as a separate issue. Of course, feel free to send in a PR for this!