[Bug]: Deep crawling is exceeding the `max_pages` parameter and continuing beyond the set limit.

Open Harinib-Kore opened this issue 7 months ago • 8 comments

crawl4ai version

0.5.0.post4

Expected Behavior

The crawler should stop after crawling 10 pages, as specified by max_pages=10. len(results) should report a maximum of 10 pages.

Current Behavior

When using AsyncWebCrawler with BestFirstCrawlingStrategy and setting max_pages=10, the crawler unexpectedly crawls more pages than specified. In my case, it crawled 17 pages instead of stopping at 10.

Is this reproducible?

Yes

Inputs Causing the Bug

I think adding the filter chain is causing this bug.

Steps to Reproduce


Code snippets

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter
)

async def main():
    filter_chain = FilterChain([
        DomainFilter(allowed_domains=["kore.ai"]),
        URLPatternFilter(patterns=["*use-cases*", "*blog*", "*research*"]),
        ContentTypeFilter(allowed_types=["text/html"])
    ])
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=3,
            include_external=False,
            max_pages=10,  # the crawl should stop after 10 pages
            filter_chain=filter_chain
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://kore.ai/use-cases", config=config)
        print(f"Crawled {len(results)} pages in total")

if __name__ == "__main__":
    asyncio.run(main())

OS

Linux

Python version

3.9.7

Browser

Chrome

Browser version

131.0.6778.139

Error logs & Screenshots (if applicable)

[Screenshot attached]

Harinib-Kore avatar Apr 02 '25 07:04 Harinib-Kore

@aravindkarnam

Harinib-Kore avatar Apr 02 '25 07:04 Harinib-Kore

Thanks for reporting this, @Harinib-Kore

After reviewing the code based on your report, we can confirm this is indeed a bug in how max_pages is handled within the deep crawling strategies when URLs are processed in batches. The FilterChain itself is not the direct cause.

@ntohidi @aravindkarnam

The core issue lies in the timing of the max_pages check relative to processing results from crawler.arun_many.

  1. Current Behavior: The check if self._pages_crawled >= self.max_pages: occurs before a batch of URLs is processed. The counter self._pages_crawled is then incremented inside the loop that handles the results of that batch.
  2. Problem: The counter can therefore exceed the max_pages limit while a batch is being processed, but the crawl only stops after that batch has been fully handled, just before the next batch starts. This causes the observed overshoot (see the sketch below).
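
In simplified form, the problematic flow looks roughly like this (an illustrative sketch, not the actual crawl4ai source; next_level_urls, batch_size, and link_discovery are placeholder names):

# Illustrative sketch of the buggy pattern, not the real strategy code.
while next_level_urls:
    # The limit is checked only before a batch is dispatched...
    if self._pages_crawled >= self.max_pages:
        break
    batch, next_level_urls = next_level_urls[:batch_size], next_level_urls[batch_size:]
    results = await crawler.arun_many(batch, config=config)
    for result in results:
        # ...but the counter only grows here, inside the batch, so the whole
        # batch is processed even after the counter passes max_pages.
        self._pages_crawled += 1
        self.link_discovery(result, next_level_urls)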

Required Fix Hint:

We need to add an additional check for the max_pages limit immediately after self._pages_crawled is incremented inside the inner result-processing loops (async for result in ... or similar) within all relevant deep crawling strategies (like BFSDeepCrawlStrategy, BestFirstCrawlingStrategy, etc.).

Implementation Steps:

  1. Locate the self._pages_crawled += 1 line within the result loops in each deep crawl strategy's run methods (e.g., _arun_batch, _arun_stream, _arun_best_first).
  2. Immediately after incrementing the counter, add a check:
    if self._pages_crawled >= self.max_pages:
        self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping processing.")
        break # Exit the inner loop handling the current batch/stream
    
  3. Ensure link_discovery is only called for a result if the limit has not yet been reached; the break takes care of the remaining results in the batch (see the loop sketch after this list).
  4. Apply this fix consistently across all deep crawling strategies that implement max_pages.
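
A minimal sketch of the corrected inner result loop, assuming the names used above (self._pages_crawled, self.max_pages, link_discovery) plus illustrative placeholders (batch_results, next_level_urls); the actual strategy code may differ in detail:

# Illustrative sketch of the fixed inner loop, not the real strategy code.
for result in batch_results:  # results returned by crawler.arun_many for this batch
    self._pages_crawled += 1
    # New: check the limit immediately after incrementing, not only before
    # the next batch is dispatched.
    if self._pages_crawled >= self.max_pages:
        self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping processing.")
        break
    # Only discover new links while still under the limit; the break above
    # skips the remaining results in the batch.
    self.link_discovery(result, next_level_urls)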

This change will ensure the strategies stop processing and yielding results much closer to the specified max_pages limit.

unclecode avatar Apr 02 '25 12:04 unclecode

Hi @unclecode and team. Thanks for crawl4ai, really useful library!

Just confirming the issue reported in #927 also affects max_depth in BFSDeepCrawlStrategy. I'm seeing the crawler exceed the depth limit specified.

Using version 0.5.0.post8.

My discovery strategy config looks like this:

discovery_deep_strategy = BFSDeepCrawlStrategy(
    max_depth=args.page_limit, # args.page_limit set via --page-limit CLI arg 
    # max_pages=args.page_limit + 1, # Commented out when using max_depth
    include_external=False,
    filter_chain=FilterChain(
        [
            DomainFilter(allowed_domains=[domain]), # domain var set earlier
            URLPatternFilter(patterns=[PAGINATION_URL_PATTERN]), # pattern var set earlier
        ]
    ),
)

Even with max_depth set low (e.g., --page-limit 1 or --page-limit 5), Stage 1 crawls hundreds of pages, ignoring the limit and making discovery very slow.

Glad you found the cause related to batch processing checks. Just wanted to confirm max_depth seems affected too. Looking forward to the fix! Let me know if more info helps.

JamesN-dev avatar Apr 05 '25 01:04 JamesN-dev

Fixed! It’ll be included in a future release.

ntohidi avatar May 02 '25 10:05 ntohidi

@ntohidi any updates on when this will be released?

Dev4011 avatar May 15 '25 13:05 Dev4011

@Dev4011 can you check the 2025-APR-1 branch if it's working?

and sorry for my late reply, been busy with the community and maintaining as well 👩🏻‍💻

ntohidi avatar May 27 '25 08:05 ntohidi

Sure, I can test this out 👍

Dev4011 avatar May 27 '25 09:05 Dev4011

Hi @ntohidi - I tested this out on https://github.com and https://docs.crawl4ai.com/ - it works, but every time it stops at one URL fewer than max_pages (including the start/base URL). For example: if I set max_pages=10, it stops at 9; if I set max_pages=14, it stops at 13.

Dev4011 avatar May 27 '25 13:05 Dev4011

@Dev4011 Closing this issue since the main problem, max_pages not acting as a fail-safe against the crawler running out of bounds, is now fixed.

As for the crawler stopping one page short of max_pages, we'll track that as a separate issue. Of course, feel free to send in a PR for it!

aravindkarnam avatar Jul 13 '25 13:07 aravindkarnam