
[Bug]: deep crawl crawls same url multiple times

Open eliaweiss opened this issue 8 months ago • 12 comments

crawl4ai version

0.5.0.post4

Expected Behavior

The crawler should not crawl the same URL more than once.

Current Behavior

I see in the log that the same URLs are crawled multiple times:

INFO:crawlApp:Processed page 1: https://out-door.co.il
[FETCH]... ↓ https://out-door.co.il/... | Status: True | Time: 3.48s
[SCRAPE].. ◆ https://out-door.co.il/... | Time: 0.237s
[COMPLETE] ● https://out-door.co.il/... | Status: True | Total: 3.72s
....
[FETCH]... ↓ https://out-door.co.il/... | Status: True | Time: 25.25s
[SCRAPE].. ◆ https://out-door.co.il/... | Time: 0.34s
[COMPLETE] ● https://out-door.co.il/... | Status: True | Total: 25.60s
INFO:app.process_crawl:✅ Result already saved: https://out-door.co.il/

Is this reproducible?

Yes

Inputs Causing the Bug


Steps to Reproduce


Code snippets


OS

ubuntu

Python version

3.12.3

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

eliaweiss avatar Mar 16 '25 23:03 eliaweiss

I was able to mitigate this with the following URLFilter:


from crawl4ai import URLFilter

class FirstTimeURLFilter(URLFilter):
    """Filter that accepts a URL the first time it is seen and rejects subsequent occurrences."""

    __slots__ = ("seen_urls",)

    def __init__(self):
        super().__init__(name="FirstTimeURLFilter")
        self.seen_urls = set()

    def apply(self, url: str) -> bool:
        if url in self.seen_urls:
            # print(f"❌  URL already seen: {url}")
            self._update_stats(False)
            return False
        else:
            # print(f"✅ URL not seen: {url}")
            self.seen_urls.add(url)
            self._update_stats(True)
            return True
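
For reference, here is a minimal sketch of how such a filter could be wired into a deep crawl run, reusing the FilterChain/BFSDeepCrawlStrategy setup shown elsewhere in this thread (illustrative only; the seed URL and max_depth are placeholders, and FirstTimeURLFilter is the class defined above):

import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain

async def main():
    # Put the deduplicating filter at the front of the chain so every
    # candidate URL is screened before any other filter runs.
    filter_chain = FilterChain(filters=[FirstTimeURLFilter()])

    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            include_external=False,
            filter_chain=filter_chain,
        ),
        stream=True,
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://example.com/", config=config):
            print(result.url)

asyncio.run(main())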

eliaweiss avatar Mar 17 '25 11:03 eliaweiss

@eliaweiss This issue has already been identified and patched in commit #f78c46446ba. It's already available in the next branch and is already part of 0.5.0.post4, so perhaps update and try again.

Just to be sure, can you share some of the URLs which it crawled over and over again? For example, what are the values in self.seen_urls in the FirstTimeURLFilter filter?

aravindkarnam avatar Mar 17 '25 12:03 aravindkarnam

@aravindkarnam I don't think it was fixed, since I'm using 0.5.0.post4.

Here is code that reproduces it:


import asyncio
from hashlib import md5
import os
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    URLPatternFilter,
)

async def process_result(result):
    print(f"URL: {result.url}")
    print(f"Depth: {result.metadata.get('depth', 0)}")

    # # use md5 to create a unique filename
    # filename = md5(result.url.encode()).hexdigest()
    # # save to file
    # with open(f"results/{filename}.md", "w") as f:
    #     f.write(result.markdown)


async def main():
    # Create a filter to exclude URLs ending with image file extensions
    image_filter = URLPatternFilter(patterns=["*.jpg", "*.jpeg", "*.png", "*.gif", "*.bmp"])
    # Create a filter chain that uses the image filter
    filter_chain = FilterChain(filters=[image_filter])
    # Configure the deep crawl (max_depth=10)
    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        # semaphore_count=1,
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=10,
            include_external=False,
            # Maximum number of pages to crawl (optional)
            max_pages=500,
            filter_chain=filter_chain
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        stream=True,  # Enable streaming
        verbose=True
    )

    # create result folder
    os.makedirs("results", exist_ok=True)
    # delete all files in the results folder
    for file in os.listdir("results"):
        os.remove(os.path.join("results", file))

    page_count = 0
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://out-door.co.il/", config=config):
            page_count += 1
            print(f"page_count {page_count}")
            await process_result(result)

    print(f"Crawled {page_count} pages in total")

if __name__ == "__main__":
    asyncio.run(main())

eliaweiss avatar Mar 17 '25 16:03 eliaweiss

RCA

Deep crawl visits the same URL multiple times due to the asynchronous nature of execution and the ordering of these two steps:

  1. URLs are checked against the visited set when they are added to the queue.
  2. URLs are added to the visited set only when they are popped from the queue for crawling.

These two steps don't happen in strict sequence under parallel execution.

This gives the desired behaviour in sequential execution, but in parallel execution duplicates can be queued again before a URL is dequeued and marked as visited.

Solution

Add a URL to the visited set as soon as it's added to the queue, not when it's dequeued for crawling. If crawling of that URL fails for any reason, we already retry with exponential backoff up to 3 times, so that takes care of retries. There is no need to wait until dequeuing to mark it visited: once a URL is in the queue it's as good as visited, and it will either succeed or fail, which will be captured in result.status.
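
A minimal sketch of the idea (illustrative only; the queue/worker/fetch_links names are hypothetical and do not mirror the crawl4ai internals):

import asyncio

async def fetch_links(url: str) -> list[str]:
    # Stand-in for the real fetch + link extraction step.
    await asyncio.sleep(0.01)
    return [f"{url.rstrip('/')}/child-{i}" for i in range(2)] if url.count("/") < 4 else []

async def crawl(seed: str, workers: int = 4) -> set[str]:
    queue: asyncio.Queue[str] = asyncio.Queue()
    visited: set[str] = {seed}  # the seed is marked visited the moment it is enqueued
    await queue.put(seed)

    async def worker() -> None:
        while True:
            url = await queue.get()
            try:
                for link in await fetch_links(url):
                    if link not in visited:
                        # Fix: mark visited at enqueue time, not at dequeue time,
                        # so concurrent workers cannot queue the same URL twice.
                        visited.add(link)
                        await queue.put(link)
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    await queue.join()
    for t in tasks:
        t.cancel()
    return visited

print(len(asyncio.run(crawl("https://example.com"))))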

aravindkarnam avatar Mar 21 '25 08:03 aravindkarnam

I see this is in progress, but wanted to share I am also experiencing this behavior and am happy to test potential fixes.

castlenthesky avatar Mar 21 '25 13:03 castlenthesky

@castlenthesky Thanks a bunch. It's in https://github.com/unclecode/crawl4ai/tree/2025-MAR-ALPHA-1 branch. Update here if you run into any further issues while testing.

aravindkarnam avatar Mar 21 '25 13:03 aravindkarnam

@aravindkarnam

I'm new to contributing and excited to help if I can. Can you help me confirm my testing method?

  1. Cloned the 2025-MAR-ALPHA-1 branch into a new folder.
  2. Created a virtual environment.
  3. Ran pip install -r requirements.txt to install dependencies.
  4. Ran pip install -e . to install the module.
  5. Copied my script into the new repo and ran it.

Following the above steps, I continue to have circular crawl logic.

I have attached my script in hopes it helps.

import asyncio
import uuid
from re import Pattern
from typing import List, Union

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import (
    ContentTypeFilter,
    DomainFilter,
    FilterChain,
    URLPatternFilter,
)
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
from src.utilities.file_management import initialize_directory


async def initiate_crawl(
    target_url: str,
    target_directory: str,
    browser_profile: str = "default",
    css_selector: str = "*",
    excluded_tags: List[
        str
    ] = [],  # ["script", "style", "form", "header", "footer", "aside", "nav"],
    max_depth: int = 5,
    target_keywords: Union[List[str], None] = None,
    url_patterns: Union[str, List[str | Pattern]] = ["*"],
    allowed_file_types: Union[List[str], None] = None,
    allowed_domains: List[str] = [],
    blocked_domains: List[str] = ["*old.*"],
):
    # Initialize the output directory
    initialize_directory(target_directory)

    crawl_id = uuid.uuid4()
    print(f"Starting crawl with ID: {crawl_id}")

    filters = [
        DomainFilter(allowed_domains=allowed_domains, blocked_domains=blocked_domains),
        URLPatternFilter(patterns=url_patterns),
    ]
    if allowed_file_types:
        filters.append(ContentTypeFilter(allowed_types=allowed_file_types))
    filter_chain = FilterChain(filters)

    keyword_scorer = (
        KeywordRelevanceScorer(keywords=target_keywords, weight=0.7)
        if target_keywords
        else None
    )
    deep_crawl_strategy = BestFirstCrawlingStrategy(
        max_depth=max_depth,
        include_external=False,
        filter_chain=filter_chain,
        url_scorer=keyword_scorer,
    )
    config = CrawlerRunConfig(
        css_selector=css_selector,
        excluded_tags=excluded_tags,
        cache_mode=CacheMode.BYPASS,
        scraping_strategy=LXMLWebScrapingStrategy(),
        deep_crawl_strategy=deep_crawl_strategy,
        stream=True,
        verbose=True,
    )

    crawler = AsyncWebCrawler(
        config=BrowserConfig(
            headless=True,
            verbose=True,
            use_managed_browser=True,
            browser_type="chromium",
            user_data_dir=f"browser_profiles/{browser_profile}",
        )
    )

    page_count = 0
    results = []
    visited_urls = set()

    # Execute the crawl
    await crawler.start()
    async for result in await crawler.arun(target_url, config=config):
        try:
            if result.success:
                # append url and file details
                results.append(result.url)
                # check if url has already been visited
                if result.url in visited_urls:
                    print(f"⛔ {result.url}")
                    continue
                # log visited urls
                visited_urls.add(result.url)
                # log page count
                page_count += 1
                print(f"✅ {page_count}: {result.url}")
        except Exception as e:
            print(f"Error processing {result.url}: {e}")
    await crawler.close()

    # Analyze the results
    print(f"Crawled {len(result_list)} pages")

    print(result_list[0])


if __name__ == "__main__":
    asyncio.run(
        initiate_crawl(
            target_url="https://microsoft.github.io/autogen/stable/",  # Updated to be a list of URLs.
            target_directory="assets/autogen",
            max_depth=4,
            excluded_tags=["script", "style", "form", "header"],
            url_patterns=[
                "*microsoft.github.io/autogen/stable*"
            ],  # stable documentation only
        )
    )

castlenthesky avatar Mar 21 '25 15:03 castlenthesky

@unclecode @aravindkarnam Another issue observed is that the max_pages functionality isn't working correctly. Even after setting the limit to 10, it continues crawling beyond 10 pages during deep crawling.

Harinib-Kore avatar Apr 01 '25 20:04 Harinib-Kore

@Harinib-Kore I haven't seen this issue so far. Can you share a code example?

Also, could you log this as a separate issue with all the details? Since this one is already tied to a patch, it will get closed automatically as soon as the PR is merged to main. It also helps to track these issues independently.

aravindkarnam avatar Apr 02 '25 07:04 aravindkarnam

@castlenthesky Sorry about the late reply, I missed your comment. Yes, you got that right! Try it out and let me know your findings. And thanks for taking the time to help us with testing!

aravindkarnam avatar Apr 02 '25 07:04 aravindkarnam

@Harinib-Kore I haven't seen this issue so far. Can you share a code example?

Also, could you log this as a separate issue with all the details? Since this one is already tied to a patch, it will get closed automatically as soon as the PR is merged to main. It also helps to track these issues independently.

I have raised a bug; please refer to https://github.com/unclecode/crawl4ai/issues/927

Harinib-Kore avatar Apr 02 '25 07:04 Harinib-Kore

@castlenthesky Sorry about the late reply, I missed your comment. Yes, you got that right! Try it out and let me know your findings. And thanks for taking the time to help us with testing!

@aravindkarnam no worries. I was able to pull your branch and test. The issue has been resolved.

I also realized I had an issue in my implementation of allowed_domains and allowed_urls: I wasn't applying the URL filter correctly. For anybody else having issues, I changed my implementation to the following, which resolved the URL duplication (thank you @eliaweiss for pointing me in the right direction!):

# Example using dataclass config
config = CrawlConfig(
  crawl_pipeline_name="example_user_docs",
  browser_profile="default",
  target_url="https://docs.example.io/",
  target_directory="assets_test/example/site",
  max_depth=8,
  max_pages=500,
  excluded_tags=["script", "style", "form", "header"],
  allowed_domains=["docs.example.io"],
  blocked_url_patterns=[
    "*/legacy-docs.example.io/*",
    "*docs.example.io/blog/*",
    "*docs.example.io/cart/*",
    "*docs.example.io/about/*",
    "*docs.example.io/guides/labs/*",
    "*docs.example.io/example-plus/*",
  ],
)

This config gets passed to the appropriate constructors, but the URL filter was key:

def build_filter_chain(config: FilterConfig) -> FilterChain:
  """Build a filter chain for URL filtering during crawling.

  Args:
      config: Filter configuration object

  Returns:
      FilterChain: Configured filter chain for crawler
  """
  filters = [
    FirstTimeURLFilter(),
    DomainFilter(allowed_domains=config.allowed_domains, blocked_domains=config.blocked_domains),
  ]

  # Add allowed URL patterns filter (whitelist)
  if config.allowed_url_patterns:
    patterns = _normalize_patterns(config.allowed_url_patterns)
    allowed_filter = URLPatternFilter(patterns=patterns)
    filters.append(allowed_filter)

  # Add blocked URL patterns filter (blacklist)
  if config.blocked_url_patterns:
    patterns = _normalize_patterns(config.blocked_url_patterns)
    blocked_filter = URLPatternFilter(patterns=patterns)

    # Create an inverse filter to block matching URLs
    class InverseFilter(URLFilter):
      def __init__(self, base_filter: URLFilter):
        super().__init__(name=f"Inverse{base_filter.name}")
        self.base_filter = base_filter

      def apply(self, url: str) -> bool:
        return not self.base_filter.apply(url)

    inverse_blocked_filter = InverseFilter(blocked_filter)
    filters.append(inverse_blocked_filter)

  if config.allowed_file_types:
    filters.append(ContentTypeFilter(allowed_types=config.allowed_file_types))

  logger.info(f"Filter chain created with {len(filters)} filters")
  return FilterChain(filters)
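
For completeness, the resulting chain is then handed to the deep crawl strategy in the usual way (a sketch assuming the BestFirstCrawlingStrategy setup from the earlier script; filter_config here is a hypothetical FilterConfig instance):

filter_chain = build_filter_chain(filter_config)
deep_crawl_strategy = BestFirstCrawlingStrategy(
    max_depth=8,
    include_external=False,
    filter_chain=filter_chain,
)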

castlenthesky avatar Apr 04 '25 16:04 castlenthesky

There is still an issue for the case where the base domain varies.

The common case for this is www.example.com vs example.com, but ports matter too.

There is a fix for get_base_domain in #970 but the issue still remains in normalize_url_for_deep_crawl.

Unfortunately we can't just use the base domain, as that might not be valid in the www case; it depends on the DNS and server setup for the host. I think the best compromise is to register the base domain URL as visited but actually visit the normalised URL.
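
A rough sketch of that compromise (illustrative only; the visited_key helper is hypothetical and not the crawl4ai normalize_url_for_deep_crawl implementation): key the visited set on a canonical form of the host while fetching the URL as it was discovered.

from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def visited_key(url: str) -> str:
    """Canonical form used only for deduplication: lower-case host,
    leading 'www.' removed, default port dropped."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    port = parts.port
    netloc = host if port in (None, DEFAULT_PORTS.get(parts.scheme)) else f"{host}:{port}"
    return urlunsplit((parts.scheme, netloc, parts.path or "/", parts.query, ""))

seen: set[str] = set()
for link in ["https://www.example.com:443/a", "https://example.com/a"]:
    key = visited_key(link)
    if key in seen:
        continue
    seen.add(key)
    # Fetch the URL as it was discovered, since e.g. a www-only host
    # may not resolve without the 'www.' prefix.
    print("crawl:", link)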

PR to fix this edge case is https://github.com/unclecode/crawl4ai/pull/994

stevenh avatar Apr 16 '25 20:04 stevenh