[Bug]: deep crawl crawls same url multiple times
crawl4ai version
0.5.0.post4
Expected Behavior
It should not crawl the same URL more than once.
Current Behavior
I see in the log that URLs are crawled multiple times:
INFO:crawlApp:Processed page 1: https://out-door.co.il
[FETCH]... ↓ https://out-door.co.il/... | Status: True | Time: 3.48s
[SCRAPE].. ◆ https://out-door.co.il/... | Time: 0.237s
[COMPLETE] ● https://out-door.co.il/... | Status: True | Total: 3.72s
....
[FETCH]... ↓ https://out-door.co.il/... | Status: True | Time: 25.25s
[SCRAPE].. ◆ https://out-door.co.il/... | Time: 0.34s
[COMPLETE] ● https://out-door.co.il/... | Status: True | Total: 25.60s
INFO:app.process_crawl:✅ Result already saved: https://out-door.co.il/
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
OS
Ubuntu
Python version
3.12.3
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response
I was able to mitigate this using the following URLFilter:
from crawl4ai import URLFilter


class FirstTimeURLFilter(URLFilter):
    """Filter that accepts a URL the first time it is seen and rejects subsequent occurrences."""

    __slots__ = ("seen_urls",)

    def __init__(self):
        super().__init__(name="FirstTimeURLFilter")
        self.seen_urls = set()

    def apply(self, url: str) -> bool:
        if url in self.seen_urls:
            # print(f"❌ URL already seen: {url}")
            self._update_stats(False)
            return False
        else:
            # print(f"✅ URL not seen: {url}")
            self.seen_urls.add(url)
            self._update_stats(True)
            return True
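In case it helps anyone, here is a rough sketch of how such a filter can be wired into a deep-crawl run. It reuses the same FilterChain / BFSDeepCrawlStrategy pieces as the repro script further down; treat it as an illustration rather than my exact setup.

import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain


async def demo():
    # Put the dedup filter first so downstream filters never see a repeated URL.
    filter_chain = FilterChain(filters=[FirstTimeURLFilter()])
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            include_external=False,
            filter_chain=filter_chain,
        ),
        stream=True,
    )
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://example.com/", config=config):
            print(result.url)


asyncio.run(demo())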
@eliaweiss This issue has already been identified and patched in commit #f78c46446ba. It's already available in the next branch and is part of 0.5.0.post4, so perhaps update and try again.
Just to be sure, can you share a few of the URLs that were crawled over and over again? For example, what are the values in self.seen_urls in the FirstTimeURLFilter?
@aravindkarnam I don't think it was fixed, since I'm using 0.5.0.post4.
Here is code that reproduces it:
import asyncio
from hashlib import md5
import os

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    URLPatternFilter,
)


async def process_result(result):
    print(f"URL: {result.url}")
    print(f"Depth: {result.metadata.get('depth', 0)}")
    # # use md5 to create a unique filename
    # filename = md5(result.url.encode()).hexdigest()
    # # save to file
    # with open(f"results/{filename}.md", "w") as f:
    #     f.write(result.markdown)


async def main():
    # Create a filter to exclude URLs ending with image file extensions
    image_filter = URLPatternFilter(patterns=["*.jpg", "*.jpeg", "*.png", "*.gif", "*.bmp"])
    # Create a filter chain that uses the image filter
    filter_chain = FilterChain(filters=[image_filter])

    # Configure a deep crawl up to 10 levels
    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        # semaphore_count=1,
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=10,
            include_external=False,
            # Maximum number of pages to crawl (optional)
            max_pages=500,
            filter_chain=filter_chain,
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        stream=True,  # Enable streaming
        verbose=True,
    )

    # create results folder
    os.makedirs("results", exist_ok=True)
    # delete all files in the results folder
    for file in os.listdir("results"):
        os.remove(os.path.join("results", file))

    page_count = 0
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://out-door.co.il/", config=config):
            page_count += 1
            print(f"page_count {page_count}")
            await process_result(result)
    print(f"Crawled {page_count} pages in total")


if __name__ == "__main__":
    asyncio.run(main())
RCA
Deep crawl visits the same URL multiple times because of the asynchronous nature of execution and the ordering of these two steps:
- URLs are checked against the visited set when they are added to the queue.
- URLs are added to the visited set when they are popped from the queue for crawling.
These two steps don't happen back to back in parallel execution. The logic works as desired in sequential execution, but in parallel execution duplicates can be queued again before the URL is dequeued and marked as visited.
Solution
Add a URL to the visited set as soon as it's added to the queue, not when it is dequeued for crawling. If for any reason crawling of that URL fails, we already retry with exponential backoff up to 3 times, so retries are taken care of. There is no need to wait until dequeuing to mark it visited: once a URL is added to the queue it's as good as visited, and at that point it will either succeed or fail, which will be captured in result.status.
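To illustrate the idea with a generic sketch (this is not crawl4ai's actual BFS code, and fetch_links is a placeholder for whatever fetches a page and returns its outgoing links): marking a URL as visited at enqueue time closes the window in which parallel workers can queue the same link twice.

import asyncio


async def bfs_crawl(start_url, fetch_links, max_depth=2, num_workers=4):
    # fetch_links is a placeholder coroutine: url -> list of outgoing links.
    visited = {start_url}          # seed is marked visited immediately
    queue = asyncio.Queue()
    await queue.put((start_url, 0))

    async def worker():
        while True:
            url, depth = await queue.get()
            try:
                links = await fetch_links(url)
                if depth < max_depth:
                    for link in links:
                        # Check-and-mark happens at ENQUEUE time, with no await
                        # in between, so no other worker can queue a duplicate.
                        if link not in visited:
                            visited.add(link)
                            await queue.put((link, depth + 1))
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(num_workers)]
    await queue.join()
    for t in tasks:
        t.cancel()
    return visited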
I see this is in progress, but wanted to share I am also experiencing this behavior and am happy to test potential fixes.
@castlenthesky Thanks a bunch. It's in https://github.com/unclecode/crawl4ai/tree/2025-MAR-ALPHA-1 branch. Update here if you run into any further issues while testing.
@aravindkarnam
I'm new to contributing and excited to help if I can. Can you help me confirm my testing method?
- Cloned the 2025-MAR-ALPHA-1 branch into a new folder.
- Created a virtual environment.
- Ran pip install -r requirements.txt to install dependencies.
- Ran pip install -e . to install the module.
- Copied my script into the new repo and ran it.
Following the above steps, I continue to have circular crawl logic.
I have attached my script in hopes it helps.
import asyncio
import uuid
from re import Pattern
from typing import List, Union

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import (
    ContentTypeFilter,
    DomainFilter,
    FilterChain,
    URLPatternFilter,
)
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

from src.utilities.file_management import initialize_directory


async def initiate_crawl(
    target_url: str,
    target_directory: str,
    browser_profile: str = "default",
    css_selector: str = "*",
    excluded_tags: List[str] = [],  # ["script", "style", "form", "header", "footer", "aside", "nav"],
    max_depth: int = 5,
    target_keywords: Union[List[str], None] = None,
    url_patterns: Union[str, List[str | Pattern]] = ["*"],
    allowed_file_types: Union[List[str], None] = None,
    allowed_domains: List[str] = [],
    blocked_domains: List[str] = ["*old.*"],
):
    # Initialize the output directory
    initialize_directory(target_directory)

    crawl_id = uuid.uuid4()
    print(f"Starting crawl with ID: {crawl_id}")

    filters = [
        DomainFilter(allowed_domains=allowed_domains, blocked_domains=blocked_domains),
        URLPatternFilter(patterns=url_patterns),
    ]
    if allowed_file_types:
        filters.append(ContentTypeFilter(allowed_types=allowed_file_types))
    filter_chain = FilterChain(filters)

    keyword_scorer = (
        KeywordRelevanceScorer(keywords=target_keywords, weight=0.7)
        if target_keywords
        else None
    )

    deep_crawl_strategy = BestFirstCrawlingStrategy(
        max_depth=max_depth,
        include_external=False,
        filter_chain=filter_chain,
        url_scorer=keyword_scorer,
    )

    config = CrawlerRunConfig(
        css_selector=css_selector,
        excluded_tags=excluded_tags,
        cache_mode=CacheMode.BYPASS,
        scraping_strategy=LXMLWebScrapingStrategy(),
        deep_crawl_strategy=deep_crawl_strategy,
        stream=True,
        verbose=True,
    )

    crawler = AsyncWebCrawler(
        config=BrowserConfig(
            headless=True,
            verbose=True,
            use_managed_browser=True,
            browser_type="chromium",
            user_data_dir=f"browser_profiles/{browser_profile}",
        )
    )

    page_count = 0
    results = []
    visited_urls = set()

    # Execute the crawl
    await crawler.start()
    async for result in await crawler.arun(target_url, config=config):
        try:
            if result.success:
                # append url and file details
                results.append(result.url)
                # check if url has already been visited
                if result.url in visited_urls:
                    print(f"⛔ {result.url}")
                    continue
                # log visited urls
                visited_urls.add(result.url)
                # log page count
                page_count += 1
                print(f"✅ {page_count}: {result.url}")
        except Exception as e:
            print(f"Error processing {result.url}: {e}")
    await crawler.close()

    # Analyze the results
    print(f"Crawled {len(results)} pages")
    print(results[0])


if __name__ == "__main__":
    asyncio.run(
        initiate_crawl(
            target_url="https://microsoft.github.io/autogen/stable/",  # Updated to be a list of URLs.
            target_directory="assets/autogen",
            max_depth=4,
            excluded_tags=["script", "style", "form", "header"],
            url_patterns=[
                "*microsoft.github.io/autogen/stable*"
            ],  # stable documentation only
        )
    )
@unclecode @aravindkarnam Another issue observed is that the max_pages functionality isn't working correctly. Even after setting the limit to 10, it continues crawling beyond 10 pages during deep crawling.
@Harinib-Kore I haven't seen this issue so far. Can you share a code example?
Also could you log this as a separate issue with all the details. Since this one is already tied to a patch, this will get closed automatically soon as the PR is merged to main. Also it helps to track these issues independently.
I have raised a bug; please refer to https://github.com/unclecode/crawl4ai/issues/927.
@castlenthesky Sorry about the late reply. Missed your comment. Yes, you got that right! Try it out and let me know your finding. And thanks for taking the time to help us in testing!
@aravindkarnam no worries. I was able to pull your branch and test. The issue has been resolved.
I also realized I had an issue in my implementation of allowed_domains and allowed_urls: I wasn't implementing the URL filter correctly. For anybody else having issues, I have changed my implementation to the following and resolved the URL duplication (thank you @eliaweiss for pointing me in the right direction!):
# Example using dataclass config
config = CrawlConfig(
    crawl_pipeline_name="example_user_docs",
    browser_profile="default",
    target_url="https://docs.example.io/",
    target_directory="assets_test/example/site",
    max_depth=8,
    max_pages=500,
    excluded_tags=["script", "style", "form", "header"],
    allowed_domains=["docs.example.io"],
    blocked_url_patterns=[
        "*/legacy-docs.example.io/*",
        "*docs.example.io/blog/*",
        "*docs.example.io/cart/*",
        "*docs.example.io/about/*",
        "*docs.example.io/guides/labs/*",
        "*docs.example.io/example-plus/*",
    ],
)
This config gets passed to the appropriate constructors, but the URL filter was key:
def build_filter_chain(config: FilterConfig) -> FilterChain:
    """Build a filter chain for URL filtering during crawling.

    Args:
        config: Filter configuration object

    Returns:
        FilterChain: Configured filter chain for crawler
    """
    filters = [
        FirstTimeURLFilter(),
        DomainFilter(allowed_domains=config.allowed_domains, blocked_domains=config.blocked_domains),
    ]

    # Add allowed URL patterns filter (whitelist)
    if config.allowed_url_patterns:
        patterns = _normalize_patterns(config.allowed_url_patterns)
        allowed_filter = URLPatternFilter(patterns=patterns)
        filters.append(allowed_filter)

    # Add blocked URL patterns filter (blacklist)
    if config.blocked_url_patterns:
        patterns = _normalize_patterns(config.blocked_url_patterns)
        blocked_filter = URLPatternFilter(patterns=patterns)

        # Create an inverse filter to block matching URLs
        class InverseFilter(URLFilter):
            def __init__(self, base_filter: URLFilter):
                super().__init__(name=f"Inverse{base_filter.name}")
                self.base_filter = base_filter

            def apply(self, url: str) -> bool:
                return not self.base_filter.apply(url)

        inverse_blocked_filter = InverseFilter(blocked_filter)
        filters.append(inverse_blocked_filter)

    if config.allowed_file_types:
        filters.append(ContentTypeFilter(allowed_types=config.allowed_file_types))

    logger.info(f"Filter chain created with {len(filters)} filters")
    return FilterChain(filters)
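Roughly how the chain then plugs into the strategy; FilterConfig and the config variables here are my own, not crawl4ai classes, so take this wiring as a sketch:

# Sketch: hand the custom chain to the deep-crawl strategy.
filter_chain = build_filter_chain(filter_config)  # filter_config: my FilterConfig instance

deep_crawl_strategy = BestFirstCrawlingStrategy(
    max_depth=config.max_depth,
    include_external=False,
    filter_chain=filter_chain,
)

run_config = CrawlerRunConfig(
    deep_crawl_strategy=deep_crawl_strategy,
    scraping_strategy=LXMLWebScrapingStrategy(),
    stream=True,
)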
There is still an issue for the case where the base domain varies.
The common case is www.example.com vs example.com, but ports matter too.
There is a fix for get_base_domain in #970, but the issue still remains in normalize_url_for_deep_crawl.
Unfortunately we can't just use the base domain, as that might not be valid in the www case; it depends on the DNS and server setup for the host. I think the best compromise is to register the base domain URL as visited but actually visit the normalised URL.
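To make the www/port ambiguity concrete, a rough sketch of computing a dedup key separate from the URL that actually gets fetched (illustrative only, not the actual normalize_url_for_deep_crawl):

from urllib.parse import urlsplit, urlunsplit

def dedup_key(url: str) -> str:
    # Collapse "www." and default ports so http://www.example.com:80/a and
    # http://example.com/a share one key, while the URL actually handed to
    # the crawler keeps whatever host form the site links to.
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    netloc = host if parts.port in (None, default_port) else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path or "/", parts.query, ""))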
PR to fix this edge case is https://github.com/unclecode/crawl4ai/pull/994