crawl4ai
[Bug]: Crawler Generating Broken Links from Relative Paths
crawl4ai version
0.6.2
Expected Behavior
The crawler should resolve relative URLs against the current page's full URL (e.g., against https://docs.example.com/platform/release-notes/recent-updates/) rather than against the root domain.
cc: @unclecode @aravindkarnam
Current Behavior
The crawler resolves relative paths against the wrong base URL, forming broken links that point to non-existent pages.
Example:
[FETCH]... ↓ https://example.github.io/docs/platform/settings/settings/integrations/enable-hugging-face | ✓ | ⏱: 0.32s
[SCRAPE].. ◆ https://example.github.io/docs/platform/settings/settings/integrations/enable-hugging-face | ✓ | ⏱: 0.01s
[COMPLETE] ● https://example.github.io/docs/platform/settings/settings/integrations/enable-hugging-face | ✓ | ⏱: 0.33s
The crawled URL returns HTTP 404, yet the result is reported as successful (success: True).
Is this reproducible?
Yes
Inputs Causing the Bug
<meta charset="utf-8">
<meta name="viewport" content="width=device-width,initial-scale=1">
<meta name="author" content="Content Experience Team">
<link rel="canonical" href="https://docs.example.com/platform/release-notes/recent-updates/">
<link rel="prev" href="../../getting-started/tutorials/build-code-tool/">
<link rel="next" href="../../ai-agents/multi-agent-system/">
<link rel="icon" href="../../images/platform_favicon.svg">
<meta name="generator" content="mkdocs-1.5.3, mkdocs-material-9.5.10">
Suggested fix: update the crawler's link-resolution logic to use the full current-page URL as the base when resolving relative paths, the same way browsers interpret <a href="../../..."> (see the urljoin sketch after the example below).
📎 Example:
- Page URL: https://docs.example.com/platform/release-notes/recent-updates/
- Relative Link Found: ../../getting-started/tutorials/build-code-tool/
- Malformed Link Generated: https://docs.example.com/getting-started/tutorials/build-code-tool/
- Correct Link: https://docs.example.com/platform/getting-started/tutorials/build-code-tool/
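For reference, Python's standard urllib.parse.urljoin follows the same RFC 3986 resolution rules as browsers, so it can be used to verify the expected output from the inputs above. This is a minimal standalone sketch, not crawl4ai code:

from urllib.parse import urljoin

# True page URL, taken from the <link rel="canonical"> tag above
page_url = "https://docs.example.com/platform/release-notes/recent-updates/"
relative = "../../getting-started/tutorials/build-code-tool/"

# Browser-style resolution yields the correct link
print(urljoin(page_url, relative))
# https://docs.example.com/platform/getting-started/tutorials/build-code-tool/

# Resolving against a truncated base reproduces the malformed link
print(urljoin("https://docs.example.com/platform/release-notes/", relative))
# https://docs.example.com/getting-started/tutorials/build-code-tool/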
Steps to Reproduce
Run a deep crawl starting from a page in a subdirectory whose HTML contains relative links in <a href="..."> or <link href="..."> tags.
Code Snippets
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy

async def run_advanced_crawler():
    browser_config = BrowserConfig(
        java_script_enabled=True,
        headless=False,
    )
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=5,
            max_pages=100,
            include_external=False,
        ),
        stream=True,           # yield results as they complete
        verbose=True,
        scan_full_page=True,   # scroll the page so lazy-loaded links are found
        scroll_delay=0.5,
        check_robots_txt=True,
    )
    results = []
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # With stream=True, arun() returns an async generator of results
        async for result in await crawler.arun("https://docs.example.com/", config=config):
            results.append(result)
            print(f"  {result.url}")
    print(f"Crawled {len(results)} high-value pages")

if __name__ == "__main__":
    asyncio.run(run_advanced_crawler())
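Until this is fixed, a possible workaround is to re-extract the raw href values from the page source and resolve them with urljoin against the page's own URL. This is only a sketch, under the assumption that result.html holds the raw page HTML and result.url the final page URL (both are CrawlResult fields, but verify against your crawl4ai version):

from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects raw href values from <a> and <link> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def resolve_links(page_url, html):
    # Resolve every raw href against the page's own URL, browser-style
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(page_url, href) for href in collector.hrefs]

# Inside the crawl loop above:
#     fixed_links = resolve_links(result.url, result.html)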
Environment Details
- OS: Linux
- Python Version: 3.9.7
- Browser: Chrome
- Browser Version: 131.0.6778.139
Error Logs & Screenshots
No response