crawl4ai icon indicating copy to clipboard operation
crawl4ai copied to clipboard

[Bug]: Crawler Generating Broken Links from Relative Paths

Open Harinib-Kore opened this issue 6 months ago • 0 comments

crawl4ai version

0.6.2


Expected Behavior

The crawler should correctly resolve relative URLs based on the current page's location (e.g., appending relative paths to https://example.com/release-notes/ rather than the root domain).
cc: @unclecode @aravindkarnam


Current Behavior

It is forming broken links.

Example:

[FETCH]... ↓ https://example.github.io/docs/platform/settings/settings/integrations/enable-hugging-face  | ✓ | ⏱: 0.32s  
[SCRAPE].. ◆ https://example.github.io/docs/platform/settings/settings/integrations/enable-hugging-face  | ✓ | ⏱: 0.01s  
[COMPLETE] ● https://example.github.io/docs/platform/settings/settings/integrations/enable-hugging-face  | ✓ | ⏱: 0.33s  
404  
True

Is this reproducible?

Yes


Inputs Causing the Bug

<meta charset="utf-8">
<meta name="viewport" content="width=device-width,initial-scale=1">
<meta name="author" content="Content Experience Team">

<link rel="canonical" href="https://docs.example.com/platform/release-notes/recent-updates/">
<link rel="prev" href="../../getting-started/tutorials/build-code-tool/">
<link rel="next" href="../../ai-agents/multi-agent-system/">
<link rel="icon" href="../../images/platform_favicon.svg">
<meta name="generator" content="mkdocs-1.5.3, mkdocs-material-9.5.10">

Update the crawler's link resolution logic to correctly use the full current page URL as the base when resolving relative paths (similar to how browsers interpret <a href="../../...">).

📎 Example:

  • Page URL:
    https://docs.example.com/platform/release-notes/

  • Relative Link Found:
    ../../getting-started/tutorials/build-code-tool/

  • Malformed Link Generated:
    https://docs.example.com/getting-started/tutorials/build-code-tool/

  • Correct Link:
    https://docs.example.com/platform/getting-started/tutorials/build-code-tool/


Steps to Reproduce

Use a deep crawl starting from a subdirectory that contains relative links in href or link tags.


Code Snippets

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter
)

async def run_advanced_crawler():
    browser_config = BrowserConfig(
        java_script_enabled=True,
        headless=False,
    )

    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=5,
            max_pages=100,
            include_external=False,
        ),
        stream=True,
        verbose=True,
        scan_full_page=True,
        scroll_delay=0.5,
        check_robots_txt=True,
    )

    results = []
    async with AsyncWebCrawler(config=browser_config) as crawler:
        async for result in await crawler.arun("https://docs.example.com/", config=config):
            results.append(result)
            print(f" {result.url}")

    print(f"Crawled {len(results)} high-value pages")

if __name__ == "__main__":
    asyncio.run(run_advanced_crawler())

Environment Details

  • OS: Linux
  • Python Version: 3.9.7
  • Browser: Chrome
  • Browser Version: 131.0.6778.139

Error Logs & Screenshots

No response

Harinib-Kore avatar May 18 '25 17:05 Harinib-Kore