crawl4ai icon indicating copy to clipboard operation
crawl4ai copied to clipboard

Incorrect Conversion of Relative to Absolute Paths for href in Web Pages

Open yulin0629 opened this issue 1 year ago • 1 comments

The crawl4ai tool is not utilizing the provided base URL when converting relative paths to absolute paths for href attributes. Instead, it appears to be using some other method (possibly the domain or current page URL) for path resolution. This results in incorrect absolute URLs, potentially leading to broken links or inaccurate navigation within the crawled content.

yulin0629 avatar Nov 05 '24 09:11 yulin0629

@yulin0629 I would appreciate an example as I expect this to improve over time. If you can provide one example of your input, what you're getting, and what you expect, it would be very helpful. Thank you.

unclecode avatar Nov 06 '24 06:11 unclecode

This example returns wrong URLs:

import crawl4ai

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.metabase.com/docs/latest/embedding/interactive-embedding",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

The result looks similar to:

[ ](https://www.metabase.com/docs/latest/embedding/</>)
Product 
###### [Self-service analytics Business intelligence for everyone ](https://www.metabase.com/docs/latest/embedding/</product/business-analytics>) [ ![Embed](https://www.metabase.com/images/navigation-header/embed.svg) Embedded analytics Fast, flexible customer-facing analytics ](https://www.metabase.com/docs/latest/embedding/</product/embedded-analytics>)
Metabase Plans [ Starter and Open Source  ](https://www.metabase.com/docs/latest/embedding/</product/starter>) [ Pro  ](https://www.metabase.com/docs/latest/embedding/</product/pro>) [ Enterprise  ](https://www.metabase.com/docs/latest/embedding/</product/enterprise>)
Platform [ Data Sources  ](https://www.metabase.com/docs/latest/embedding/</data_sources/>) [ Security  ](https://www.metabase.com/docs/latest/embedding/</security/>) [ Cloud  ](https://www.metabase.com/docs/latest/embedding/</cloud/>)
[ Watch a 5-minute demo to see how to build a dashboard  ](https://www.metabase.com/docs/latest/embedding/</demo/>)
Features 
...

I narrowed down, that urlparse.urljoin receives wrong inputs:

base:  https://www.metabase.com/docs/latest/embedding/interactive-embedding
url: </product/business-analytics>

and link_url receives wrong a.href values:

a:  {'href': '</product/business-analytics>'}

mindaugasrukas avatar Jan 21 '25 19:01 mindaugasrukas

Relative URLs are now handled in newer versions.

aravindkarnam avatar Jan 22 '25 13:01 aravindkarnam

which version? I'm still having this issue in version 0.6.3

blasco avatar Jun 06 '25 15:06 blasco

Same here. still seeing the issue in 0.6.3.

lrhazi-cua avatar Jun 16 '25 19:06 lrhazi-cua