Incorrect Conversion of Relative to Absolute Paths for href in Web Pages
The crawl4ai tool is not utilizing the provided base URL when converting relative paths to absolute paths for href attributes. Instead, it appears to be using some other method (possibly the domain or current page URL) for path resolution. This results in incorrect absolute URLs, potentially leading to broken links or inaccurate navigation within the crawled content.
@yulin0629 I would appreciate an example as I expect this to improve over time. If you can provide one example of your input, what you're getting, and what you expect, it would be very helpful. Thank you.
This example returns wrong URLs:
import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    # Crawl a page whose navigation links use root-relative hrefs
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.metabase.com/docs/latest/embedding/interactive-embedding",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
The result looks similar to:
[ ](https://www.metabase.com/docs/latest/embedding/</>)
Product
###### [Self-service analytics Business intelligence for everyone ](https://www.metabase.com/docs/latest/embedding/</product/business-analytics>) [  Embedded analytics Fast, flexible customer-facing analytics ](https://www.metabase.com/docs/latest/embedding/</product/embedded-analytics>)
Metabase Plans [ Starter and Open Source ](https://www.metabase.com/docs/latest/embedding/</product/starter>) [ Pro ](https://www.metabase.com/docs/latest/embedding/</product/pro>) [ Enterprise ](https://www.metabase.com/docs/latest/embedding/</product/enterprise>)
Platform [ Data Sources ](https://www.metabase.com/docs/latest/embedding/</data_sources/>) [ Security ](https://www.metabase.com/docs/latest/embedding/</security/>) [ Cloud ](https://www.metabase.com/docs/latest/embedding/</cloud/>)
[ Watch a 5-minute demo to see how to build a dashboard ](https://www.metabase.com/docs/latest/embedding/</demo/>)
Features
...
I narrowed it down: urlparse.urljoin receives the wrong inputs:
base: https://www.metabase.com/docs/latest/embedding/interactive-embedding
url: </product/business-analytics>
and link_url receives the wrong a.href values:
a: {'href': '</product/business-analytics>'}
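The effect of those inputs can be reproduced with the standard library alone. This is a minimal sketch, not crawl4ai's actual code: it assumes the href reaches urljoin still wrapped in angle brackets, as in the values above, and `strip_angle_brackets` is a hypothetical helper introduced only for illustration. Because the wrapped href starts with `<` rather than `/`, urljoin treats it as a path relative to the current directory instead of the site root.

```python
from urllib.parse import urljoin

# Values taken from the debugging output above
BASE = "https://www.metabase.com/docs/latest/embedding/interactive-embedding"
RAW_HREF = "</product/business-analytics>"


def strip_angle_brackets(href: str) -> str:
    """Hypothetical helper: drop one pair of wrapping angle brackets, if present."""
    href = href.strip()
    if href.startswith("<") and href.endswith(">"):
        return href[1:-1]
    return href


# The wrapped href is resolved against the directory of the base URL
broken = urljoin(BASE, RAW_HREF)

# With the brackets removed, the leading "/" makes it root-relative
fixed = urljoin(BASE, strip_angle_brackets(RAW_HREF))

print(broken)
print(fixed)
```

If this assumption holds, the first result has the same shape as the links in the markdown output above, which suggests the brackets are added before the URL is joined rather than by urljoin itself.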
Relative URLs are now handled in newer versions.
Which version? I'm still having this issue in version 0.6.3.
Same here. Still seeing the issue in 0.6.3.