crawl4ai
[Bug]: relative urls incorrect after redirects
crawl4ai version
0.6.3
Expected Behavior
Relative links on a page I was redirected to should be resolved against the URL of the page the browser was redirected to, not the URL of the initial page.
Example:
https://example.com/page-a redirects to https://example.com/redirect-target-page. The redirect-target-page contains relative links, e.g. `subpage`. According to https://www.rfc-editor.org/rfc/rfc3986#section-4.2, this is called a relative-path reference.
Current Behavior
When constructing the absolute link from the relative-path reference, the address of page-a is used as the base instead of the address of the page the browser was redirected to. So the link looks like this: https://example.com/page-a/subpage-of-redirect-target-page but should be https://example.com/redirect-target-page/subpage-of-redirect-target-page.
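To illustrate the difference, here is a minimal sketch using Python's standard `urllib.parse.urljoin`. This is not crawl4ai's actual code, and the variable names are made up for the example. The trailing slashes on both base URLs are an assumption added here so that the last path segment is treated as a directory; without a trailing slash, RFC 3986 resolution replaces the last segment of the base path.

```python
from urllib.parse import urljoin

# Hypothetical URLs for illustration only. Trailing slashes are assumed so
# that the last path segment is kept during reference resolution.
requested_url = "https://example.com/page-a/"            # URL passed to the crawler
final_url = "https://example.com/redirect-target-page/"  # URL after the redirect
href = "subpage-of-redirect-target-page"                 # relative-path reference

# Resolving against the originally requested URL gives the wrong link.
wrong = urljoin(requested_url, href)

# Resolving against the URL the browser actually ended up on gives the right one.
right = urljoin(final_url, href)

print(wrong)
print(right)
```

The point is only that the base passed to the resolver has to be the post-redirect URL; where that URL comes from inside the crawler is not shown here.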
Is this reproducible?
Yes
Inputs Causing the Bug
Website that redirects to another path inside the same website using e.g. `window.location.href`.
Steps to Reproduce
Please see the unit test in my fork of this repo.
Reproduction is complicated by the fact that deep crawling of a localhost-served website doesn't work in the current version. My fork also contains a change that fixes that.
Code snippets
OS
Linux
Python version
3.13
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response