[Bug]: relative urls incorrect after redirects

Open • 0xC0DEBA5E opened this issue 5 months ago • 1 comment

crawl4ai version

0.6.3

Expected Behavior

Relative links on a page reached through a redirect should be resolved against the URL of the page the browser was redirected to, not the URL of the initially requested page.

Example:

https://example.com/page-a redirects to https://example.com/redirect-target-page. The redirect-target-page contains relative links, e.g. `subpage`. According to RFC 3986, section 4.2 (https://www.rfc-editor.org/rfc/rfc3986#section-4.2), this is called a relative-path reference.

Current Behavior

When the absolute link is constructed from the relative-path reference, the address of page-a is used as the base instead of the address of the page the browser was redirected to. The resulting link is https://example.com/page-a/subpage-of-redirect-target-page, but it should be https://example.com/redirect-target-page/subpage-of-redirect-target-page.
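
The difference can be illustrated with `urljoin` from Python's standard `urllib.parse` module, which resolves a reference against a base URL. This is only a minimal sketch and not crawl4ai's actual code: the helper name `resolve_links` is hypothetical, and the base URLs are written with trailing slashes as an assumption, because without a trailing slash the last path segment of the base is replaced rather than extended.

```python
from urllib.parse import urljoin


def resolve_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve each href against base_url (hypothetical helper, for illustration only)."""
    return [urljoin(base_url, href) for href in hrefs]


# Assumed URLs: the trailing slashes matter for relative-path resolution.
initial_url = "https://example.com/page-a/"              # URL that was requested
final_url = "https://example.com/redirect-target-page/"  # URL after the redirect
hrefs = ["subpage-of-redirect-target-page"]

wrong = resolve_links(initial_url, hrefs)  # base = initially requested URL
right = resolve_links(final_url, hrefs)    # base = URL the browser ended up on

print(wrong)
print(right)
```

With these assumed inputs, `wrong` reproduces the link described under "Current Behavior" and `right` the expected one. The only difference between the two calls is which URL is passed as the base.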

Is this reproducible?

Yes

Inputs Causing the Bug

A website that redirects to another path within the same site, e.g. via `window.location.href`.
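
A site of this kind could be served locally with Python's standard library, roughly as sketched below. This is not the unit test from the fork. The page paths, the `PAGES` mapping, and the names `FixtureHandler`, `start_fixture_server` and `fetch` are all assumptions made for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical fixture pages: page-a redirects via JavaScript,
# and the redirect target contains a relative-path link.
PAGES = {
    "/page-a/": (
        "<html><body><script>"
        "window.location.href = '/redirect-target-page/';"
        "</script></body></html>"
    ),
    "/redirect-target-page/": (
        '<html><body><a href="subpage-of-redirect-target-page">subpage</a>'
        "</body></html>"
    ),
    "/redirect-target-page/subpage-of-redirect-target-page": (
        "<html><body>subpage</body></html>"
    ),
}


class FixtureHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGES.get(self.path)
        if body is None:
            self.send_error(404)
            return
        data = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format, *args):
        pass  # suppress per-request logging


def start_fixture_server() -> HTTPServer:
    """Start the fixture site on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), FixtureHandler)  # port 0 = pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(server: HTTPServer, path: str) -> str:
    """Fetch one page from the fixture server and return its HTML."""
    host, port = server.server_address[:2]
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        return conn.getresponse().read().decode("utf-8")
    finally:
        conn.close()
```

A crawler pointed at `/page-a/` on this server would have to execute the script to reach the redirect target. The plain `fetch` helper does not follow the JavaScript redirect and is only there to check that the fixture pages are served.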

Steps to Reproduce

Please see the unit test in my fork of this repo.

One obstacle to reproducing this is that deep crawling of a website served from localhost does not work in the current version. My fork also contains a change that fixes that.

Code snippets


OS

Linux

Python version

3.13

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

0xC0DEBA5E • Jul 03 '25 15:07