crawl4ai [Bug]: relative urls incorrect after redirects

Open 0xC0DEBA5E opened this issue 5 months ago • 1 comments

crawl4ai version

0.6.3

Expected Behavior

Relative links on a page I was redirected to should be resolved against the URL of the page the browser was redirected to, not the URL of the initially requested page.

Example:

https://example.com/page-a redirects to https://example.com/redirect-target-page. The redirect-target-page contains relative links, e.g. `subpage`. According to https://www.rfc-editor.org/rfc/rfc3986#section-4.2 this is called a relative-path reference.

Current Behavior

When the absolute link is constructed from the relative-path reference, the address of page-a is used as the base instead of the address of the page that was redirected to. So the link looks like this: https://example.com/page-a/subpage-of-redirect-target-page but should be https://example.com/redirect-target-page/subpage-of-redirect-target-page.
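
To illustrate the difference, here is a minimal sketch using only Python's standard `urllib.parse.urljoin`; it does not show crawl4ai's internal code. The variable names `requested_url` and `final_url` are hypothetical, and both bases are written with a trailing slash as an assumption so that the results match the URLs quoted above. Without the trailing slash, RFC 3986 resolution replaces the last path segment of the base instead of appending to it.

```python
from urllib.parse import urljoin

# Hypothetical names: the URL originally requested and the URL the browser
# ended up on after the redirect. Trailing slashes are an assumption.
requested_url = "https://example.com/page-a/"
final_url = "https://example.com/redirect-target-page/"

# Relative-path reference found on the redirect target page.
href = "subpage-of-redirect-target-page"

# Resolving against the originally requested URL gives the reported (wrong) link.
wrong = urljoin(requested_url, href)

# Resolving against the post-redirect URL gives the expected link.
right = urljoin(final_url, href)

# Without a trailing slash, the last path segment of the base is replaced.
no_slash = urljoin("https://example.com/redirect-target-page", href)

print(wrong)     # https://example.com/page-a/subpage-of-redirect-target-page
print(right)     # https://example.com/redirect-target-page/subpage-of-redirect-target-page
print(no_slash)  # https://example.com/subpage-of-redirect-target-page
```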

Is this reproducible?

Yes

Inputs Causing the Bug

Website that redirects to another path inside the same website using e.g. `window.location.href`.
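
A site of this shape could be sketched locally roughly as follows. This is an assumption-laden illustration, not the unit test from the fork: the `PAGES` table, `RedirectDemoHandler` and `make_server` are hypothetical names, and the paths simply mirror the example above. It uses only Python's standard `http.server` module.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical pages: /page-a redirects client-side via window.location.href,
# and the target page contains a relative-path reference.
PAGES = {
    "/page-a": '<script>window.location.href = "/redirect-target-page/";</script>',
    "/redirect-target-page/": '<a href="subpage-of-redirect-target-page">subpage</a>',
    "/redirect-target-page/subpage-of-redirect-target-page": "<p>subpage</p>",
}


class RedirectDemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGES.get(self.path)
        if body is None:
            self.send_error(404)
            return
        data = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format, *args):
        pass  # suppress per-request logging


def make_server(port=0):
    # Port 0 lets the OS pick a free port; read it from server.server_address.
    return HTTPServer(("127.0.0.1", port), RedirectDemoHandler)
```

Calling `make_server(8000).serve_forever()` would serve the pages at http://127.0.0.1:8000/page-a. As noted under Steps to Reproduce, deep-crawling a localhost-served site may not work in the affected version, so this only reproduces the input side of the bug.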

Steps to Reproduce

Please see the unit test in my fork of this repo.

The problem with reproducing this is that, in the current version, deep-crawling a website served from localhost doesn't work. My fork also contains a change that fixes that.

Code snippets

OS

Linux

Python version

3.13

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

Jul 03 '25 15:07 0xC0DEBA5E
