
Support `<base>` field for relative urls

Open zealot128 opened this issue 4 months ago • 5 comments

Recently, I stumbled upon a handful of CMSs that do not generate absolute URLs in their hrefs, but ALWAYS use relative links, combined with a base href in the head:

https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/base

<base href='https://mysite.com'>
<a href='about_us'>

Spidr does not use base. So, when it finds the same link again on that subpage, it generates "about_us/about_us", which usually returns a 404.
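
A minimal sketch of the difference, using only Ruby's standard `URI` library rather than Spidr itself. The page URL `https://mysite.com/about_us/` (with a trailing slash) is an assumption made for this illustration:

```ruby
require 'uri'

base_href = URI('https://mysite.com')            # value of the <base href> attribute
page_url  = URI('https://mysite.com/about_us/')  # assumed URL of the page being crawled

# Resolving against the <base> href, as browsers do
with_base = base_href.merge('about_us')

# Resolving against the page URL, ignoring <base>
without_base = page_url.merge('about_us')

puts with_base     # https://mysite.com/about_us
puts without_base  # https://mysite.com/about_us/about_us
```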

zealot128 avatar Aug 18 '25 11:08 zealot128

I fixed it in our company fork: https://github.com/zealot128-os/spidr/commit/d469ce832c031d4f5079fe3a2314ceec63785e72

Not sure if a PR for this is desired? I mostly used Claude to create the test and implementation, with minor adjustments after providing the HTTP spec context.

zealot128 avatar Aug 18 '25 14:08 zealot128

Could you give a real-world example of how <base> would be used? I'm trying to imagine the scenario where a webpage is accessible via one URL which Spidr followed, but you want all relative links to be resolved using the <base> href URL, which might be different from the original URL of the webpage.

postmodern avatar Aug 18 '25 19:08 postmodern

> Could you give a real-world example of how <base> would be used? I'm trying to imagine the scenario where a webpage is accessible via one URL which Spidr followed, but you want all relative links to be resolved using the <base> href URL, which might be different from the original URL of the webpage.

I also don't know why you would ever need that in a SaaS app, but in a CMS theme or similar it can be helpful: you can 'hardcode' URLs, and the site can later be deployed on blog.mycompany.com or mycompany.com/blog, and the links will all work even with a configured sub-path.

When building job crawlers for customers, I have already come across it a couple of times. Some CMSs, such as Typo3, may use this pattern more often.

Example on this current website:

Links in the footer are relative:

<a href='privacy.html'>Privacy</a>

Then, on /privacy.html the same footer is rendered, so Spidr without base-handling will pick up "privacy/privacy.html" and traverse infinitely (to max depth).

zealot128 avatar Aug 19 '25 08:08 zealot128

OK I think I see how <base> is being used here. Instead of using explicit fully qualified links, they are using <base> to DRY-up the links and provide an alternate base URL. Although, it does not make sense that the original webpage can be requested from a URL with a different base than the <base> href.

> Then, on /privacy.html the same footer is rendered, so Spidr without base-handling will pick up "privacy/privacy.html" and traverse infinitely (to max depth).

That should not happen. Spidr::Page#to_absolute(relative_link) uses URI::HTTP#merge which will replace the ending file name with the new relative link.

require 'net/http'
require 'spidr'

url = URI.parse('https://example.com/privacy.html')
response = Net::HTTP.get_response(url)  # Net::HTTP also handles https URLs
page = Spidr::Page.new(url, response)

page.to_absolute('privacy.html')
# => #<URI::HTTPS https://example.com/privacy.html>

The page URL would need to be https://example.com/privacy/ for a privacy.html relative link to expand to https://example.com/privacy/privacy.html. Even then, linking to privacy.html from https://example.com/privacy/privacy.html would still result in a https://example.com/privacy/privacy.html link (which would be ignored since it was already visited), not infinitely appending privacy or privacy.html to the link.
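
The three cases above can be checked without any network access, using the standard library's `URI#merge` directly. This sketch only demonstrates the merge semantics and does not go through `Spidr::Page`:

```ruby
require 'uri'

file_page   = URI('https://example.com/privacy.html')
dir_page    = URI('https://example.com/privacy/')
nested_page = URI('https://example.com/privacy/privacy.html')

# The last path segment of the page URL is replaced by the relative link
from_file   = file_page.merge('privacy.html')
# A trailing slash makes "privacy/" a directory, so the link is appended
from_dir    = dir_page.merge('privacy.html')
# Resolving again from the nested page yields the same URL, not a deeper one
from_nested = nested_page.merge('privacy.html')

puts from_file    # https://example.com/privacy.html
puts from_dir     # https://example.com/privacy/privacy.html
puts from_nested  # https://example.com/privacy/privacy.html
```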

postmodern avatar Aug 19 '25 19:08 postmodern

Of course, you are right; my example was not correct. I had already forgotten which company's website I crawled recently and made up a bad example. Here is the correct one:

https://www.promed-verbindet.de/jobs/kauffrau-kaufmann-m-w-d-im-gesundheitswesen

That link in the footer is just "privacy" and resolves in my Firefox to /privacy. Spidr without base handling resolves it to "/jobs/privacy", which returns a 404 in this case (but there are worse CMSs that will still render a page, present a search, etc., and then the indefinite loop starts!)

All major browsers respect the <base> tag, so web developers who rely on it will see no issue, but Spidr will loop indefinitely.

On master:

url = URI.parse("https://www.promed-verbindet.de/jobs/kauffrau-kaufmann-m-w-d-im-gesundheitswesen")
response = Net::HTTP.get_response(url)
page = Spidr::Page.new(url, response)
page.to_absolute('privacy.html')
# => #<URI::HTTPS https://www.promed-verbindet.de/jobs/privacy.html>

On my fork:

url = URI.parse("https://www.promed-verbindet.de/jobs/kauffrau-kaufmann-m-w-d-im-gesundheitswesen")
response = Net::HTTP.get_response(url)
page = Spidr::Page.new(url, response)
page.to_absolute('privacy.html')
# => #<URI::HTTPS https://www.promed-verbindet.de/privacy.html>
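
For readers who want to see the idea without opening the fork, here is a rough sketch of base-aware resolution using only the standard library. It is not the code from the commit linked above and not Spidr's API: the helper names `base_url_for` and `resolve_link` are hypothetical, the `example.com` URLs and HTML are made up, and the regex-based extraction is a simplification where a real implementation would use an HTML parser.

```ruby
require 'uri'

# Hypothetical helper: returns the URL that relative links should be
# resolved against. Falls back to the page URL when no usable
# <base href> is present.
def base_url_for(page_url, html)
  # Simplified extraction: only handles quoted href values
  match = html.match(/<base\b[^>]*\bhref\s*=\s*["']([^"']*)["']/i)
  return page_url unless match

  # The base href may itself be relative, so resolve it against the page URL
  page_url.merge(match[1])
rescue URI::InvalidURIError
  page_url
end

# Hypothetical helper: resolves a link the way a base-aware crawler would
def resolve_link(page_url, html, link)
  base_url_for(page_url, html).merge(link)
end

page_url  = URI('https://www.example.com/jobs/some-job')
with_tag  = %(<head><base href="https://www.example.com/"></head><a href="privacy">Privacy</a>)
no_tag    = %(<a href="privacy">Privacy</a>)

base_aware = resolve_link(page_url, with_tag, 'privacy')
page_only  = resolve_link(page_url, no_tag, 'privacy')

puts base_aware  # https://www.example.com/privacy
puts page_only   # https://www.example.com/jobs/privacy
```

Falling back to the page URL keeps the behavior unchanged for pages that have no `<base>` element.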

zealot128 avatar Aug 19 '25 20:08 zealot128