Support `<base>` field for relative urls
Recently, I stumbled upon a handful of CMSs that do not generate absolute URLs in their hrefs but ALWAYS use relative links, combined with a base href in the head:
https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/base
<base href='https://mysite.com'>
<a href='about_us'>
Spidr does not use base. So when we find the same link again on the linked sub-page, it generates "about_us/about_us", which usually returns a 404.
I fixed it in our company fork: https://github.com/zealot128-os/spidr/commit/d469ce832c031d4f5079fe3a2314ceec63785e72
Not sure whether a PR for this is desired? I mostly used Claude to create the tests and implementation, with minor adjustments, after providing the HTTP spec context.
Could you give a real-world example of how <base> would be used? I'm trying to imagine the scenario where a webpage is accessible via one URL which Spidr followed, but you want all relative links to be resolved using the <base> href URL, which might be different from the original URL of the webpage.
I also don't know why you would ever need that in a SaaS app, but in a CMS theme or similar it can be helpful: you can 'hardcode' URLs, and the site can later be deployed on blog.mycompany.com or mycompany.com/blog, and the links will all work even with a configured sub-path.
When building job crawlers for customers, I have run into this a couple of times already. Some CMSs, such as Typo3, may use this pattern more often.
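To make the deployment argument concrete, here is a minimal sketch using only Ruby's standard library. The link list and the `expand_links` helper are hypothetical and exist only for illustration; the two base URLs are the deployment targets mentioned above, written with a trailing slash so that relative links are appended rather than replacing the last path segment.

```ruby
require 'uri'

# Hypothetical theme links, written once without any host or sub-path.
THEME_LINKS = ['about_us', 'posts/hello.html'].freeze

# Hypothetical helper: expand each relative link against a <base> href.
def expand_links(base_href, links)
  links.map { |link| URI.join(base_href, link).to_s }
end

subdomain = expand_links('https://blog.mycompany.com/', THEME_LINKS)
sub_path  = expand_links('https://mycompany.com/blog/', THEME_LINKS)

puts subdomain # https://blog.mycompany.com/about_us, https://blog.mycompany.com/posts/hello.html
puts sub_path  # https://mycompany.com/blog/about_us, https://mycompany.com/blog/posts/hello.html
```

The same hrefs work under both deployments because only the base changes.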
Example on this current website:
Links in the footer are relative:
<a href='privacy.html'>Privacy</a>
Then, on /privacy.html the same footer is rendered, so Spidr without base-handling will pick up "privacy/privacy.html" and traverse infinitely (to max depth).
OK I think I see how <base> is being used here. Instead of using explicit fully qualified links, they are using <base> to DRY-up the links and provide an alternate base URL. Although, it does not make sense that the original webpage can be requested from a URL with a different base than the <base> href.
Regarding Spidr picking up "privacy/privacy.html" on /privacy.html and traversing infinitely (to max depth): that should not happen. Spidr::Page#to_absolute(relative_link) uses URI::HTTP#merge, which replaces the trailing file name with the new relative link.
require 'net/http'
require 'spidr'

url = URI.parse('https://example.com/privacy.html')
response = Net::HTTP.get_response(url)
page = Spidr::Page.new(url, response)
page.to_absolute('privacy.html')
# => #<URI::HTTPS https://example.com/privacy.html>
The page URL would need to be https://example.com/privacy/ for a privacy.html relative link to expand to https://example.com/privacy/privacy.html. Even then, linking to privacy.html from https://example.com/privacy/privacy.html would still result in a https://example.com/privacy/privacy.html link (which would be ignored since it was already visited), not infinitely appending privacy or privacy.html to the link.
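The merge behaviour described here can be checked with Ruby's standard library alone. This is a minimal sketch using the same example.com URLs. It calls URI#merge directly rather than going through Spidr::Page#to_absolute, so it only illustrates the resolution rule, not Spidr itself.

```ruby
require 'uri'

# Page URL ends in a file name: the last path segment is replaced.
from_file = URI.parse('https://example.com/privacy.html').merge('privacy.html')

# Page URL ends in a slash: the link is appended to the directory.
from_dir = URI.parse('https://example.com/privacy/').merge('privacy.html')

# Resolving the same link again from the resulting page yields the same URL,
# so the path does not keep growing.
again = from_dir.merge('privacy.html')

puts from_file # https://example.com/privacy.html
puts from_dir  # https://example.com/privacy/privacy.html
puts again     # https://example.com/privacy/privacy.html
```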
Of course you are right, my example was not correct. I had already forgotten which company's website I crawled recently and made up a bad example. Here is the correct one:
https://www.promed-verbindet.de/jobs/kauffrau-kaufmann-m-w-d-im-gesundheitswesen
The link in the footer is just "privacy" and resolves in my Firefox to /privacy. Spidr without base handling will use "/jobs/privacy", which returns a 404 in this case (but there are worse CMSs that will still render a page, present a search, etc., and then the indefinite loop starts!).
All major browsers respect the base tag, so the web developers who implement this will see no issue, but Spidr will loop indefinitely.
On master:
url = URI.parse("https://www.promed-verbindet.de/jobs/kauffrau-kaufmann-m-w-d-im-gesundheitswesen"); response = Net::HTTP.get_response(url); page = Spidr::Page.new(url, response); page.to_absolute('privacy.html')
=> #<URI::HTTPS https://www.promed-verbindet.de/jobs/privacy.html>
On my fork:
url = URI.parse("https://www.promed-verbindet.de/jobs/kauffrau-kaufmann-m-w-d-im-gesundheitswesen"); response = Net::HTTP.get_response(url); page = Spidr::Page.new(url, response); page.to_absolute('privacy.html')
=> #<URI::HTTPS https://www.promed-verbindet.de/privacy.html>
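The difference between the two results can be reproduced without Spidr. The following is a rough sketch, not the code from the fork: `resolve_link` is a hypothetical helper, and the base href value is an assumption, since the page's actual `<base>` element is not quoted in this thread. It resolves the base href against the page URL first (a base href may itself be relative), then resolves the link against that.

```ruby
require 'uri'

# Hypothetical helper, not part of Spidr or the fork.
# base_href is the raw href of the page's <base> element, or nil if absent.
def resolve_link(page_url, base_href, link)
  page_uri = URI.parse(page_url)
  # A <base> href may itself be relative, so resolve it against the page URL first.
  base_uri = base_href.to_s.empty? ? page_uri : page_uri.merge(base_href)
  base_uri.merge(link)
end

page_url = 'https://www.promed-verbindet.de/jobs/kauffrau-kaufmann-m-w-d-im-gesundheitswesen'
# Assumed value; the page's actual <base> href is not quoted in this thread.
assumed_base = 'https://www.promed-verbindet.de/'

without_base = resolve_link(page_url, nil, 'privacy.html')
with_base    = resolve_link(page_url, assumed_base, 'privacy.html')

puts without_base # https://www.promed-verbindet.de/jobs/privacy.html
puts with_base    # https://www.promed-verbindet.de/privacy.html
```

Falling back to the page URL when no base href is present keeps the current behaviour for pages that do not use `<base>`.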