Following redirects

Open ZackMattor opened this issue 7 years ago • 4 comments

Howdy! Just wondering if I'm implementing this right. I need to follow redirects, and there doesn't seem to be an option to toggle that, so I tried implementing it this way. It seems to work, but I would like some feedback!

Spidr.site(@url, max_depth: 2, limit: 20) do |spider|
  spider.every_redirect_page do |page|
    spider.visit_hosts << URI.parse(page.location).host
    spider.enqueue page.location
  end
end

ZackMattor, Nov 29 '16 20:11

Seems to throw an error if the location is "index.html" or similar...

ZackMattor, Nov 30 '16 21:11

Is the error coming from spidr or your code example? page.location grabs the Location header which may not always be absolute. Maybe try page.to_absolute(page.location)?
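To illustrate why a relative Location value such as "index.html" causes trouble, here is a minimal sketch using only Ruby's standard `uri` library. The helper `resolve_location` and the example.com URLs are hypothetical and are not part of Spidr; the sketch only shows the general idea of resolving a Location value against the URL of the page that returned it.

```ruby
require 'uri'

# Hypothetical helper (not a Spidr method): resolve a Location header
# value against the URL of the page that sent it.
def resolve_location(page_url, location)
  URI.join(page_url.to_s, location.to_s)
end

# A relative reference parses fine but carries no host,
# so calling .host on it yields nil.
relative = URI.parse('index.html')
puts relative.host.inspect

# Resolved against the page URL, the same value becomes absolute.
absolute = resolve_location('http://example.com/docs/old.html', 'index.html')
puts absolute
puts absolute.host
```

An already absolute Location value passes through `URI.join` unchanged, so the same helper can be applied to every redirect without checking first.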

postmodern, Dec 04 '16 06:12

Probably should add to README.

chamnap, Jun 29 '17 09:06

Spidr should automatically follow redirects, so the above code is redundant. The Page#each_url method converts everything yielded by Page#each_link to an absolute URL. Page#each_link in turn calls Page#each_redirect, which checks for the Location header. If you use page.location manually, it may not be an absolute URL, so you'll need to call page.to_absolute(page.location).
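The call chain described above can be modelled roughly as follows. This is a toy sketch and not Spidr's source: `ToyPage`, its constructor arguments, and the example URLs are all made up for illustration, and only the method names mirror the ones mentioned in this comment.

```ruby
require 'uri'

# Toy stand-in for a fetched page; NOT Spidr's actual Page class.
class ToyPage
  def initialize(url, location: nil, hrefs: [])
    @url = URI.parse(url)
    @location = location # raw Location header value, possibly relative
    @hrefs = hrefs       # raw link targets found in the page body
  end

  # Yields the raw Location header value, if there is one.
  def each_redirect
    yield @location if @location
  end

  # Yields every raw link: the redirect target first, then body links.
  def each_link(&block)
    each_redirect(&block)
    @hrefs.each(&block)
  end

  # Yields each raw link resolved against the page's own URL.
  def each_url
    each_link { |link| yield @url.merge(link) }
  end
end

page = ToyPage.new('http://example.com/docs/old.html', location: 'index.html')
page.each_url { |url| puts url }
```

Because resolution happens once in `each_url`, callers never see the raw relative value, which is why the manual `every_redirect_page` hook is only needed when working with `page.location` directly.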

I might consider adding Page#redirect_urls or Page#location_urls which would return absolute URLs for convenience.

postmodern, Jan 29 '22 02:01