spidr
Following redirects
Howdy! Just wondering if I'm implementing this right. I need to follow redirects, and there doesn't seem to be an option to toggle that, so I tried implementing it this way. It seems to work, but I would like some feedback!
Spidr.site(@url, max_depth: 2, limit: 20) do |spider|
  spider.every_redirect_page do |page|
    spider.visit_hosts << URI.parse(page.location).host
    spider.enqueue page.location
  end
end
Seems to throw an error if the location is "index.html" or similar...
Is the error coming from spidr or your code example? page.location grabs the Location header, which may not always be absolute. Maybe try page.to_absolute(page.location)?
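To illustrate why a relative Location value can cause trouble, here is a minimal sketch using only Ruby's standard URI library. The URLs are placeholders, and this only approximates what resolving against the page's URL looks like; it is not Spidr's own code.

```ruby
require 'uri'

# A relative Location value such as "index.html" parses fine,
# but the resulting URI has no host component.
relative = URI.parse('index.html')
relative_host = relative.host

# Resolving it against the URL of the page that issued the redirect
# (a placeholder URL here) yields an absolute URI with a usable host.
page_url = URI.parse('http://example.com/docs/old.html')
absolute = URI.join(page_url, 'index.html')

# A Location value that is already absolute is returned unchanged.
already_absolute = URI.join(page_url, 'http://other.example/start')

puts relative_host.inspect
puts absolute
puts already_absolute
```

So appending URI.parse(page.location).host to visit_hosts would add nil whenever the header is relative.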
This should probably be added to the README.
Spidr should automatically follow redirects, so the above code is redundant. The Page#each_url method converts everything yielded by Page#each_link to an absolute URL. Page#each_link in turn calls Page#each_redirect, which checks for the Location header. If you manually use page.location, it may not be an absolute URL, so you'll need to call page.to_absolute(page.location).
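If you do want to handle page.location yourself, something along these lines might work. This is a rough sketch, not tested against a live crawl: absolute_redirect is a hypothetical helper name, and it assumes only that the page object responds to location and to_absolute as described in this thread.

```ruby
# Hypothetical helper: returns the redirect target of +page+ as an
# absolute URL, or nil when the response carried no Location header.
# Assumes +page+ responds to #location and #to_absolute.
def absolute_redirect(page)
  location = page.location
  return nil if location.nil? || location.empty?

  page.to_absolute(location)
end

# Possible use inside the crawl from the question (requires the spidr
# gem and network access, so it is left commented out here):
#
#   Spidr.site(@url, max_depth: 2, limit: 20) do |spider|
#     spider.every_redirect_page do |page|
#       if (target = absolute_redirect(page))
#         spider.visit_hosts << target.host
#         spider.enqueue(target)
#       end
#     end
#   end
```

Since Spidr already follows redirects on its own, this is only relevant when you need the redirect target for some other purpose.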
I might consider adding Page#redirect_urls or Page#location_urls, which would return absolute URLs for convenience.