spider
Inefficient URL scanning
The function generate_next_urls scans every fetched page, loading each response body into memory and searching it for links regardless of its content type. That may not be a problem for small files, but it is inefficient and makes crawling unnecessarily slow. It would be wiser to limit URL scanning to HTML documents only, by checking the response for a Content-Type matching "text/html", e.g.:
def generate_next_urls(a_url, resp) #:nodoc:
  # Only scan responses that identify themselves as HTML; the header may
  # carry a charset suffix (e.g. "text/html; charset=utf-8") and may be
  # missing entirely, hence the .to_s guard.
  if resp["content-type"].to_s.match(/^text\/html/)
    web_page = resp.body.to_s
    # Prefer an explicit <base href="..."> if present, otherwise fall back
    # to the directory portion of the current URL.
    base_url = (web_page.scan(/base\s+href="(.*?)"/i).flatten +
                [a_url[0, a_url.rindex('/')]])[0]
    base_url = remove_trailing_slash(base_url)
    return web_page.scan(/href="(.*?)"/i).flatten.map { |link|
      begin
        parsed_link = Addressable::URI.parse(link)
        if parsed_link.fragment == '#'
          nil
        else
          construct_complete_url(base_url, link, parsed_link)
        end
      rescue
        nil
      end
    }.compact
  end
  # Ignore non-HTML pages.
  return []
end
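
For reference, a short standalone sketch of how that Content-Type test behaves on typical header values. The sample values and the .to_s guard against a missing header are illustrative assumptions, not code taken from the gem:

# Illustrative header values only; real headers often carry a charset
# suffix, and some responses omit the Content-Type header entirely (nil).
["text/html; charset=utf-8", "text/html", "application/pdf", nil].each do |content_type|
  is_html = !!content_type.to_s.match(/^text\/html/)
  puts "#{content_type.inspect} => #{is_html}"
end
# Prints:
#   "text/html; charset=utf-8" => true
#   "text/html" => true
#   "application/pdf" => false
#   nil => false

With the nil header converted via to_s, a response that lacks the header is simply treated as non-HTML instead of raising a NoMethodError.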