spider
Inefficient URL scanning
The function generate_next_urls scans every fetched page, loading each response body into memory and searching it for links regardless of its content type. That may not be a problem for small files, but it is inefficient and makes crawling unnecessarily slow. It would be wiser to limit URL scanning to HTML documents only, by checking the response for a Content-Type matching "text/html", e.g.:
def generate_next_urls(a_url, resp) #:nodoc:
  # Only scan responses that identify themselves as HTML; the header may
  # carry a charset suffix (e.g. "text/html; charset=utf-8") and may be
  # missing entirely, hence the .to_s guard.
  if resp["content-type"].to_s.match(/^text\/html/)
    web_page = resp.body.to_s
    # Prefer an explicit <base href="..."> if present, otherwise fall back
    # to the directory portion of the current URL.
    base_url = (web_page.scan(/base\s+href="(.*?)"/i).flatten +
                [a_url[0, a_url.rindex('/')]])[0]
    base_url = remove_trailing_slash(base_url)
    return web_page.scan(/href="(.*?)"/i).flatten.map { |link|
      begin
        parsed_link = Addressable::URI.parse(link)
        if parsed_link.fragment == '#'
          nil
        else
          construct_complete_url(base_url, link, parsed_link)
        end
      rescue
        nil
      end
    }.compact
  end
  # Ignore non-HTML pages.
  return []
end
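
For reference, a short standalone sketch of how that Content-Type test behaves on typical header values. The sample values and the .to_s guard against a missing header are illustrative assumptions, not code taken from the gem:

# Illustrative header values only; real headers often carry a charset
# suffix, and some responses omit the Content-Type header entirely (nil).
["text/html; charset=utf-8", "text/html", "application/pdf", nil].each do |content_type|
  is_html = !!content_type.to_s.match(/^text\/html/)
  puts "#{content_type.inspect} => #{is_html}"
end
# Prints:
#   "text/html; charset=utf-8" => true
#   "text/html" => true
#   "application/pdf" => false
#   nil => false

With the nil header converted via to_s, a response that lacks the header is simply treated as non-HTML instead of raising a NoMethodError.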