ScalaWebscraper

Allow setting referrer in download request

Open matt-gardner opened this issue 11 years ago • 1 comment

Thanks for the tool, it's pretty useful. A nice addition would be the ability to set the referrer (and perhaps other headers, like the user-agent) in the HTTP request that's sent to download a particular site. Some sites don't function correctly without a correct referrer.

I'm pretty sure this just needs an additional line here that sets the referrer. I can try to do this and submit a pull request, but I'm pretty new to Scala and might handle things the wrong way: I haven't used implicits much, and this code uses them pretty heavily, so I'm not that confident in my ability to do this right.

matt-gardner, Sep 18 '14 20:09

That would indeed be a good addition, and it adds much-needed configurability. It's been a while since I wrote this code, and after rereading it I think I overused implicits a bit too much and added unneeded complexity. So a solution with implicits is not necessarily the "right" solution.

Moving the jsoup configuration to an overridable method should be enough.

class WebsiteScraper extends Scraper {

  // Jsoup.connect returns org.jsoup.Connection, so the hook takes that interface
  def download(jsoup: org.jsoup.Connection): org.jsoup.Connection = jsoup
    .userAgent("Mozilla")
    .followRedirects(true)
    .timeout(0)

  def downloadPage(pageUrl: String) = Future {
    new WebPage(new URL(pageUrl)) {
      doc = download(Jsoup.connect(pageUrl)).get
    }
  }
}

which can then be overridden:

class CustomWebsiteScraper extends WebsiteScraper {

  // Same defaults as the base class, plus a referrer
  override def download(jsoup: org.jsoup.Connection): org.jsoup.Connection = jsoup
    .userAgent("Mozilla")
    .followRedirects(true)
    .referrer("Referrer")
    .timeout(0)
}

and then used in a spider

new Spider {

  override implicit val scraper = new CustomWebsiteScraper

  onReceivedPage ::= { page: WebPage =>
    // Page received
  }

}.start()
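
The override above repeats the default settings, so a possible variant is to call super and add only the referrer. The following is a minimal sketch of that idea, not the library's actual code: `DefaultSettings`, `ReferrerSettings` and `ReferrerSketch` are hypothetical names that stand in for the scraper classes, the URLs are placeholders, and jsoup is assumed to be on the classpath. It only builds the connection and inspects the configured request, so nothing is sent over the network.

```scala
import org.jsoup.{Connection, Jsoup}

// Hypothetical stand-in for WebsiteScraper: only the connection setup is modelled.
class DefaultSettings {
  // Defaults copied from the suggestion above.
  def download(conn: Connection): Connection = conn
    .userAgent("Mozilla")
    .followRedirects(true)
    .timeout(0) // jsoup treats 0 as "no timeout"
}

// Hypothetical subclass: keeps the defaults via super and adds a referrer.
class ReferrerSettings(referrer: String) extends DefaultSettings {
  override def download(conn: Connection): Connection =
    super.download(conn).referrer(referrer)
}

object ReferrerSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder URLs; Jsoup.connect only prepares the request here.
    val conn = new ReferrerSettings("https://example.com/")
      .download(Jsoup.connect("https://example.org/page"))

    // Inspect the prepared request instead of executing it.
    val req = conn.request()
    println(req.header("Referer"))
    println(req.header("User-Agent"))
  }
}
```

Taking the referrer as a constructor parameter is a design choice for this sketch; it lets one subclass serve several sites instead of hard-coding the value in the override.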

This is just a suggestion, and I would love to hear your ideas.

Rovak, Sep 19 '14 17:09