
[Help] Middlewares not being overridden on a Spider level

Open jfmlima opened this issue 4 years ago • 16 comments

Hey 👋 ,

I'm having an issue overriding settings at the spider level. According to the docs, doing something like:

@impl Crawly.Spider
def override_settings() do
  [
    middlewares: [CrawlyOverrides.Headers]
  ]
end

would be enough, but after debugging the request in parse_item, the settings are not being overridden.

Am I missing something here?

Here's my mix.exs:

defmodule CrawlyOverrides.MixProject do
  use Mix.Project

  def project do
    [
      app: :crawly_overrides,
      version: "0.1.0",
      elixir: "~> 1.11",
      start_permanent: Mix.env() == :prod,
      deps: deps()
    ]
  end

  # Run "mix help compile.app" to learn about applications.
  def application do
    [
      extra_applications: [:logger],
      mod: {CrawlyOverrides.Application, []}
    ]
  end

  # Run "mix help deps" to learn about dependencies.
  defp deps do
    [
      {:crawly, "~> 0.11.0"},
      {:floki, "~> 0.26.0"}
    ]
  end
end

The Config:

import Config

config :crawly,
  item: [:id],
  middlewares: [
    Crawly.Middlewares.AutoCookiesManager,
    {Crawly.Middlewares.UserAgent, user_agents: ["My Custom Bot"]}
  ],
  pipelines: []

And Middleware:

defmodule CrawlyOverrides.Headers do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(request, state) do
    cookie = "new_cookie;"
    headers = request.headers ++ [cookie: cookie]

    new_request = Map.put(request, :headers, headers)

    {new_request, state}
  end
end

This is what I get with the middleware on a global level:

%HTTPoison.Request{
  body: "",
  headers: [{"User-Agent", "My Custom Bot"}, {:cookie, "new_cookie;"}],
  method: :get,
  options: [],
  params: %{},
  url: "https://www.homebase.co.uk/our-range/lighting-and-electrical/torches-and-nightlights/worklights"
}

And by moving it from config.exs to the Spider:

%HTTPoison.Request{
  body: "",
  headers: [{"User-Agent", "My Custom Bot"}],
  method: :get,
  options: [],
  params: %{},
  url: "https://www.homebase.co.uk/our-range/tools"
}

Any help would be appreciated here, and awesome job on Crawly!

jfmlima avatar Nov 13 '20 01:11 jfmlima

@jfmlima Hi, this is a bug, thanks for the report. It is caused by this line: https://github.com/oltarasenko/crawly/blob/8c8b3651559529bcb81ec1477ade18386f794f14/lib/crawly/request.ex#L57

@oltarasenko the request creation might need some refactoring. Since no spider reference is passed in, we can't use Crawly.Utils.get_settings(:middlewares, MySpider, default_middlewares). I think adjusting the function args to accept url + spider name + opts would be a good approach.
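
Something along these lines (an untested sketch of the idea only; new/3 taking a spider name is a hypothetical signature, not the current API, and default_middlewares() stands in for whatever default list request.ex uses):

# Hypothetical constructor sketch: accept the spider name so per-spider
# middleware overrides can be resolved at request creation time.
def new(url, spider_name, opts \\ []) do
  middlewares =
    Crawly.Utils.get_settings(:middlewares, spider_name, default_middlewares())

  %Crawly.Request{
    url: url,
    headers: Keyword.get(opts, :headers, []),
    options: Keyword.get(opts, :options, []),
    middlewares: middlewares
  }
end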

Ziinc avatar Nov 13 '20 02:11 Ziinc

Oh, indeed. So basically it's not possible to override middlewares because of this :(

oltarasenko avatar Nov 13 '20 07:11 oltarasenko

I need to think more about the issue. For now, you should just use https://github.com/oltarasenko/crawly/blob/8c8b3651559529bcb81ec1477ade18386f794f14/lib/crawly/request.ex#L72 to create new requests.

I am not quite sure how to address it. It's possible to override middlewares on the request storage side: https://github.com/oltarasenko/crawly/blob/8c8b3651559529bcb81ec1477ade18386f794f14/lib/crawly/requests_storage/requests_storage_worker.ex#L66 (the storage worker also knows the spider name, so we could use the function @Ziinc referred to). However, that means we'd be overriding middlewares a user manually added to a request, which limits the possibilities here. Then again, maybe that's fine: if override_settings is defined, the settings get overridden.
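
For concreteness, the storage-side override could look roughly like this (a sketch only; apply_spider_middlewares/2 is a hypothetical helper, not actual Crawly code):

# Hypothetical helper inside the requests storage worker, which knows the
# spider name. Note the trade-off discussed above: this clobbers any
# middlewares that were manually attached to the request.
defp apply_spider_middlewares(request, spider_name) do
  middlewares =
    Crawly.Utils.get_settings(:middlewares, spider_name, request.middlewares)

  %Crawly.Request{request | middlewares: middlewares}
end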

oltarasenko avatar Nov 13 '20 08:11 oltarasenko

Hey @Ziinc & @oltarasenko, thanks for the quick feedback!

@oltarasenko can you provide some examples of how to manually create new requests, please? Or was that suggestion not meant for me?

jfmlima avatar Nov 13 '20 16:11 jfmlima

Sorry, @jfmlima, I've managed to catch the flu.

Regarding manual creation of requests: a spider's parse_item/1 function is supposed to return new items and new requests. This process is shown in the tutorial here: https://hexdocs.pm/crawly/0.11.0/tutorial.html#extracting-data-in-our-spider

So in this example case:

    # Convert URLs into Requests
    requests =
      urls
      |> Enum.uniq()
      |> Enum.map(&build_absolute_url/1)
      |> Enum.map(&Crawly.Utils.request_from_url/1)

You would need to replace that last `&Crawly.Utils.request_from_url/1` and create the requests manually with: `fn url -> Crawly.Request.new(url, [], [], middlewares()) end`
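
Put together, the snippet from the tutorial would become something like this (a sketch, assuming middlewares/0 is a function in your spider that returns the per-spider middleware list):

    # Convert URLs into Requests, attaching the spider's own middleware
    # list instead of relying on the global default.
    requests =
      urls
      |> Enum.uniq()
      |> Enum.map(&build_absolute_url/1)
      |> Enum.map(fn url -> Crawly.Request.new(url, [], [], middlewares()) end)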

oltarasenko avatar Nov 16 '20 18:11 oltarasenko

Hey @oltarasenko, sorry to hear that, hope you're feeling better!

I eventually figured it out and created my own utils with exactly what you suggested. However, it doesn't work for the first request, which is the most important one for continuing the scrape. I've seen that it will also be possible to pass requests via start_requests here: https://github.com/oltarasenko/crawly/commit/8c8b3651559529bcb81ec1477ade18386f794f14, so all good. I'll keep it global for now and adjust it for the new release! Thanks for all your efforts here!

jfmlima avatar Nov 16 '20 18:11 jfmlima

@jfmlima I am preparing the 0.12.0 rollout just now.

oltarasenko avatar Nov 16 '20 19:11 oltarasenko

Terrific @oltarasenko, let me know if I can be of any help

jfmlima avatar Nov 16 '20 19:11 jfmlima

@jfmlima Just done. I have tested it a bit, so it should work fine. Please let me know if something goes wrong so I can prepare a bugfix.

oltarasenko avatar Nov 16 '20 20:11 oltarasenko

Thanks buddy, will do!

jfmlima avatar Nov 16 '20 20:11 jfmlima

@oltarasenko There is probably another issue with overriding middleware options. I have to use a proxy for one of my spiders. When I add Crawly.Middlewares.RequestOptions with proxy settings to the main config it works fine, but overriding at the spider level doesn't. In iex, calling Crawly.Utils.get_settings(:middlewares, FooSpider) shows the settings with the overrides, but a spider started with Crawly.Engine.start_spider(FooSpider) seems to completely ignore the proxy settings.

spectator avatar Jan 29 '21 20:01 spectator

@spectator Ok, I see. Yes, as mentioned before, we don't override middlewares at all on the spider level; that's why your per-spider config is not taken into account here :(.

Could you just add a proxy to your requests when requests are created by the spider (in the parse_item function)?
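
For example, something along these lines in parse_item/1 (a sketch; the proxy host and port are placeholders):

# Attach HTTPoison proxy options to every request this spider creates,
# so the proxy applies without touching the global config.
options = [proxy: {"http://proxy.example.com", 8080}]

requests =
  urls
  |> Enum.map(fn url -> Crawly.Request.new(url, [], options) end)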

oltarasenko avatar Jan 29 '21 20:01 oltarasenko

@oltarasenko but that won't change anything for the initial requests, would it? Is there a way to specify a proxy for those as well?

spectator avatar Jan 29 '21 20:01 spectator

@spectator I think it will, as long as you define them as requests, for example as I did here: https://oltarasenko.medium.com/web-scraping-with-elixir-and-crawly-extracting-data-behind-authentication-a52584e9cf13?sk=fa66930ce187204285fb43741a414979 See the part where I do the login on start.
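
In other words, build the initial requests yourself in init/0 instead of returning bare URLs (a sketch, assuming a Crawly version that supports start_requests; the URL and proxy values are placeholders):

@impl Crawly.Spider
def init() do
  options = [proxy: {"http://proxy.example.com", 8080}]

  [
    start_requests: [
      Crawly.Request.new("https://www.example.com/login", [], options)
    ]
  ]
end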

oltarasenko avatar Jan 29 '21 20:01 oltarasenko

@oltarasenko appreciate the help! it did work indeed!

spectator avatar Jan 29 '21 21:01 spectator

I tried this, but it didn't override the request options set in the config.exs file. I'm using {:crawly, "~> 0.13.0"}.

Crawly.Request.new(
  url,
  [],
  proxy: {"http://myproxy:@proxy.crawlera.com", 8011},
  hackney: [:insecure]
)

But the request still has these as its options:

options: [
  timeout: 30000,
  recv_timeout: 15000,
  hackney: [basic_auth: {"yaddayadda", ""}],
  follow_redirect: true
],

Any help appreciated

bapti avatar Nov 09 '21 16:11 bapti