[Help] Middlewares not being overridden on a Spider level
Hey 👋 ,
I'm having an issue while overriding settings on a spider level. According to the docs, doing something like:
@impl Crawly.Spider
def override_settings() do
  [
    middlewares: [CrawlyOverrides.Headers]
  ]
end
would be enough, but after inspecting the request in parse_item/1 (see the snippet below), the settings are not being overridden.
Am I missing something here?
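For reference, this is roughly how I'm inspecting the request (a minimal sketch; the HTTPoison.Response struct exposes the request it was built from):

# Sketch of the debugging parse_item/1: dump the underlying
# HTTPoison.Request to see which middlewares actually ran.
def parse_item(response) do
  IO.inspect(response.request, label: "request")
  %Crawly.ParsedItem{items: [], requests: []}
end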
Here's my mix.exs:
defmodule CrawlyOverrides.MixProject do
  use Mix.Project

  def project do
    [
      app: :crawly_overrides,
      version: "0.1.0",
      elixir: "~> 1.11",
      start_permanent: Mix.env() == :prod,
      deps: deps()
    ]
  end

  # Run "mix help compile.app" to learn about applications.
  def application do
    [
      extra_applications: [:logger],
      mod: {CrawlyOverrides.Application, []}
    ]
  end

  # Run "mix help deps" to learn about dependencies.
  defp deps do
    [
      {:crawly, "~> 0.11.0"},
      {:floki, "~> 0.26.0"}
    ]
  end
end
The config:
import Config

config :crawly,
  item: [:id],
  middlewares: [
    Crawly.Middlewares.AutoCookiesManager,
    {Crawly.Middlewares.UserAgent, user_agents: ["My Custom Bot"]}
  ],
  pipelines: []
And the middleware:
defmodule CrawlyOverrides.Headers do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(request, state) do
    # Append a cookie header to every outgoing request
    cookie = "new_cookie;"
    headers = request.headers ++ [cookie: cookie]
    new_request = Map.put(request, :headers, headers)
    {new_request, state}
  end
end
This is what I get with the middleware at the global level:
%HTTPoison.Request{
  body: "",
  headers: [{"User-Agent", "My Custom Bot"}, {:cookie, "new_cookie;"}],
  method: :get,
  options: [],
  params: %{},
  url: "https://www.homebase.co.uk/our-range/lighting-and-electrical/torches-and-nightlights/worklights"
}
And after moving it from config.exs to the spider:
%HTTPoison.Request{
  body: "",
  headers: [{"User-Agent", "My Custom Bot"}],
  method: :get,
  options: [],
  params: %{},
  url: "https://www.homebase.co.uk/our-range/tools"
}
Any help would be appreciated here, and awesome job on Crawly!
Hi @jfmlima, this is a bug; thanks for the report. It is caused by this line: https://github.com/oltarasenko/crawly/blob/8c8b3651559529bcb81ec1477ade18386f794f14/lib/crawly/request.ex#L57
@oltarasenko the request creation might need some refactoring. Since there is no spider reference being passed, we can't use Crawly.Utils.get_settings(:middlewares, MySpider, default_middlewares). I think adjusting the function args to accept a url + spider name + opts would be a good idea, roughly as sketched below.
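Something like this (a hypothetical sketch, not the current API; the default middleware list here is only illustrative):

# Hypothetical replacement for the request constructor: take the spider
# module alongside the URL so per-spider middleware overrides can be
# resolved at creation time.
def new(url, spider_name, opts \\ []) do
  default_middlewares = [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.RobotsTxt
  ]

  %Crawly.Request{
    url: url,
    headers: Keyword.get(opts, :headers, []),
    options: Keyword.get(opts, :options, []),
    middlewares: Crawly.Utils.get_settings(:middlewares, spider_name, default_middlewares)
  }
end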
Oh, indeed. So basically it is not possible to override middlewares because of that :(
I need to think more about the issue. For now, you should just use https://github.com/oltarasenko/crawly/blob/8c8b3651559529bcb81ec1477ade18386f794f14/lib/crawly/request.ex#L72 to create new requests.
I am not quite sure how to address it. It's possible to override middlewares on the request storage side: https://github.com/oltarasenko/crawly/blob/8c8b3651559529bcb81ec1477ade18386f794f14/lib/crawly/requests_storage/requests_storage_worker.ex#L66 (which also knows the spider name, so we can use the function @Ziinc referred to), roughly as sketched below. However, that means overriding middlewares the user manually added to a request, which limits the possibilities here. Then again, maybe that's fine: if override_settings is defined, the settings are overridden.
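Roughly what that storage-side override could look like (a hypothetical sketch; maybe_override_middlewares is a made-up helper name):

# Hypothetical helper inside the requests storage worker: it knows the
# spider name, so it can swap in the spider-level middlewares before
# the request is stored. Note this clobbers any middlewares the user
# attached to the request manually.
defp maybe_override_middlewares(request, spider_name) do
  middlewares = Crawly.Utils.get_settings(:middlewares, spider_name, request.middlewares)
  %Crawly.Request{request | middlewares: middlewares}
end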
Hey @Ziinc & @oltarasenko, thanks for the quick feedback!
@oltarasenko could you provide an example of how to manually create new requests, please? Or was that suggestion not meant for me?
Sorry, @jfmlima, I've managed to catch the flu.
Regarding manual creation of requests: the spider's parse_item/1 function is supposed to return new items and new requests; this process is shown in the tutorial here, for example: https://hexdocs.pm/crawly/0.11.0/tutorial.html#extracting-data-in-our-spider
So in this example case:
# Convert URLs into Requests
requests =
  urls
  |> Enum.uniq()
  |> Enum.map(&build_absolute_url/1)
  |> Enum.map(&Crawly.Utils.request_from_url/1)
You would need to replace this last `&Crawly.Utils.request_from_url/1` and create the requests manually with `fn url -> Crawly.Request.new(url, [], [], middlewares()) end`.
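Put together, something like this (a sketch; the middlewares/0 helper returning your desired middleware list is illustrative):

# Illustrative helper: the middleware list you would otherwise have put
# into override_settings/0.
defp middlewares do
  [
    Crawly.Middlewares.AutoCookiesManager,
    {Crawly.Middlewares.UserAgent, user_agents: ["My Custom Bot"]},
    CrawlyOverrides.Headers
  ]
end

# Inside parse_item/1: build requests with the middlewares attached
# explicitly, since the spider-level override is not applied.
requests =
  urls
  |> Enum.uniq()
  |> Enum.map(&build_absolute_url/1)
  |> Enum.map(fn url -> Crawly.Request.new(url, [], [], middlewares()) end)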
Hey @oltarasenko, sorry to hear that, hope you're feeling better!
I eventually figured it out and created my own utils with exactly what you suggested. However, it doesn't work for the first request, which is the most important one for continuing the scrape.
I've seen that passing requests in start_requests will also be allowed here: https://github.com/oltarasenko/crawly/commit/8c8b3651559529bcb81ec1477ade18386f794f14, so all good. I'll keep it global for now and adjust it for the new release!
Thanks for all your efforts here!
@jfmlima I am preparing the 0.12.0 rollout just now.
Terrific @oltarasenko, let me know if I can be of any help
@jfmlima Just done. I have tested it a bit, so it should work fine. Please let me know if something goes wrong, so I will prepare a bugfix.
Thanks buddy, will do!
@oltarasenko There is probably another issue with overriding middleware options. I have to use a proxy for one of the spiders. When I add Crawly.Middlewares.RequestOptions with proxy settings to the main config it works fine, but overriding on the spider level doesn't (roughly the override sketched below). In iex, calling Crawly.Utils.get_settings(:middlewares, FooSpider) shows the settings with the overrides, but a spider started with Crawly.Engine.start_spider(FooSpider) seems to completely ignore the proxy settings.
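For reference, the kind of override in question (a sketch; the proxy host and port are placeholders):

# Hypothetical per-spider override that gets ignored: spider-level
# middlewares are never applied, so the proxy options never take effect.
@impl Crawly.Spider
def override_settings() do
  [
    middlewares: [
      {Crawly.Middlewares.RequestOptions, [proxy: {"http://proxy.example.com", 8080}]}
    ]
  ]
end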
@spectator Ok, I see. Yes, as mentioned before, we do not override middlewares at all on the spider level; that's why your per-spider config is not taken into account here :(.
Could you just add a proxy to your requests when the spider creates them (in the parse_item/1 function), as sketched below?
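A minimal sketch of that approach (the proxy address is a placeholder):

# Attach the proxy as a request option when building each request,
# bypassing the ignored spider-level middleware override.
requests =
  urls
  |> Enum.map(fn url ->
    Crawly.Request.new(url, [], proxy: {"http://proxy.example.com", 8080})
  end)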
@oltarasenko but that won't change anything for the initial requests, will it? Is there a way to specify a proxy for those as well?
@spectator I think it will, as long as you define them as requests, as I did here, for example: https://oltarasenko.medium.com/web-scraping-with-elixir-and-crawly-extracting-data-behind-authentication-a52584e9cf13?sk=fa66930ce187204285fb43741a414979 (see the part where I do the login on start, and the sketch below).
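Something along these lines (a sketch, assuming your Crawly version supports start_requests in init/0; the URL and proxy are placeholders):

# Return ready-made requests, proxy options included, instead of bare
# start_urls, so even the very first request goes through the proxy.
@impl Crawly.Spider
def init() do
  request =
    Crawly.Request.new(
      "https://www.example.com/",
      [],
      proxy: {"http://proxy.example.com", 8080}
    )

  [start_requests: [request]]
end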
@oltarasenko appreciate the help! it did work indeed!
I tried this, but it didn't override the request options set in the config.exs file. Using {:crawly, "~> 0.13.0"}:
Crawly.Request.new(
  url,
  [],
  proxy: {"http://myproxy:@proxy.crawlera.com", 8011},
  hackney: [:insecure]
)
But the request still has these as its options:
options: [
  timeout: 30000,
  recv_timeout: 15000,
  hackney: [basic_auth: {"yaddayadda", ""}],
  follow_redirect: true
],
Any help appreciated