Adding the original request to the parse_item callback

Open nuno84 opened this issue 3 years ago • 3 comments

Based on the discussion I opened: discussion. I want to crawl, let's say, 100 websites, with the item parse elements (Floki query elements) set on a webpage so the user can fine-tune them. I found this article by Oleg useful: article. My idea is to write a generic HTTP spider and, inside it, use that info to parse the data based on the request's custom_data. I think this pull request is useful, and I tried to make it backward compatible. So now you can do both:

  def parse_item(response), do: ...

OR

  def parse_item(response, request = %Crawly.Request{custom_data: req_data}), do: ...

The apply function is now:

  defp do_parse(nil, spider_name, response, request) do
    if :erlang.function_exported(spider_name, :parse_item, 2) do
      spider_name.parse_item(response, request)
    else
      spider_name.parse_item(response) # This is for backward compatibility
    end
  end
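For readers who want to try the arity-based fallback in isolation, here is a minimal sketch that does not depend on Crawly at all. The modules LegacySpider, RequestAwareSpider and Dispatch are hypothetical stand-ins, and the request is a plain map with a custom_data key rather than a real Crawly.Request struct. It uses Kernel.function_exported?/3, which is the Elixir wrapper for the :erlang.function_exported/3 call above.

```elixir
# Hypothetical stand-ins for spiders; neither module is part of Crawly.
defmodule LegacySpider do
  # Old-style callback: receives only the response.
  def parse_item(response), do: {:legacy, response}
end

defmodule RequestAwareSpider do
  # New-style callback: also receives the originating request.
  def parse_item(response, request) do
    {:request_aware, response, request.custom_data}
  end
end

defmodule Dispatch do
  # Calls parse_item/2 when the spider exports it, otherwise parse_item/1.
  def do_parse(spider, response, request) do
    # function_exported?/3 returns false for a module that is not loaded yet,
    # so the module is loaded explicitly before checking.
    Code.ensure_loaded(spider)

    if function_exported?(spider, :parse_item, 2) do
      spider.parse_item(response, request)
    else
      # Backward-compatible path for spiders that only define parse_item/1.
      spider.parse_item(response)
    end
  end
end

# A plain map stands in for the request here (assumption for illustration).
request = %{custom_data: %{title_selector: "h1"}}

IO.inspect(Dispatch.do_parse(LegacySpider, "<html/>", request))
IO.inspect(Dispatch.do_parse(RequestAwareSpider, "<html/>", request))
```

The explicit Code.ensure_loaded/1 call is a design choice in this sketch, not something taken from the PR: without it, a spider module that has not been loaded yet would silently fall through to the one-argument path.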

The tests are passing on my computer. Please take a look. I can add some documentation if this idea goes forward. Thank you

nuno84 avatar Sep 14 '22 06:09 nuno84

Super cool to see this, I was just about to go implement this myself after having a use-case for it (crawling image files for which the metadata is located on the previously-crawled page). I will see how well this works for me and take a look at the failing tests too. Unfortunate that development seems to be stalled on this project.

starcraft66 avatar Nov 24 '22 19:11 starcraft66

@nuno84 The PR is appreciated. The entire Request gets copied to the Response struct, so adding a second argument to the parse_item callback is unnecessary.

I would suggest using a metadata or meta key on the Request struct instead of custom_data, as that name is more semantically correct.

Ziinc avatar Nov 26 '22 18:11 Ziinc

> The entire Request gets copied to the Response struct

@Ziinc The HTTPoison.Response.t() contains the original HTTPoison.Request.t() struct, but I don't think that helps in the context of this PR, because the HTTPoison.Request.t() will not contain any of the metadata stored in the Crawly.Request.t() wrapping the response.

starcraft66 avatar Feb 02 '23 16:02 starcraft66