
Filter by dynamic base URL

Open safwank opened this issue 5 years ago • 3 comments

I'm trying to create a custom UrlFilter that lets through all URLs that start with the base URL passed to Crawler.crawl/2. I know I can use Registry or Agent to track this value, but is there a better way? The url_filter option only accepts a module but not a fun.

safwank avatar Feb 25 '19 04:02 safwank

Heya, if I understand your requirement correctly, you don't need to "track" the URLs. The filter is called on every URL. You pass in a module because the filter needs to conform to the spec/behaviour: https://github.com/fredwu/crawler/blob/f9426764c793a480b5f2cef45261d42ee85fc6a1/lib/crawler/fetcher/url_filter.ex
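For context, the linked behaviour boils down to a single callback. This is a sketch inferred from the filter implementations in this thread, not a copy of the real module (which lives at lib/crawler/fetcher/url_filter.ex):

```elixir
defmodule Crawler.Fetcher.UrlFilter.Spec do
  # Sketch only: the callback shape is inferred from the filters below.
  # `filter/2` receives the candidate URL and the crawl options, and
  # returns {:ok, true} to crawl the URL or {:ok, false} to skip it.
  @callback filter(url :: String.t(), opts :: map()) ::
              {:ok, boolean()} | {:error, term()}
end
```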

fredwu avatar Feb 25 '19 08:02 fredwu

I think I haven't done a good job of explaining the situation.

What I'm trying to do is create a custom filter that limits the URLs to a specific domain, e.g. example.com. I know I can do the following:

defmodule CustomFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec
  def filter("http://example.com" <> _, _opts), do: {:ok, true}
  def filter(_, _), do: {:ok, false}
end

Crawler.crawl("http://example.com", url_filter: CustomFilter)

But what if I want to crawl a different domain (some_domain) without hardcoding it? E.g.:

defmodule CustomFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  # `some_domain` is not in scope here -- that's the problem
  def filter(url, _opts) do
    if String.starts_with?(url, some_domain) do
      {:ok, true}
    else
      {:ok, false}
    end
  end
end

Crawler.crawl(some_domain, url_filter: CustomFilter)

Also, from what I've seen, the opts.url value that gets passed to filter/2 is the same as the first argument (url) of that function. I was expecting the former to match the URL originally passed to Crawler.crawl/2, but I was wrong.
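One possible workaround, assuming Crawler merges unrecognised user options into the opts map it hands to filter/2 (an assumption worth verifying against your Crawler version), is to pass the base URL in as a custom option. The :host_filter_base key below is a name made up for this sketch:

```elixir
defmodule DynamicFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  # ASSUMPTION: :host_filter_base is a custom option we pass to
  # Crawler.crawl/2 and expect to find again in `opts` here.
  def filter(url, opts) do
    case Map.get(opts, :host_filter_base) do
      nil  -> {:ok, false}
      base -> {:ok, String.starts_with?(url, base)}
    end
  end
end

# Usage sketch:
# base = "http://example.com"
# Crawler.crawl(base, url_filter: DynamicFilter, host_filter_base: base)
```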

I hope I'm making more sense now :).

safwank avatar Feb 25 '19 22:02 safwank

@safwank I needed something similar and found that this works well:

defmodule SameHostFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  def filter(url, opts) do
    res =
      cond do
        # Relative URLs have no host, so let them through
        String.starts_with?(url, ".") -> true
        String.starts_with?(url, "/") -> true
        # No referrer yet (i.e. the initial URL)
        Map.get(opts, :referrer_url) == nil -> true
        # Otherwise, only crawl URLs on the same host as the referrer
        true -> URI.parse(opts.referrer_url).host == URI.parse(url).host
      end

    {:ok, res}
  end
end

In my case, I only let through URLs that share the same host as the original Crawler.crawl/2 call.
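For reference, the host comparison in the last branch works because URI.parse/1 returns a %URI{} struct whose :host field holds the hostname of an absolute URL (and is nil for relative URLs, which is why those are handled separately above). A quick sketch:

```elixir
# Compare the hosts of two absolute URLs via URI.parse/1
same_host? = fn a, b -> URI.parse(a).host == URI.parse(b).host end

same_host?.("http://example.com/a", "http://example.com/b") # true
same_host?.("http://example.com/a", "http://other.com/")    # false
URI.parse("/relative/path").host                            # nil
```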

janajri avatar Apr 10 '19 03:04 janajri