
General purpose link extractors

oltarasenko opened this issue 4 years ago • 9 comments

One of the problems I keep running into is the need to extract new URLs, and I am looking for a way to simplify it for myself and other people as well.

I am thinking of writing code which will:

  1. take a page body,
  2. extract all links from it,
  3. filter the extracted links by a list of patterns provided by us.

For the end user it would mean: I want my crawler to follow every link containing "/blog" or "/product" on a given website, so you don't have to write request extractors (which is time-consuming).

Of course, I understand that extracting all links from a page and then filtering them is not ideal from a performance point of view. However, I would still like to have a helper like this.
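
For illustration, a minimal sketch of such a helper, assuming Floki is used as the parser. The module name LinkExtractor, the function name, and the pattern format are made up for this example, not an existing Crawly API:

    defmodule LinkExtractor do
      # Take a page body, pull out all <a href> values and keep only those
      # matching any of the given patterns.
      def extract_links(body, patterns) do
        {:ok, document} = Floki.parse_document(body)

        document
        |> Floki.find("a")
        |> Floki.attribute("href")
        |> Enum.filter(fn url -> Enum.any?(patterns, &String.contains?(url, &1)) end)
        |> Enum.uniq()
      end
    end

    # Usage in a spider's parse_item/1:
    # LinkExtractor.extract_links(response.body, ["/blog", "/product"])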

Problems:

  • If we want to build it, we need to bring Floki back into the project, which I would like to avoid if possible.
  • It will be hard to switch from Floki to Meeseeks later if our helpers contain Floki-based code.

Any advice?

oltarasenko · Nov 10 '20 12:11

@Ziinc in general I am quite close to the idea of bringing Floki back here... It would simplify these concerns quite a bit. On the one hand, I want to stay independent of a specific parser; on the other hand, we could write quite a few pre-defined things:

  1. Automatic login form handling
  2. Automatic new link extraction
  3. Maybe automatic item extraction, etc.

oltarasenko · Nov 11 '20 17:11

We can use the dependency injection pattern to avoid adding a specific html parser as a dep.

On the dev side, we set Floki/Meeseeks as a dev dependency, and on the user side, the user has to define a module implementing the required callbacks.

For example, if the user wants automatic link extraction using glob patterns, we can construct an XPath expression from the given glob pattern on the Crawly side and pass the page body and the final XPath to the ParserInterface.list_xpath/2 callback; the user sets the reference to their ParserInterface implementation in the config.
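
For illustration, the glob-to-XPath translation could look roughly like this (parse_glob_pattern/1 is a hypothetical helper, not an existing Crawly function):

    defmodule GlobToXPath do
      # "/products/*" becomes "//a[starts-with(@href, '/products/')]"
      def parse_glob_pattern(glob) do
        prefix = String.trim_trailing(glob, "*")
        "//a[starts-with(@href, '#{prefix}')]"
      end
    end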

If much work is going to be done on these magic features, I think defining a protocol, like how Plug.Conn does it, would give tremendous benefits.
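
A minimal sketch of what such an interface could look like as a behaviour; the module name Crawly.Parser and the callback signatures are assumptions taken from the examples further down the thread:

    defmodule Crawly.Parser do
      # The user points Crawly at their implementation in config,
      # e.g. `config :crawly, parser: MyHtmlParser`.
      @callback list_xpath(body :: binary(), xpath :: binary()) :: [term()]
      @callback find_json(body :: binary()) :: term()
    end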

Ziinc · Nov 12 '20 01:11

Yes, I was thinking about it. It looks like it requires quite a lot of work to have adapters for the two parsers we have now (their APIs differ in function names, XPath support, etc.), so it is a fair bit of work. And we would still need to add one of the backends.

I can play with something like Code.ensure_loaded?(Floki) to either allow using a parser or raise an exception. However, I don't see the benefit compared to just including Floki in the list of deps.

oltarasenko · Nov 12 '20 09:11

The onus of managing the HTML parsing dep should be on the end user, as maintaining adapters for both libraries would be too much work on our side and too restrictive on the user side.

If we go with user-defined adapters, we won't have to manage conditional dep compilation, which seemed quite tricky and troublesome when I did a forum search. It also makes these helpers an opt-in feature, which many people might not even use.
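
For reference, Mix already supports marking deps as optional, which is the usual way to declare such opt-in integrations; the snippet below is a sketch and the version numbers are placeholders:

    # In Crawly's mix.exs: neither parser is forced on the user,
    # but either can be used if the user's app already depends on it.
    defp deps do
      [
        {:floki, "~> 0.29", optional: true},
        {:meeseeks, "~> 0.15", optional: true}
      ]
    end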

Ziinc · Nov 12 '20 09:11

Sorry, I didn't quite understand you.

oltarasenko · Nov 12 '20 10:11

I see three possible ways to implement such helpers:

1. Through a user-defined parsing interface that implements required parsing callbacks

 # User's config

     config :crawly,
         parser: MyHtmlParser
         # ...

 # Crawly source code
 defmodule Crawly.Extractors do
   def extract_urls(body, glob) do
     # Crawly has no html parser dependency: obtain the user-defined module
     # configured via `parser: MyHtmlParser` and let it extract the urls
     parser = Crawly.Utils.get_settings(:parser)
     xpath = parse_glob_pattern(glob)

     # call the callback at runtime via apply/3, then turn the result into requests
     apply(parser, :list_xpath, [body, xpath])
     |> build_requests()
   end
 end

 # User's source code
 # User's app depends on Floki/Meeseeks and Jason/Poison
 defmodule MyHtmlParser do
   @behaviour Crawly.Parser

   @impl true
   def list_xpath(body, xpath) do
     # simplified: Floki's find/2 takes CSS selectors, so a real adapter would
     # translate the XPath or use Meeseeks, which supports XPath natively
     body
     |> Floki.parse_document!()
     |> Floki.find(xpath)
   end

   @impl true
   def find_json(body), do: Jason.decode!(body)
 end

 # User's spider
 defmodule MySpider do
   def parse_item(response) do
     requests = Crawly.Extractors.extract_urls(response.body, "/products/*")
     [requests: requests]
   end
 end

Pros:

  • most freedom
  • less maintenance

Cons:

  • user has to implement the callbacks that they want (but is it really a drawback though? more control over parsing process)

2. Through a Crawly-defined parsing interface that uses a Crawly-decided html parser

 # Crawly source code
 defmodule Crawly.Extractors do
   def extract_urls(body, glob) do
     # Crawly has a dependency on Floki and uses it directly.
     # Simplified with an example function: find_urls_from_given_pattern/2 is illustrative
     xpath = parse_glob_pattern(glob)

     body
     |> Floki.find_urls_from_given_pattern(xpath)
     |> build_requests()
   end
 end

 # User's spider
 defmodule MySpider do
   def parse_item(response) do
     requests = Crawly.Extractors.extract_urls(response.body, "/products/*")
     [requests: requests]
   end
 end

Cons:

  • stuck with one html parser; the user has no option to override it with a different one (e.g. Meeseeks)

3. Through a Crawly-defined parsing interface that uses a user-decided html parser

    config :crawly,
        html_parser: Meeseeks,
        json_parser: Jason
        # ...

 # Crawly source code
 defmodule Crawly.Extractors do
   require Logger

   def extract_urls(body, glob) do
     # Crawly depends on neither Meeseeks nor Floki, so check which html parser is available
     xpath = parse_glob_pattern(glob)

     # this may not actually compile on the user side, since they might not have Floki/Meeseeks;
     # some conditional compilation magic would be needed to ensure the code can compile
     cond do
       Code.ensure_loaded?(Floki) ->
         # floki specific code (find_urls_from_given_pattern/2 is illustrative)
         Floki.find_urls_from_given_pattern(body, xpath)

       Code.ensure_loaded?(Meeseeks) ->
         # meeseeks specific code
         Meeseeks.find_urls_from_given_pattern(body, xpath)

       true ->
         Logger.error("No supported html parser is provided and compiled.")
         []
     end
     |> build_requests()
   end
 end

 # User's spider
 defmodule MySpider do
   def parse_item(response) do
     requests = Crawly.Extractors.extract_urls(response.body, "/products/*")
     [requests: requests]
   end
 end

Cons:

  • need to maintain library-specific code = more work for us
  • conditional compilation required

Ziinc · Nov 12 '20 14:11

In my replies, I was talking about why option 1 is preferable compared to options 2 and 3.

Ziinc · Nov 12 '20 14:11

Heh :(.

Actually, I don't want to force people to write any extra code, e.g. adapters or anything like that. In any case, the conversation was quite useful, and I think I will follow the hybrid idea.

So I see it done like this:

if Code.ensure_loaded?(Floki) do
  do_extract_urls(page)
else
  Logger.error("The general purpose extractor relies on Floki")
end

I think it will be quite simple to start with. Then we can play a bit more with the idea of having builtin parsers.

oltarasenko · Nov 12 '20 19:11

No issues with the hybrid approach; it is what quite a few frameworks use for handling JSON parsing (Phoenix, for example, off the top of my head).

I only worry about maintenance, like the what-if scenario where there are breaking API changes in a library between versions. Then we'd have to maintain two different pieces of code for one library, plus check the API version to know which piece of code to use.

Ziinc · Nov 13 '20 01:11

I will close this one, as it's been open for years, and no one has had time or need to lead the work.

oltarasenko · Apr 09 '24 10:04