
Reconsider crawler inheritance

Open janbuchar opened this issue 1 year ago • 5 comments

Currently, we have the following inheritance chains:

  • BasicCrawler -> HttpCrawler
  • BasicCrawler -> BeautifulSoupCrawler
  • BasicCrawler -> PlaywrightCrawler
  • BasicCrawler -> ParselCrawler (#348)

This is an intentional difference from the JS version, where

  • BrowserCrawler is a common ancestor of PlaywrightCrawler and PuppeteerCrawler
    • this is not relevant in the Python ecosystem - we won't implement support for anything similar to Playwright anytime soon
  • CheerioCrawler and JSDomCrawler inherit from HttpCrawler
    • this is the important difference
    • We decided to do this differently to avoid inheritance chains, which make it harder to track down the code that is actually being executed. The cost is a bit of code duplication.
    • In the Python version, we also have the HttpClient abstraction, and most of the HTTP-handling logic is contained there

We might want to reconsider this because

  • New HTML parsers are being added as we speak
    • This might make the code duplication too costly to maintain
  • For #249, we would like to have a "parse the current HTML" helper that works with all supported HTML parsers, not just beautifulsoup, for instance

The possible ways out are

  1. Leave it as it is now
  2. Parametrize HttpCrawler with an HTML parser
     • this would make BeautifulSoupCrawler and ParselCrawler very thin - they would just pass the right HttpClient and HtmlParser to HttpCrawler
     • we may want to consider moving the send_request context helper from BasicCrawlingContext to HttpCrawlingContext
  3. Remove HttpCrawler altogether and pull its functionality into BasicCrawler
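
Option 2 could be sketched roughly as follows. All names here (HtmlParser, the generic parameter, DictParser) are illustrative placeholders rather than the actual crawlee-python API; the point is just that the parser backend becomes a type parameter of the crawler:

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Generic, TypeVar

TParseResult = TypeVar('TParseResult')


class HtmlParser(ABC, Generic[TParseResult]):
    """Abstraction over an HTML parsing backend (hypothetical)."""

    @abstractmethod
    def parse(self, html: str) -> TParseResult: ...


class HttpCrawler(Generic[TParseResult]):
    """Generic crawler that delegates HTML parsing to a pluggable parser."""

    def __init__(self, *, parser: HtmlParser[TParseResult]) -> None:
        self._parser = parser

    def parse_html(self, html: str) -> TParseResult:
        return self._parser.parse(html)


class DictParser(HtmlParser[dict]):
    """Stand-in for e.g. a BeautifulSoup-backed parser."""

    def parse(self, html: str) -> dict:
        return {'html': html}


class DictCrawler(HttpCrawler[dict]):
    """A 'very thin' concrete crawler: all it does is pick the parser."""

    def __init__(self) -> None:
        super().__init__(parser=DictParser())
```

A BeautifulSoupCrawler-style subclass would then only fix the type parameter and supply the default parser, which keeps the hierarchy shallow while removing the duplicated HTTP-handling code.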

janbuchar avatar Jul 23 '24 21:07 janbuchar

For https://github.com/apify/crawlee-python/issues/249, we would like to have a "parse the current HTML" helper that works with all supported HTML parsers, not just beautifulsoup, for instance

I wanted to have something like parseWithCheerio here too, not just for the adaptive crawler. But if we allow different parsers in it, I am not sure how portable the code will be, since the parser dictates the return type of such a helper, right?

Parametrize HttpCrawler with an HTML parser

I like this one. But I would say we want some sane default, be it beautifulsoup or anything else. I guess it depends on how heavy that dependency is, since the default would have to be always installed - that's how we do it in the JS version, where cheerio is not an optional dependency.

we may want to consider moving the send_request context helper from BasicCrawlingContext to HttpCrawlingContext

Similarly to the default parser, we should have a default request client, and if we have one, it feels OK to keep that helper in the basic crawler.

B4nan avatar Jul 24 '24 13:07 B4nan

For #249, we would like to have a "parse the current HTML" helper that works with all supported HTML parsers, not just beautifulsoup, for instance

I wanted to have something like parseWithCheerio here too, not just for the adaptive crawler. But if we allow different parsers in it, I am not sure how portable the code will be, since the parser dictates the return type of such a helper, right?

Yeah, the crawler class would have to be generic over the return type of the HTML parser, if that makes any sense. And it wouldn't provide the same "pluggability" that we intend to have for HTTP clients... which is probably fine.
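
To illustrate the portability concern, here is a sketch of a crawling context that is generic over the parser's result type. The names (HttpCrawlingContext, parsed_content) are hypothetical; the takeaway is that a request handler written against one result type would not type-check against a context produced by a different parser:

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

TParseResult = TypeVar('TParseResult')


@dataclass
class HttpCrawlingContext(Generic[TParseResult]):
    """Hypothetical crawling context carrying the parsed page."""

    url: str
    parsed_content: TParseResult


def handle_dict_page(ctx: HttpCrawlingContext[dict]) -> str:
    # This handler is bound to a dict-producing parser; against a context
    # parametrized with another result type it would fail type checking.
    return ctx.parsed_content['title']
```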

Parametrize HttpCrawler with an HTML parser

I like this one. But I would say we want some sane default, be it beautifulsoup or anything else. I guess it depends on how heavy that dependency is, since the default would have to be always installed - that's how we do it in the JS version, where cheerio is not an optional dependency.

I imagine we could either keep BeautifulSoupCrawler (though it would not contain much logic) and have HttpCrawler ship with no HTML parser by default (it could throw an error when somebody attempts to parse HTML), or we could pick one parser as the default and check for its dependencies on instantiation (we already do a similar thing when importing BeautifulSoupCrawler and PlaywrightCrawler).
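
The dependency check on instantiation could look something like this sketch. The helper name, extras name, and error message are made up, but the importlib.util.find_spec pattern is the standard way to test for an optional dependency without importing it:

```python
import importlib.util


def check_parser_dependency(module_name: str, extra: str) -> None:
    """Raise an actionable error if an optional dependency is missing."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f'{module_name!r} is not installed. '
            f"Run 'pip install crawlee[{extra}]' to use this parser."
        )


class DefaultHtmlParser:
    """A parser that verifies its backend is importable when instantiated."""

    def __init__(self) -> None:
        # Fail fast with a helpful message instead of a bare ImportError
        # deep inside the first parse call.
        check_parser_dependency('bs4', 'beautifulsoup')
```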

we may want to consider moving the send_request context helper from BasicCrawlingContext to HttpCrawlingContext

Similarly to the default parser, we should have a default request client, and if we have one, it feels OK to keep that helper in the basic crawler.

Yeah, currently HttpxClient is the default.

janbuchar avatar Jul 24 '24 14:07 janbuchar

I made a gist to illustrate a possible new inheritance hierarchy, feel free to comment. https://gist.github.com/janbuchar/0412e1b4224065e40e937e91d474f145

janbuchar avatar Sep 02 '24 14:09 janbuchar

I made a gist to illustrate a possible new inheritance hierarchy, feel free to comment. https://gist.github.com/janbuchar/0412e1b4224065e40e937e91d474f145

It looks great!

I'm even thinking about whether specific subclasses like BeautifulSoupCrawler / ParselCrawler might be unnecessary when the HttpCrawler class itself can serve the purpose with the proper configuration of parsers and HTTP clients (with some abstractions, as you suggested).

class BeautifulSoupCrawler(
    HttpCrawler[BeautifulSoupCrawlingContext, BeautifulSoupResult]
):
    pass

It would be a big breaking change of course...

Also, it seems in your PoC I can do the following:

parsel_crawler = ParselCrawler(
    http_client=httpx_client, 
    parser=BeautifulSoupStaticContentParser()
)

(Having an instance of ParselCrawler with the BeautifulSoup parser.)

We probably wouldn't want to expose the parser on the BS/Parsel level.
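
One way to avoid exposing the parser at the BS/Parsel level is for the concrete subclass to pin its parser and simply not accept a parser kwarg. A minimal sketch with stand-in class names (not the real crawlee API):

```python
from typing import Any


class StaticContentParser:
    """Base parser interface (stand-in)."""

    def parse(self, html: str) -> Any:
        raise NotImplementedError


class ParselLikeParser(StaticContentParser):
    def parse(self, html: str) -> Any:
        return ('parsel', html)


class HttpCrawler:
    """Generic base: accepts any parser."""

    def __init__(self, *, parser: StaticContentParser, http_client: Any = None) -> None:
        self.parser = parser
        self.http_client = http_client


class ParselCrawler(HttpCrawler):
    """Concrete subclass: pins the parser and exposes no `parser` kwarg."""

    def __init__(self, *, http_client: Any = None) -> None:
        super().__init__(parser=ParselLikeParser(), http_client=http_client)
```

With this shape, passing `parser=` to ParselCrawler raises a TypeError, so the mismatched combination from the PoC becomes impossible by construction.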

vdusek avatar Sep 03 '24 07:09 vdusek

I'm even thinking about whether specific subclasses like BeautifulSoupCrawler / ParselCrawler might be unnecessary when the HttpCrawler class itself can serve the purpose with the proper configuration of parsers and HTTP clients (with some abstractions, as you suggested).

class BeautifulSoupCrawler(
    HttpCrawler[BeautifulSoupCrawlingContext, BeautifulSoupResult]
):
    pass

Well, you could just use HttpCrawler, true, but it wouldn't be as user-friendly - manually writing out type parameters gets tedious fast. I'd probably keep those classes just for convenience.

Also, it seems in your PoC I can do the following:

parsel_crawler = ParselCrawler(
    http_client=httpx_client, 
    parser=BeautifulSoupStaticContentParser()
)

(Having an instance of ParselCrawler with the BeautifulSoup parser.)

We probably wouldn't want to expose the parser on the BS/Parsel level.

True.

janbuchar avatar Sep 03 '24 08:09 janbuchar