crawlee-python

Reconsider crawler inheritance

Open janbuchar opened this issue 7 months ago • 5 comments

Currently, we have the following inheritance chains:

  • BasicCrawler -> HttpCrawler
  • BasicCrawler -> BeautifulSoupCrawler
  • BasicCrawler -> PlaywrightCrawler
  • BasicCrawler -> ParselCrawler (#348 )

This is an intentional difference from the JS version, where

  • BrowserCrawler is a common ancestor of PlaywrightCrawler and PuppeteerCrawler
    • this is not relevant in the Python ecosystem - we won't implement anything similar to Playwright anytime soon
  • CheerioCrawler and JSDomCrawler inherit from HttpCrawler
    • this is the important difference
    • We decided to do this differently to avoid inheritance chains, which make it harder to track down the code that is actually being executed. The cost is a bit of code duplication.
    • In the Python version, we also have the HttpClient abstraction, and most of the HTTP-handling logic is contained there

We might want to reconsider this because

  • New HTML parsers are being added as we speak
    • This might make the code duplication too costly to maintain
  • For #249, we would like to have a "parse the current HTML" helper that works with all supported HTML parsers, not just BeautifulSoup

The possible ways out are

  1. Leave it as it is now
  2. Parametrize HttpCrawler with an HTML parser
    • this would make BeautifulSoupCrawler and ParselCrawler very thin - they would just pass the right HttpClient and HtmlParser to HttpCrawler
    • we may want to consider moving the send_request context helper from BasicCrawlingContext to HttpCrawlingContext
  3. Remove HttpCrawler altogether and pull its functionality into BasicCrawler
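
To make option 2 more concrete, here is a rough sketch of what a parser-parametrized crawler could look like. It is only an illustration under assumptions: the classes below are simplified stand-ins written for this comment, not the actual crawlee classes, and the `fetch`/`parse` method names, `ParsedContext`, `StaticHttpClient`, `TitleParser` and `TitleCrawler` are all hypothetical. It uses only the standard library so it can be run as is.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Awaitable, Callable, Generic, Iterable, Protocol, TypeVar

TParsed = TypeVar("TParsed")
TParsed_co = TypeVar("TParsed_co", covariant=True)


class HttpClient(Protocol):
    """Hypothetical stand-in for an HTTP client abstraction."""

    async def fetch(self, url: str) -> str: ...


class HtmlParser(Protocol[TParsed_co]):
    """Hypothetical parser interface: turns an HTML string into a parsed object."""

    def parse(self, html: str) -> TParsed_co: ...


@dataclass
class ParsedContext(Generic[TParsed]):
    """Simplified crawling context handed to the request handler."""

    url: str
    parsed: TParsed


class HttpCrawler(Generic[TParsed]):
    """Stand-in crawler that owns the fetch-then-parse flow for any parser."""

    def __init__(self, http_client: HttpClient, parser: HtmlParser[TParsed]) -> None:
        self._http_client = http_client
        self._parser = parser

    async def run(
        self,
        urls: Iterable[str],
        handler: Callable[[ParsedContext[TParsed]], Awaitable[None]],
    ) -> None:
        for url in urls:
            body = await self._http_client.fetch(url)
            await handler(ParsedContext(url=url, parsed=self._parser.parse(body)))


class _TitleCollector(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag: str) -> None:
        if tag == "title":
            self._in_title = False

    def handle_data(self, data: str) -> None:
        if self._in_title:
            self.title += data


class TitleParser:
    """Toy parser used here in place of BeautifulSoup or Parsel."""

    def parse(self, html: str) -> str:
        collector = _TitleCollector()
        collector.feed(html)
        return collector.title


class TitleCrawler(HttpCrawler[str]):
    """A 'very thin' subclass: it only picks the parser."""

    def __init__(self, http_client: HttpClient) -> None:
        super().__init__(http_client, TitleParser())


class StaticHttpClient:
    """Fake client serving canned pages, so the sketch needs no network."""

    def __init__(self, pages: dict[str, str]) -> None:
        self._pages = pages

    async def fetch(self, url: str) -> str:
        return self._pages[url]
```

With this shape, a parser-specific crawler carries no HTTP logic of its own, and a "parse the current HTML" helper could be written once against the parser interface instead of once per parser. Whether the real context classes and `send_request` would fit this split is exactly what would need to be checked.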

janbuchar, Jul 23 '24 21:07