crawlee-python
Reconsider crawler inheritance
Currently, we have the following inheritance chains:
- `BasicCrawler` -> `HttpCrawler`
- `BasicCrawler` -> `BeautifulSoupCrawler`
- `BasicCrawler` -> `PlaywrightCrawler`
- `BasicCrawler` -> `ParselCrawler` (#348)
This is an intentional difference from the JS version, where
- `BrowserCrawler` is a common ancestor of `PlaywrightCrawler` and `PuppeteerCrawler`
  - this is not relevant in the Python ecosystem - we won't implement anything similar to Playwright anytime soon
- `CheerioCrawler` and `JSDomCrawler` inherit from `HttpCrawler`
  - this is the important difference
- We decided to do this differently to avoid deep inheritance chains, which make it harder to track down the code that is actually being executed. The cost is a bit of code duplication.
- In the Python version, we also have the HttpClient abstraction, and most of the HTTP-handling logic is contained there
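To make the trade-off concrete, here is a toy sketch of the flat layout. The classes are hypothetical stand-ins rather than the real crawlee implementations, and `process` is an invented method that stands for any step the siblings would otherwise share.

```python
# Toy stand-ins: hypothetical classes, not the real crawlee implementations.
class BasicCrawler:
    def process(self, html: str) -> str:
        raise NotImplementedError


class HttpCrawler(BasicCrawler):
    def process(self, html: str) -> str:
        # A shared step, written once for plain HTTP crawling...
        return html.strip()


class BeautifulSoupCrawler(BasicCrawler):
    def process(self, html: str) -> str:
        # ...and repeated here, because HttpCrawler is a sibling, not a parent.
        return html.strip()


def ancestors(cls: type) -> list[str]:
    """Names of the classes between `cls` and `object` in the MRO."""
    return [base.__name__ for base in cls.__mro__[1:-1]]
```

With this layout, `ancestors(BeautifulSoupCrawler)` contains only `BasicCrawler`, so there is a single parent to read when tracing behavior, at the price of the repeated `process` body.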
We might want to reconsider this because
- New HTML parsers are being added as we speak
  - This might make the code duplication too costly to maintain
- For #249, we would like to have a "parse the current HTML" helper that works with all supported HTML parsers, not just BeautifulSoup
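One possible shape for such a parser-agnostic helper is sketched below. Everything in it is an assumption made for illustration: the `HtmlParser` protocol, the `ParsedContext` class, and the `parse_current_html` name are hypothetical, and the two parsers are built on the standard library's `html.parser` as stand-ins for BeautifulSoup and Parsel.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Generic, Protocol, TypeVar

T = TypeVar("T")
T_co = TypeVar("T_co", covariant=True)


class HtmlParser(Protocol[T_co]):
    """Hypothetical interface: anything that turns HTML text into a parsed value."""

    def parse(self, html: str) -> T_co: ...


class _Collector(HTMLParser):
    """Collects the <title> text and all <a href> values in one pass."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.links: list[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def _collect(html: str) -> _Collector:
    collector = _Collector()
    collector.feed(html)
    collector.close()
    return collector


class TitleParser:
    """Stand-in parser #1: returns the page title as a string."""

    def parse(self, html: str) -> str:
        return _collect(html).title


class LinkParser:
    """Stand-in parser #2: returns all link targets as a list."""

    def parse(self, html: str) -> list[str]:
        return _collect(html).links


@dataclass
class ParsedContext(Generic[T]):
    """Hypothetical context: holds the current HTML and whichever parser is configured."""

    html: str
    parser: HtmlParser[T]

    def parse_current_html(self) -> T:
        # The helper never names a concrete parser library.
        return self.parser.parse(self.html)
```

The helper depends only on the protocol, so adding another parser would not require touching `ParsedContext`.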
The possible ways out are
- Leave it as it is now
- Parametrize `HttpCrawler` with an HTML parser
  - this would make `BeautifulSoupCrawler` and `ParselCrawler` very thin - they would just pass the right `HttpClient` and `HtmlParser` to `HttpCrawler`
  - we may want to consider moving the `send_request` context helper from `BasicCrawlingContext` to `HttpCrawlingContext`
- Remove `HttpCrawler` altogether and pull its functionality into `BasicCrawler`
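The second option (parametrizing `HttpCrawler`) could look roughly like the sketch below. It is a minimal illustration, not the actual crawlee API: the constructor signature, the `fetch`/`parse`/`crawl` method names, the canned `StaticHttpClient`, the two fake parsers, and the `Thin*` subclasses are all assumptions, and the sketch is synchronous for brevity where the real crawlers are async.

```python
from typing import Generic, Protocol, TypeVar

TParsed = TypeVar("TParsed")
TParsed_co = TypeVar("TParsed_co", covariant=True)


class HttpClient(Protocol):
    """Hypothetical minimal client interface."""

    def fetch(self, url: str) -> str: ...


class HtmlParser(Protocol[TParsed_co]):
    """Hypothetical parser interface."""

    def parse(self, html: str) -> TParsed_co: ...


class HttpCrawler(Generic[TParsed]):
    """Owns the fetch-then-parse flow; the parser is injected."""

    def __init__(self, http_client: HttpClient, parser: HtmlParser[TParsed]) -> None:
        self._http_client = http_client
        self._parser = parser

    def crawl(self, url: str) -> TParsed:
        html = self._http_client.fetch(url)
        return self._parser.parse(html)


class StaticHttpClient:
    """Stand-in client that serves canned pages, so the sketch needs no network."""

    def __init__(self, pages: dict[str, str]) -> None:
        self._pages = pages

    def fetch(self, url: str) -> str:
        return self._pages[url]


class FakeSoupParser:
    """Trivial stand-in for a BeautifulSoup-backed parser."""

    def parse(self, html: str) -> str:
        return html.upper()


class FakeSelectorParser:
    """Trivial stand-in for a Parsel-backed parser."""

    def parse(self, html: str) -> list[str]:
        return html.split()


class ThinSoupCrawler(HttpCrawler[str]):
    """A thin subclass: it only chooses the parser."""

    def __init__(self, http_client: HttpClient) -> None:
        super().__init__(http_client, FakeSoupParser())


class ThinParselCrawler(HttpCrawler[list[str]]):
    def __init__(self, http_client: HttpClient) -> None:
        super().__init__(http_client, FakeSelectorParser())
```

In this shape the subclasses carry no crawling logic of their own, which is what would keep new parser integrations cheap; the cost is one more level of inheritance between them and `BasicCrawler`.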