Reconsider crawler inheritance
Currently, we have the following inheritance chains:

- `BasicCrawler` -> `HttpCrawler`
- `BasicCrawler` -> `BeautifulSoupCrawler`
- `BasicCrawler` -> `PlaywrightCrawler`
- `BasicCrawler` -> `ParselCrawler` (#348)

This is an intentional difference from the JS version, where:

- `BrowserCrawler` is a common ancestor of `PlaywrightCrawler` and `PuppeteerCrawler` - this is not relevant in the Python ecosystem, as we won't implement anything similar to Playwright anytime soon
- `CheerioCrawler` and `JSDomCrawler` inherit from `HttpCrawler` - this is the important difference
- We decided to do this differently to avoid inheritance chains, which make it harder to track down the code that is actually being executed. The cost is a bit of code duplication.
- In the Python version, we also have the `HttpClient` abstraction, and most of the HTTP-handling logic is contained there
We might want to reconsider this because:

- New HTML parsers are being added as we speak
- This might make the code duplication too costly to maintain
- For #249, we would like to have a "parse the current HTML" helper that works with all supported HTML parsers, not just BeautifulSoup, for instance
The possible ways out are:

- Leave it as it is now
- Parametrize `HttpCrawler` with an HTML parser
  - this would make `BeautifulSoupCrawler` and `ParselCrawler` very thin - they would just pass the right `HttpClient` and `HtmlParser` to `HttpCrawler`
  - we may want to consider moving the `send_request` context helper from `BasicCrawlingContext` to `HttpCrawlingContext`
- Remove `HttpCrawler` altogether and pull its functionality into `BasicCrawler`
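To make the second option concrete, here is a minimal sketch of what "parametrize `HttpCrawler` with an HTML parser" could look like. The `HtmlParser` protocol, `SoupParser`, and the constructor signatures are illustrative assumptions, not the actual crawlee API:

```python
from typing import Any, Protocol


class HtmlParser(Protocol):
    """Anything that can turn a raw HTTP body into a parsed document."""

    def parse(self, html: str) -> Any: ...


class HttpCrawler:
    """The parser becomes a constructor parameter instead of a subclass."""

    def __init__(self, http_client: Any, parser: HtmlParser) -> None:
        self.http_client = http_client
        self.parser = parser


class SoupParser:
    def parse(self, html: str) -> Any:
        # A real implementation would return bs4.BeautifulSoup(html, ...)
        return {"html": html}


class BeautifulSoupCrawler(HttpCrawler):
    """Thin subclass: its only job is to pick the right defaults."""

    def __init__(self, http_client: Any = None) -> None:
        super().__init__(http_client, parser=SoupParser())


crawler = BeautifulSoupCrawler()
print(type(crawler.parser).__name__)  # SoupParser
```

Under this scheme, `ParselCrawler` would be an equally thin subclass passing a parsel-based parser.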
> For https://github.com/apify/crawlee-python/issues/249, we would like to have a "parse the current HTML" helper that works with all supported HTML parsers, not just BeautifulSoup, for instance
I wanted to have something like `parseWithCheerio` here too, not just for the adaptive crawler. But if we allow different parsers in it, I am not sure how portable the code will be, since the parser dictates the return type of such a helper, right?
> Parametrize `HttpCrawler` with an HTML parser
I like this one. But I would say we want to have some sane default, be it BeautifulSoup or anything else. I guess it depends on how heavy that dependency is, since this default should always be installed; that's how we do it in the JS version, where cheerio is not an optional dependency.
> we may want to consider moving the `send_request` context helper from `BasicCrawlingContext` to `HttpCrawlingContext`
Similarly to the default parser, we should have a default request client, and if we have one, it feels OK to have that helper in the basic crawler.
> > For #249, we would like to have a "parse the current HTML" helper that works with all supported HTML parsers, not just BeautifulSoup, for instance
>
> I wanted to have something like `parseWithCheerio` here too, not just for the adaptive crawler. But if we allow different parsers in it, I am not sure how portable the code will be, since the parser dictates the return type of such a helper, right?
Yeah, the crawler class would have to be generic over the return type of the HTML parser, if that makes any sense. And it wouldn't provide the same "pluggability" that we intend to have for HTTP clients... which is probably fine.
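A minimal sketch of what "generic over the return type of the HTML parser" means in Python typing terms. The class and parameter names here are hypothetical; the point is only that the helper's return type follows whatever parser the crawler was constructed with:

```python
from typing import Callable, Generic, TypeVar

# The type produced by the configured parser (e.g. BeautifulSoup, parsel.Selector)
TParseResult = TypeVar("TParseResult")


class HttpCrawler(Generic[TParseResult]):
    """The crawler is generic over whatever its parser returns."""

    def __init__(self, parse: Callable[[str], TParseResult]) -> None:
        self._parse = parse

    def parse_html(self, raw: str) -> TParseResult:
        # A type checker infers the return type from the chosen parser.
        return self._parse(raw)


# Two crawlers whose helper returns different types:
len_crawler: HttpCrawler[int] = HttpCrawler(len)
upper_crawler: HttpCrawler[str] = HttpCrawler(str.upper)

print(len_crawler.parse_html("<p>hi</p>"))    # 9
print(upper_crawler.parse_html("<p>hi</p>"))  # <P>HI</P>
```

This is also why the helper is less "pluggable" than the HTTP clients: swapping the parser changes the static type of everything downstream of `parse_html`.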
> > Parametrize `HttpCrawler` with an HTML parser
>
> I like this one. But I would say we want to have some sane default, be it BeautifulSoup or anything else. I guess it depends on how heavy that dependency is, since this default should always be installed; that's how we do it in the JS version, where cheerio is not an optional dependency.
I imagine we could either keep `BeautifulSoupCrawler` (but it would not contain much logic) and have `HttpCrawler` use no HTML parser by default (it could throw an error when somebody attempts to parse HTML). Or we could have some parser as the default and check for its dependencies on instantiation (we already do a similar thing when importing `BeautifulSoupCrawler` and `PlaywrightCrawler`).
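The "check for the dependencies on instantiation" pattern could be sketched as follows. The error message and install hint are assumptions for illustration, not the actual crawlee wording:

```python
class BeautifulSoupCrawler:
    """Sketch: verify the optional dependency when the crawler is created."""

    def __init__(self) -> None:
        try:
            from bs4 import BeautifulSoup  # optional dependency
        except ImportError as exc:
            # Fail fast with an actionable message instead of a bare
            # ImportError surfacing deep inside a request handler.
            raise ImportError(
                "BeautifulSoupCrawler requires the 'beautifulsoup4' package. "
                "Install it with: pip install beautifulsoup4"
            ) from exc
        self._soup_cls = BeautifulSoup


try:
    crawler = BeautifulSoupCrawler()
    print("beautifulsoup4 is available")
except ImportError as err:
    print(err)
```

The same check done at import time (the current approach) catches the problem even earlier, but the constructor-time variant keeps `import` of the module itself cheap and side-effect free.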
> > we may want to consider moving the `send_request` context helper from `BasicCrawlingContext` to `HttpCrawlingContext`
>
> Similarly to the default parser, we should have a default request client, and if we have one, it feels OK to have that helper in the basic crawler.
Yeah, currently `HttpxClient` is the default.
I made a gist to illustrate a possible new inheritance hierarchy, feel free to comment. https://gist.github.com/janbuchar/0412e1b4224065e40e937e91d474f145
> I made a gist to illustrate a possible new inheritance hierarchy, feel free to comment. https://gist.github.com/janbuchar/0412e1b4224065e40e937e91d474f145
It looks great!
I'm even thinking about whether specific subclasses like `BeautifulSoupCrawler` / `ParselCrawler` might be unnecessary when the `HttpCrawler` class itself can serve the purpose with the proper configuration of parsers and HTTP clients (with some abstractions, as you suggested).
```python
class BeautifulSoupCrawler(
    HttpCrawler[BeautifulSoupCrawlingContext, BeautifulSoupResult]
):
    pass
```
It would be a big breaking change of course...
Also, it seems in your PoC I can do the following:
```python
parsel_crawler = ParselCrawler(
    http_client=httpx_client,
    parser=BeautifulSoupStaticContentParser(),
)
```
(Having an instance of `ParselCrawler` with the BeautifulSoup parser.)
We probably wouldn't want to expose the parser on the BS/Parsel level.
> I'm even thinking about whether specific subclasses like `BeautifulSoupCrawler` / `ParselCrawler` might be unnecessary when the `HttpCrawler` class itself can serve the purpose with the proper configuration of parsers and HTTP clients (with some abstractions, as you suggested).
>
> ```python
> class BeautifulSoupCrawler(
>     HttpCrawler[BeautifulSoupCrawlingContext, BeautifulSoupResult]
> ):
>     pass
> ```
Well, you could just use HttpCrawler, true, but it wouldn't be as user-friendly - manually writing out type parameters gets tedious fast. I'd probably keep those classes just for convenience.
> Also, it seems in your PoC I can do the following:
>
> ```python
> parsel_crawler = ParselCrawler(
>     http_client=httpx_client,
>     parser=BeautifulSoupStaticContentParser(),
> )
> ```
>
> (Having an instance of `ParselCrawler` with the BeautifulSoup parser.)
>
> We probably wouldn't want to expose the `parser` on the BS/Parsel level.
True.