Add Scrapling as a parser
Hi, I like your project! I see you are using BeautifulSoup and Parsel as parsers, and I'd like to suggest a better and faster new parser: Scrapling.
Scrapling has several new features, including a custom version of Camoufox that is more stable than Camoufox's Python interface. It also has its own parser, built on lxml like Parsel. Unlike Parsel, though, it doesn't only provide ways to select elements by CSS/XPath selectors: it also offers new options, such as selecting elements by their text content (using lateral search or regular expressions) and a find function similar to the one BeautifulSoup has, but more powerful and multiple times faster.
It also provides a way to make self-healing spiders that adapt to website design changes without AI, as well as a method to find elements similar to ones you've already located, like AutoScraper, but much faster and more accurate (see the sketch below).
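To give a concrete taste, here is a minimal sketch based on the examples in Scrapling's documentation; the class and method names (`Adaptor`, `find_by_text`, `find_similar`) are the ones its README shows, but they may shift between versions:

```python
from scrapling import Adaptor  # Scrapling's parser class

html = (
    '<div class="product"><span class="price">£12.99</span></div>'
    '<div class="product"><span class="price">£8.50</span></div>'
)
page = Adaptor(text=html)

# Familiar CSS selection, using the same translator as Parsel.
prices = page.css('.price::text')

# Select an element by its text content instead of a selector.
first = page.find_by_text('£12.99', first_match=True)

# Find elements structurally similar to the one we just located,
# i.e. the AutoScraper-style feature mentioned above.
others = first.find_similar()
```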
If all these features are not sufficient, here are two benchmarks from the documentation that compare it to other libraries:
Text Extraction Speed Test (5000 nested elements)
| # | Library | Time (ms) | vs Scrapling |
|---|---|---|---|
| 1 | Scrapling | 1.92 | 1.0x |
| 2 | Parsel/Scrapy | 1.99 | 1.036x |
| 3 | Raw Lxml | 2.33 | 1.214x |
| 4 | PyQuery | 20.61 | ~11x |
| 5 | Selectolax | 80.65 | ~42x |
| 6 | BS4 with Lxml | 1283.21 | ~668x |
| 7 | MechanicalSoup | 1304.57 | ~679x |
| 8 | BS4 with html5lib | 3331.96 | ~1735x |
Element Similarity & Text Search Performance
| Library | Time (ms) | vs Scrapling |
|---|---|---|
| Scrapling | 1.87 | 1.0x |
| AutoScraper | 10.24 | 5.476x |
It would be a solid addition to Crawlee to have it as an extra. What do you think? I'm the author of Scrapling, so if any modifications are needed to make this happen, let me know.
Hi @D4Vinci, thanks for sharing it! We'll explore the tool and consider its integration with Crawlee.
Hi @vdusek
I have released a new version that adds many new features, but most importantly, it moves the fetchers' dependencies into an extra. The core package is now only the parser, so the core dependencies you have to install to use the parser are about half of what they were before.
This should make it much easier to integrate here. What do you think?
Hey @vdusek, do you have any updates on this? I have released another two versions since my last comment 😄
I have also updated the benchmarks above to the latest versions, which means Scrapling has become even faster since I opened this issue.
Hi @D4Vinci @vdusek , is there any update about it? It would be great to know if it’s planned or in progress. Thanks!
Thanks @AbdullahY36, I'm still waiting for @vdusek's reply.
Hi @AbdullahY36, we are planning to look into this in the near future, but we are focusing on polishing Crawlee version 1.0 at the moment.
Thanks @janbuchar I'm looking forward to it!
Hi, we've decided not to include Scrapling directly in the Crawlee codebase as an alternative to BeautifulSoup and Parsel for HTTP-based crawling. Instead, we'll continue to support only Parsel, as the performant parser that supports both CSS and XPath selectors, and BeautifulSoup, which remains one of the most popular parsers (historically at least) and is well known for its ability to handle invalid/incomplete HTML.
However, integrating Crawlee with Scrapling should be quite straightforward. As I understand it, the Scrapling library includes, among other things, an HTML parser and an HTTP client. The parser can be used with HttpCrawler as a custom parser, and the HTTP client can be integrated by implementing the BaseHttpClient interface; see the sketch below.
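To illustrate the simpler of those two routes, here is a minimal sketch that parses `HttpCrawler` responses with Scrapling inside a request handler. It assumes Scrapling's `Adaptor` class and current Crawlee import paths; depending on your Crawlee version, `http_response.read()` may be synchronous or a coroutine, so treat this as a starting point rather than copy-paste code:

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from scrapling import Adaptor  # Scrapling's parser


async def main() -> None:
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        # Hand the raw response body to Scrapling instead of BS4/Parsel.
        body = context.http_response.read()
        page = Adaptor(body=body, url=context.request.url)
        await context.push_data({
            'url': context.request.url,
            'title': page.css_first('title::text'),
        })

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```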
Our goal is to make Crawlee as pluggable as possible. For example, we already have a guide on integrating Stagehand for AI-based selectors: Using Stagehand with PlaywrightCrawler. We could mention Scrapling integration in the documentation as well - either through a dedicated guide, or by extending the existing HTTP Crawlers and HTTP Clients guides.
Thanks for the reply, @vdusek. I would love to have Scrapling featured in your documentation. Please feel free to let me know if I can help with anything related to that.
As a clarification, Scrapling is on par with Parsel in terms of speed, as both utilize lxml. Scrapling also uses the same CSS/XPath translator as Parsel (it's copied from it, as mentioned in the README) to make the transition easy for users coming from Scrapy/Parsel.
Also, Scrapling automatically handles invalid/incomplete HTML through lxml, the same way Parsel does.
So adding Scrapling would keep the CSS/XPath selection logic the same, require less code than Parsel, and handle invalid/incomplete HTML automatically.
On top of that, you get a slight speed increase, the ability to make crawlers adapt to website design changes, and new selection methods besides CSS/XPath that make it easier to handle awkward websites (like find_similar, which uses less memory and is about 5 times faster than AutoScraper, which also relies on BeautifulSoup).
Also, here is an article that shows every function in BeautifulSoup with its equivalent in Scrapling: https://scrapling.readthedocs.io/en/latest/tutorials/migrating_from_beautifulsoup/
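As a quick example of that mapping, a typical BeautifulSoup call and its Scrapling counterpart look like this (a rough sketch based on the migration guide; exact keyword-argument support may vary by version):

```python
html = '<div class="quote">To be, or not to be.</div>'

# BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
quotes = soup.find_all('div', class_='quote')

# Scrapling: the same find_all interface, backed by lxml
from scrapling import Adaptor
page = Adaptor(text=html)
quotes = page.find_all('div', class_='quote')
```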
Hello, any updates?
Hey @D4Vinci
@vdusek asked me to take care of this task.
I delved a little into the code of your framework, and I must say it's great work.
However, I agree with @vdusek that we should not add it as a core extension of Crawlee, for several reasons. The main one is that, even though you have separated the fetchers from the core Scrapling code, all of your documentation presents it as a self-sufficient framework; for someone reading your documentation, it is not a parser, it is a framework for crawling. Another reason is that integrating it into Crawlee would limit some of your scraper's capabilities, which would make it hard to use it to its full potential.
However, we have decided to expand our documentation to cover how Crawlee can be used with third-party parsers that are not included in the project code. This section of the documentation will also mention Scrapling.