
Add scrapling as a parser

Open D4Vinci opened this issue 3 months ago • 11 comments

Hi, I like your project, and I see you are using BeautifulSoup and Parsel as parsers. Actually, there's a newer and faster parser: Scrapling.

Scrapling has several new features, including a custom version of Camoufox that is more stable than Camoufox's Python interface. It also ships its own parser, built on lxml like Parsel. Unlike Parsel, however, it doesn't only provide CSS/XPath selectors: it also offers new options, such as selecting elements by their text content (via lateral search or regular expressions), and a find function similar to BeautifulSoup's but more powerful and several times faster.

It also provides a way to build self-healing spiders that adapt to website design changes without AI, and a method to find elements similar to an already-found element, like AutoScraper, but much faster and more accurate.
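To make the "find similar elements" idea concrete, here is a toy, library-free sketch of the underlying concept: treating elements that share the same tag/class ancestry as "similar". This is a conceptual illustration only, not Scrapling's actual implementation or API.

```python
# Conceptual sketch (NOT Scrapling's code): find elements "similar" to a
# known one by comparing their tag/class ancestry paths.
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Records each text node together with its tag/class path."""
    def __init__(self):
        super().__init__()
        self.stack = []   # open-element path, e.g. ["div.product", "span.price"]
        self.items = []   # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append((tuple(self.stack), text))

doc = """
<div class="product"><span class="price">$10</span></div>
<div class="product"><span class="price">$25</span></div>
<div class="ad"><span class="price">$1</span></div>
"""

parser = PathCollector()
parser.feed(doc)

# Given one known price element's path, collect all texts sharing that path;
# the ad's price has a different ancestry and is excluded.
known_path = ("div.product", "span.price")
similar = [text for path, text in parser.items if path == known_path]
print(similar)  # → ['$10', '$25']
```

A real implementation would compare richer structural signals than the bare path, but this shows why similarity search survives cosmetic design changes that break a hard-coded selector.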

If these features aren't convincing enough, here are two benchmarks from the documentation comparing it to other libraries:

Text Extraction Speed Test (5000 nested elements)

| # | Library | Time (ms) | vs Scrapling |
|---|-------------------|-----------|--------------|
| 1 | Scrapling | 1.92 | 1.0x |
| 2 | Parsel/Scrapy | 1.99 | 1.036x |
| 3 | Raw Lxml | 2.33 | 1.214x |
| 4 | PyQuery | 20.61 | ~11x |
| 5 | Selectolax | 80.65 | ~42x |
| 6 | BS4 with Lxml | 1283.21 | ~668x |
| 7 | MechanicalSoup | 1304.57 | ~679x |
| 8 | BS4 with html5lib | 3331.96 | ~1735x |

Element Similarity & Text Search Performance

| Library | Time (ms) | vs Scrapling |
|-------------|-----------|--------------|
| Scrapling | 1.87 | 1.0x |
| AutoScraper | 10.24 | 5.476x |

It would be a solid addition to Crawlee to have it as an extra. What do you think? I'm the author of Scrapling, so if any modifications are needed to make this happen, let me know.

D4Vinci avatar Sep 03 '25 16:09 D4Vinci

Hi @D4Vinci, thanks for sharing it! We'll explore the tool and consider its integration with Crawlee.

vdusek avatar Sep 04 '25 13:09 vdusek

Hi @vdusek

I have released a new version that adds many new features; most importantly, it separates the fetchers' dependencies into an extra. The core package is now just the parser, so the core dependencies you have to install to use it are about half of what they were.

This should make it much easier to integrate here. What do you think?

D4Vinci avatar Sep 15 '25 01:09 D4Vinci

Hey @vdusek, do you have any updates on this? I have released another two versions since my last comment 😄

I have also updated the benchmarks above to the latest versions, which means Scrapling has become even faster since I opened this issue.

D4Vinci avatar Sep 16 '25 15:09 D4Vinci

Hi @D4Vinci @vdusek , is there any update about it? It would be great to know if it’s planned or in progress. Thanks!

AbdullahY36 avatar Sep 25 '25 15:09 AbdullahY36

> Hi @D4Vinci @vdusek , is there any update about it? It would be great to know if it’s planned or in progress. Thanks!

Thanks @AbdullahY36, I'm still waiting for @vdusek's reply.

D4Vinci avatar Sep 25 '25 16:09 D4Vinci

Hi @AbdullahY36, we are planning to look into this in the near future, but we are focusing on polishing Crawlee version 1.0 at the moment.

janbuchar avatar Sep 26 '25 11:09 janbuchar

> Hi @AbdullahY36, we are planning to look into this in the near future, but we are focusing on polishing Crawlee version 1.0 at the moment.

Thanks @janbuchar, I'm looking forward to it!

D4Vinci avatar Sep 26 '25 12:09 D4Vinci

Hi, we've decided not to include Scrapling directly in the Crawlee codebase as an alternative HTTP-based crawler alongside BeautifulSoup and Parsel. Instead, we'll continue to support only Parsel - as the performant parser that supports both CSS and XPath selectors - and BeautifulSoup, which remains one of the most popular parsers (historically at least) and is well-known for its ability to handle invalid/incomplete HTML.

However, integrating Crawlee with Scrapling should be quite straightforward. As I understand it, the Scrapling library includes, among other things, an HTML parser and an HTTP client. The parser can be used with HttpCrawler as a custom parser, and the HTTP client can be integrated by implementing the BaseHttpClient interface.

Our goal is to make Crawlee as pluggable as possible. For example, we already have a guide on integrating Stagehand for AI-based selectors: Using Stagehand with PlaywrightCrawler. We could mention Scrapling integration in the documentation as well - either through a dedicated guide, or by extending the existing HTTP Crawlers and HTTP Clients guides.

vdusek avatar Oct 06 '25 08:10 vdusek

Thanks @vdusek, for the reply. I would love to have Scrapling featured in your documentation. Please feel free to let me know if I can help you with anything related to that.

As a clarification, Scrapling is on par with Parsel in terms of speed, as both utilize lxml. Scrapling also uses the same CSS/XPath translator as Parsel (it's copied from it, as mentioned in the README) to ease the transition for users coming from Scrapy/Parsel. And Scrapling handles invalid/incomplete HTML automatically through lxml, the same way Parsel does.
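For readers unfamiliar with this lxml behavior, here is a minimal demonstration using plain lxml (neither Parsel nor Scrapling involved) of how the parser recovers from unclosed tags:

```python
# lxml's HTML parser repairs invalid markup by default; libraries built on
# top of it (Parsel, Scrapling) inherit this recovery behavior.
from lxml import html

broken = "<div><p>first<p>second</div>"  # two unclosed <p> tags
tree = html.fromstring(broken)

# lxml implicitly closes the first <p> when the second one opens,
# yielding two sibling paragraphs instead of an error.
texts = [p.text for p in tree.findall(".//p")]
print(texts)  # → ['first', 'second']
```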

So adding Scrapling would keep the CSS/XPath selection logic the same, require less code than Parsel, and handle invalid/incomplete HTML automatically. On top of that, you get a slight speed increase, the ability to make crawlers adapt to website design changes, and selection methods beyond CSS/XPath for handling difficult websites (such as find_similar, which is more accurate, uses less memory, and is about 5 times faster than AutoScraper, which also relies on BeautifulSoup).

Also, here is an article that maps every BeautifulSoup function to its Scrapling equivalent: https://scrapling.readthedocs.io/en/latest/tutorials/migrating_from_beautifulsoup/

D4Vinci avatar Oct 06 '25 14:10 D4Vinci

Hello, any updates?

D4Vinci avatar Nov 01 '25 20:11 D4Vinci

Hey @D4Vinci

@vdusek asked me to take care of this task.

I delved a little into your framework's code, and I must say it's great work.

However, I agree with @vdusek that we should not add it as a core extension of Crawlee, for several reasons. The main one is that, even though you have separated the fetchers from the core Scrapling code, all of your documentation presents it as a self-sufficient framework: for someone reading your documentation, it is not a parser, it is a framework for crawling. Another reason is that integrating it into Crawlee would limit some of your scraper's capabilities, which would make it difficult to use it to its full potential.

However, we have decided to expand our documentation to cover how Crawlee can be used with third-party parsers not included in the project code. That section of the documentation will also mention Scrapling.

Mantisus avatar Nov 01 '25 21:11 Mantisus