Add support for SQLite storage client
- Implement
SqliteStorageClientas an alternative toMemoryStorageClient. - Some reasoning can be found in https://github.com/apify/crawlee-python/issues/860
- It should be implemented via an ORM, to support multiple dialects out of the box, probably SQLAlchemy but do proper investigation before.
There are clear advantages to not using an ORM (Object-Relational Mapper):
- Increased Performance: An ORM adds an extra layer of abstraction, which can slow down operations compared to writing raw SQL directly.
- Reduced Maintenance Overhead: ORMs evolve, requiring updating your code to stay compatible. In contrast, SQL adheres to the ANSI standard and changes much less frequently.
- Better Quality: All software has bugs, including ORMs. By not using an ORM, you reduce the overall surface area for defects in your system.
- Lower Memory Consumption: ORMs consume additional RAM, which can be problematic in resource-constrained environments.
- Improved Tooling: SQL databases are supported by a wide range of mature design, analysis, and inspection tools.
- Better Scaling: On large datasets, the complex queries generated by an ORM can become a performance bottleneck, whereas optimized SQL can scale more effectively.
- Simpler Dependency Management: Reducing third-party libraries minimizes conflicts and maintenance issues.
- Broader Python Version Compatibility: Many ORMs lag in supporting all currently maintained Python versions.
- Reduced Learning Curve for Developers: SQL is a fundamental skill for most developers, so avoiding an ORM eliminates the need for additional learning.
There are also advantages to providing just one solution (i.e., don't have both a SqliteStorageClient and MemoryStorageClient):
- Saves you labour: Without multiple configurations or abstraction layers, there's less code to write, maintain, and document.
- No Learning Curve for Users: Your users don’t need to choose between options or configure them—they can get started right away.
While ORMs are great for getting a project off to a quick start, their long-term drawbacks often outweigh this one advantage. For small, short-term projects, they suffice. However, avoiding ORMs may be the better path for production-grade software where performance, maintainability, and scalability matter.
SQLite and the libraries that access it are standard parts of the Python ecosystem, providing a powerful, lightweight, and efficient solution for database management.
Hi! I'm looking to contribute and thought this would be an interesting issue to work on.
To get some clarity, are we looking to use an ORM to create a place to "save" the scraped data to, replacing the storage output directory containing the scraped contents? Or, is it meant to act like RAM, allowing scraped info to be "loaded into memory" and further processed/acted on down the line?
I've looked through this issue and the related one linked, as well as took a loot at MemoryStorageClient and StorageClient classes on Crawlee.dev.
Hi @JoeRamaci , it's awesome that you want to contribute! However, the storage system is currently going through a major refactor, so it's not the best time to start working on this right now.
So I suggest you either pick some other issue or wait for #1107 to get merged
Ah, I hear you @janbuchar. However, I'm gonna level with you- I'm in a high level Computer Science course, and part of our assignment is to solve issues with a repo on GitHub. Even if my changes don't end up being accepted, I'd still need to try to "solve" this issue in some form- if you get a chance, some clarity on how Crawlee's storage/memory works would still be super helpful!
I totally get you! Are there any requirements on the issue that you need to solve? Perhaps we could help you find something else to work on so that your code can actually be useful 🙂
It wasn't so much requirements we were given, as it was just going out and picking some that sounded solvable. My group I'm working with has picked out a 8 issues plus this one and committed to fixing 5 (no take backsies sadly). Here are the links to said issues: https://github.com/apify/crawlee-python/issues/861
https://github.com/apify/crawlee-python/issues/404
https://github.com/apify/crawlee-python/issues/188
https://github.com/apify/crawlee-python/issues/47
https://github.com/apify/crawlee-python/issues/838
https://github.com/apify/crawlee-python/issues/1131
https://github.com/apify/crawlee-python/issues/769
https://github.com/apify/crawlee-python/issues/1055
Hey @JoeRamaci, if the selection is final, I wish you good luck 🙂
For this one, all you need to do is to implement the StorageClient interface and the six other storage "subclients" (DatasetClient, DatasetCollectionClient and so on). Then, it will be possible to swap in this implementation instead of now-default memory storage. The solution should be general enough to work on any common relational database, not just SQL - that's why we were discussing ORMs. I'm not sure now if we managed to settle on any particular library - maybe @vdusek knows more?
Once again, the StorageClient is going through some sizable changes, so if you want your work to be merged eventually, you'll need to update it once we finish those. Or perhaps you could start working on it last and maybe the refactor will be done by then 🙂
Hi @JoeRamaci, if you're interested in this, I believe it would be better and more beneficial for everyone if you start with the memory-storage-refactor branch. It introduces a much-simplified interface for storage clients, making it easier to implement custom storage clients.
For details on the new interface, take a look at the base storage clients. You can also check out to the implementations of the memory and file-system clients for inspiration.
not just SQL - that's why we were discussing ORMs. I'm not sure now if we managed to settle on any particular library - maybe @vdusek knows more?
We decided to utilize SQLAlchemy, primarily due to its integration with Pydantic via SQLModel. This combination will allow us to reduce code duplication (e.g. only a single model for Request for both Pydantic and ORM).
Good luck!
Understood @vdusek, I'll work off of that branch. Thanks for all the help guys!