crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...
- Add an `always_enqueue` option (or use a better name for it, but avoid negative terms) as an input parameter to the `Request.from_url` constructor.
- This will allow users to...
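The request turns on deduplication: a request's unique key is normally derived from its URL, so repeated URLs collapse into one queued request, and an `always_enqueue`-style flag would instead assign a fresh key every time. A minimal stdlib sketch of that idea (the class and queue here are illustrative stand-ins, not crawlee's implementation):

```python
import hashlib
import uuid
from dataclasses import dataclass


@dataclass
class Request:
    """Illustrative stand-in for a crawler request (not crawlee's class)."""
    url: str
    unique_key: str

    @classmethod
    def from_url(cls, url: str, *, always_enqueue: bool = False) -> "Request":
        # By default the unique key is derived from the URL, so the same URL
        # deduplicates to one request. With always_enqueue=True a random key
        # is used instead, so every call yields a distinct request.
        key = uuid.uuid4().hex if always_enqueue else hashlib.sha256(url.encode()).hexdigest()
        return cls(url=url, unique_key=key)


class RequestQueue:
    """Toy queue that drops requests whose unique key was already seen."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self._pending: list[Request] = []

    def add(self, request: Request) -> bool:
        if request.unique_key in self._seen:
            return False  # deduplicated away
        self._seen.add(request.unique_key)
        self._pending.append(request)
        return True


queue = RequestQueue()
a = queue.add(Request.from_url("https://example.com"))  # first time: accepted
b = queue.add(Request.from_url("https://example.com"))  # same key: dropped
c = queue.add(Request.from_url("https://example.com", always_enqueue=True))  # fresh key: accepted
```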
- We should create a new documentation guide on how to work with sessions (`SessionPool`).
- Inspiration: https://crawlee.dev/docs/guides/session-management
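The pattern such a guide would cover is rotating session state (cookies, proxy identity) and retiring sessions that accumulate errors. A minimal stdlib sketch of that pattern, with hypothetical names rather than `SessionPool`'s real API:

```python
import random


class Session:
    """Toy session: retired once it records too many errors."""

    def __init__(self, session_id: int, max_errors: int = 3) -> None:
        self.id = session_id
        self.error_count = 0
        self.max_errors = max_errors

    @property
    def usable(self) -> bool:
        return self.error_count < self.max_errors

    def mark_bad(self) -> None:
        self.error_count += 1


class SimpleSessionPool:
    """Hands out random usable sessions; failed ones rotate out naturally."""

    def __init__(self, size: int) -> None:
        self._sessions = [Session(i) for i in range(size)]

    def get(self) -> Session:
        usable = [s for s in self._sessions if s.usable]
        if not usable:
            raise RuntimeError("all sessions retired")
        return random.choice(usable)


pool = SimpleSessionPool(size=2)
s = pool.get()
for _ in range(3):
    s.mark_bad()  # e.g. the site answered with 403 three times
remaining = {sess.id for sess in pool._sessions if sess.usable}
```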
- We could create a new documentation guide for the `PlaywrightCrawler` and `BrowserPool`.
- The guide should include the following:
  - How to use `PlaywrightCrawler` and what it provides.
  - ...
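The core job of a browser pool is sharing a bounded set of launched browsers among many page loads. A toy stdlib sketch of that scheduling idea, with a fake browser class standing in for real Playwright instances (none of these names are crawlee's API):

```python
import itertools


class FakeBrowser:
    """Stand-in for a launched browser (real code would drive Playwright)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.open_pages = 0

    def new_page(self) -> str:
        self.open_pages += 1
        return f"page@{self.name}"


class RoundRobinBrowserPool:
    """Toy pool that spreads new pages across browsers in round-robin order."""

    def __init__(self, browsers: list[FakeBrowser]) -> None:
        self._cycle = itertools.cycle(browsers)

    def new_page(self) -> str:
        return next(self._cycle).new_page()


browsers = [FakeBrowser("chromium-0"), FakeBrowser("chromium-1")]
pool = RoundRobinBrowserPool(browsers)
pages = [pool.new_page() for _ in range(4)]
```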
- We could create a new documentation guide for scaling the crawlers (mainly the features from the `_autoscaling` subpackage).
- The guide should include the following:
  - `ConcurrencySettings` - how users...
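The essence of autoscaling is adjusting the number of concurrent tasks within user-set bounds based on system load. A simplified stdlib sketch of that feedback loop (the field names mirror the concept of min/max concurrency bounds; the scaling rule and thresholds are invented for illustration):

```python
from dataclasses import dataclass


@dataclass
class ConcurrencySettings:
    """Illustrative min/max concurrency bounds (not crawlee's class)."""
    min_concurrency: int = 1
    max_concurrency: int = 10


def desired_concurrency(settings: ConcurrencySettings, current: int, cpu_load: float) -> int:
    """Scale up when the system has headroom, down when it is overloaded."""
    if cpu_load < 0.6:
        current += 1  # headroom available: add one concurrent task
    elif cpu_load > 0.9:
        current -= 1  # overloaded: shed one task
    # Always clamp to the user-configured bounds.
    return max(settings.min_concurrency, min(settings.max_concurrency, current))


s = ConcurrencySettings(min_concurrency=2, max_concurrency=5)
up = desired_concurrency(s, current=3, cpu_load=0.3)      # scales up
down = desired_concurrency(s, current=2, cpu_load=0.95)   # clamped at the minimum
capped = desired_concurrency(s, current=5, cpu_load=0.1)  # clamped at the maximum
```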
### Description

- Split the `export_data` function into `export_data_csv` and `export_data_json`, and added additional configuration options using kwargs

### Issues

- Closes: #526

### Testing

- Added test to check...
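The shape of the split can be sketched with the stdlib alone: one exporter per format, each forwarding `**kwargs` to the underlying writer so callers get format-specific knobs. This is a sketch of the idea, not the PR's actual code:

```python
import csv
import io
import json


def export_data_csv(items: list[dict], fp, **kwargs) -> None:
    """Write dict rows as CSV; extra kwargs go to csv.DictWriter (e.g. delimiter)."""
    writer = csv.DictWriter(fp, fieldnames=list(items[0]), **kwargs)
    writer.writeheader()
    writer.writerows(items)


def export_data_json(items: list[dict], fp, **kwargs) -> None:
    """Write items as JSON; extra kwargs go to json.dump (e.g. indent)."""
    json.dump(items, fp, **kwargs)


items = [{"url": "https://example.com", "status": 200}]

csv_buf = io.StringIO()
export_data_csv(items, csv_buf, delimiter=";")

json_buf = io.StringIO()
export_data_json(items, json_buf, indent=2)
```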
### Description

This PR introduces a maximum crawl depth feature to the Crawlee library. It allows users to restrict the crawler's depth to a specified level, enabling better control over the...
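The mechanism behind a depth limit is simple: track each request's depth and stop enqueueing links once the limit is reached. A self-contained sketch over a toy link graph (the graph and function names are invented for illustration):

```python
from collections import deque

# Toy link graph standing in for pages discovered during a crawl.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a1"],
    "/a1": ["/a1x"],
    "/b": [],
    "/a1x": [],
}


def crawl(start: str, max_crawl_depth: int) -> list[str]:
    """Breadth-first crawl that stops enqueueing links beyond max_crawl_depth."""
    visited: list[str] = []
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_crawl_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


shallow = crawl("/", max_crawl_depth=1)  # root plus its direct links only
deeper = crawl("/", max_crawl_depth=2)
```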
### Description

This pull request introduces the `get_public_url` method to the `KeyValueStore` class. This method generates a file URL for a given key, allowing for easy access to stored files...
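For a store backed by local files, such a method amounts to mapping a key to its on-disk path and returning it as a URL. A stdlib sketch of that idea (a toy store, not crawlee's `KeyValueStore` implementation):

```python
import tempfile
from pathlib import Path


class LocalKeyValueStore:
    """Toy key-value store backed by files on disk (not crawlee's class)."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def set_value(self, key: str, value: str) -> None:
        (self._root / key).write_text(value)

    def get_public_url(self, key: str) -> str:
        # For local storage the "public" URL is just a file:// URL; a cloud
        # backend would instead return an HTTP URL to the stored object.
        return (self._root / key).as_uri()


root = Path(tempfile.mkdtemp())
store = LocalKeyValueStore(root)
store.set_value("report.txt", "hello")
url = store.get_public_url("report.txt")
```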
The motivation is to simplify working with event data in custom listeners in `apify-sdk-python`: currently, listener parameters cannot be typed without reaching into the private `crawlee._events` submodule. See also...
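What exporting the event data types buys is illustrated below: with a public dataclass for the payload, a listener's parameter can be annotated without importing anything private. The emitter and payload names here are hypothetical, chosen only to show the pattern:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PersistStateEventData:
    """Illustrative typed event payload (hypothetical name, not crawlee's)."""
    is_migrating: bool


class EventEmitter:
    """Toy emitter that calls registered listeners with a typed payload."""

    def __init__(self) -> None:
        self._listeners: list[Callable[[PersistStateEventData], None]] = []

    def on(self, listener: Callable[[PersistStateEventData], None]) -> None:
        self._listeners.append(listener)

    def emit(self, data: PersistStateEventData) -> None:
        for listener in self._listeners:
            listener(data)


received: list[PersistStateEventData] = []


def my_listener(event: PersistStateEventData) -> None:
    # The parameter is fully typed: no need to reach into private modules.
    received.append(event)


emitter = EventEmitter()
emitter.on(my_listener)
emitter.emit(PersistStateEventData(is_migrating=False))
```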