crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...
- the `ProactorEventLoop` used by asyncio on Windows does not implement `add_signal_handler` - on UNIX, we use it to catch SIGINT early, print a message, and cancel the task that...
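A minimal sketch of one way to cope with the missing `add_signal_handler`: try to register the handler and report failure when the loop raises `NotImplementedError`, as the Windows `ProactorEventLoop` does. The helper name `try_add_sigint_handler` is hypothetical and is not part of Crawlee; what the caller does on failure (for example, relying on the default `KeyboardInterrupt` behaviour) is an assumption left open here.

```python
import signal
from typing import Any, Callable


def try_add_sigint_handler(loop: Any, callback: Callable[[], None]) -> bool:
    """Register `callback` for SIGINT on `loop` if the loop supports it.

    Hypothetical helper for illustration. Returns True when the handler was
    installed and False when the loop does not implement signal handlers.
    """
    try:
        loop.add_signal_handler(signal.SIGINT, callback)
    except NotImplementedError:
        # Raised by event loops without signal support,
        # e.g. asyncio's ProactorEventLoop on Windows.
        return False
    return True
```

Returning a boolean rather than raising lets the caller choose the fallback without branching on the platform.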
Coordinate with @barjin before implementing anything. There is a possibility of developing a dedicated fingerprinting library (in Rust?). In that case, we will just do some wrapping in Python tooling...
### Context A while ago, Honza Javorek raised some good points regarding the deduplication process in the request queue ([#190](https://github.com/apify/apify-sdk-python/issues/190)). The first one: > Is it possible that Apify's request...
### Description - Enhance the testing of PlaywrightCrawler by adding a mocked Playwright API. - It will provide a more isolated and stable testing environment, similar to how we use HTTPX...
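As a rough illustration of what a mocked Playwright API could look like in a test, the sketch below replaces a Playwright `Page` with `unittest.mock.AsyncMock`. The handler `extract_title` and the factory `make_mock_page` are hypothetical names for illustration only; this is not how Crawlee's test suite is written, just one possible approach that needs no browser.

```python
import asyncio
from unittest.mock import AsyncMock


async def extract_title(page, url: str) -> dict:
    """Hypothetical handler: navigate to `url` and collect the page title."""
    await page.goto(url)
    return {'url': page.url, 'title': await page.title()}


def make_mock_page(url: str, title: str) -> AsyncMock:
    """Build a stand-in for a Playwright Page (an assumption, not a real browser)."""
    page = AsyncMock()
    page.url = url                    # `url` is a plain property on Page
    page.title.return_value = title   # `await page.title()` yields this value
    return page


mock_page = make_mock_page('https://example.com', 'Example Domain')
result = asyncio.run(extract_title(mock_page, 'https://example.com'))
```

Because every awaited method on the mock is recorded, a test can also assert on the calls made, for example with `mock_page.goto.assert_awaited_once_with(...)`.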
Generate CHANGELOG from the commit messages as we do in JS/TS projects. Once this is solved for this repository, please create the same issue in the SDK, Client, and Shared...
- https://crawlee.dev/api/core/function/useState
- Enhance testing for the `wait_for_all_requests_to_be_added=False` scenario in `RequestQueue.add_requests_batched` - Based on https://github.com/apify/crawlee-python/pull/186#discussion_r1642398284.
- Naming `browsers/browser_plugin.py` vs `browsers/browser_factory.py` (or `BrowserControllerFactory`). - "Plugin" is the old name and doesn't quite fit the current use case. "Factory," on the other hand, seems to be a...
The purpose of the fields is somewhat unclear, but it's certain that they don't belong to the `Request` class. We should definitely explore the notion of an internal request in...