crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...
### Description
Currently, we have a `MemoryStorageClient` that can also persist data to the file system. Let's separate the two; `FilesystemStorageClient` could probably extend `MemoryStorageClient`.
### Other related things
- There...
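To make the proposed split concrete, here is a minimal sketch of a memory-only store with a file-backed subclass layered on top. The class names `InMemoryStore` and `FileBackedStore`, their methods, and the one-JSON-file-per-key layout are all hypothetical and chosen only for illustration; they are not Crawlee's actual `MemoryStorageClient` API.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


class InMemoryStore:
    """Hypothetical memory-only store: nothing ever touches the disk."""

    def __init__(self) -> None:
        self._records: Dict[str, Any] = {}

    def set_value(self, key: str, value: Any) -> None:
        self._records[key] = value

    def get_value(self, key: str) -> Optional[Any]:
        return self._records.get(key)


class FileBackedStore(InMemoryStore):
    """Hypothetical subclass that adds persistence on top of the in-memory behavior."""

    def __init__(self, directory: Path) -> None:
        super().__init__()
        self._directory = directory
        self._directory.mkdir(parents=True, exist_ok=True)
        # Load anything persisted by an earlier instance (assumed layout: <key>.json).
        for path in self._directory.glob('*.json'):
            self._records[path.stem] = json.loads(path.read_text(encoding='utf-8'))

    def set_value(self, key: str, value: Any) -> None:
        # Keep the in-memory copy current, then write the value through to disk.
        super().set_value(key, value)
        path = self._directory / f'{key}.json'
        path.write_text(json.dumps(value), encoding='utf-8')


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        FileBackedStore(Path(tmp)).set_value('state', {'page': 3})
        # A fresh instance pointed at the same directory sees the stored value.
        print(FileBackedStore(Path(tmp)).get_value('state'))
```

With this shape, code that only needs ephemeral storage depends on the base class alone, and persistence becomes an opt-in extension rather than a flag on a single class.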
The current implementation is very basic and mostly serves for testing. We should make it more like https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_list.ts
Because of the Pydantic issue https://github.com/pydantic/pydantic-settings/issues/180, we cannot use the keyword argument `local_storage_dir` and have to use `crawlee_local_storage_dir` instead. We also need to use type ignores there. Let's rename all the keyword arguments from...
The current [Crawlee / StorageClientManager](https://github.com/apify/crawlee-py/blob/master/src/crawlee/storage_client_manager.py) is more or less just copied from the [Python SDK / StorageClientManager](https://github.com/apify/apify-sdk-python/blob/master/src/apify/storages/storage_client_manager.py) and is extremely simple. Its primary role is to maintain and provide access...
Simplify code in `RequestQueue._ensure_head_is_non_empty` https://github.com/apify/apify-sdk-python/blob/v1.3.0/src/apify/storages/request_queue.py#L428
In the current state, we make a new logger in every module that needs to log something. There is `CrawleeLogFormatter`, which handles logging in the console.
- our loggers should...
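As a rough illustration of the arrangement described above (one logger per module, with a single formatter handling console output), the following sketch uses only the standard `logging` module. `ConsoleFormatter`, `configure_package_logger`, and the `example_pkg` logger names are hypothetical stand-ins, not Crawlee's `CrawleeLogFormatter` or its real module layout, and the sketch does not try to guess the requirements that are cut off in the list item.

```python
import io
import logging
from typing import TextIO


class ConsoleFormatter(logging.Formatter):
    """Hypothetical stand-in for a console formatter."""

    def format(self, record: logging.LogRecord) -> str:
        return f'[{record.name}] {record.levelname} {record.getMessage()}'


def configure_package_logger(name: str, stream: TextIO) -> logging.Logger:
    """Attach one handler to the package-level logger.

    Loggers created in submodules propagate their records up to it,
    so they need no handlers of their own.
    """
    package_logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(ConsoleFormatter())
    package_logger.addHandler(handler)
    package_logger.setLevel(logging.INFO)
    return package_logger


# A module would normally call logging.getLogger(__name__); the dotted name
# below is an assumed example of what __name__ would be inside a package.
buffer = io.StringIO()
configure_package_logger('example_pkg', buffer)
module_logger = logging.getLogger('example_pkg.storages')
module_logger.info('queue initialised')
```

Because the handler lives on the package logger, each module only ever asks for a logger by name, and formatting is configured in one place.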
See https://github.com/apify/crawlee/blob/2d5d443da5fa701b21aec003d4d84797882bc175/packages/basic-crawler/src/internals/basic-crawler.ts#L836-L845 for inspiration
Part of the functionality has been added in #142.
- grouping and summarizing errors is mostly missing
- there doesn't seem to be a good reason for this to...