crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...
These items are currently blocked but should be resolved before the public launch (8. 7.).
### TODO
- [x] Replace all occurrences of `apify.github.io/crawlee-python` with `crawlee.dev/python` in `README.md` once the...
- Only Markdown content.
- Inspiration: https://crawlee.dev/docs/guides.
- Some content from the old README could be copied in: https://github.com/apify/crawlee-python/blob/v0.0.7/README.md.
similar to what we're implementing in JS crawlee
- configurable interval
- configurable status message callback (constructor parameter, property or decorator?)
- we periodically set the crawler status via storage client
- in javascript crawlee, this does nothing...
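The list above can be read as a small periodic task that builds a status message and passes it to a user-supplied callback. Below is a minimal sketch of that idea in plain `asyncio`. It assumes the constructor-parameter variant. Every name in it (`PeriodicStatusReporter`, `get_message`, `callback`, `demo`) is hypothetical and not part of the Crawlee API. The storage client call is replaced by a callback that only appends to a list.

```python
import asyncio
import contextlib
from collections.abc import Awaitable, Callable
from datetime import timedelta


class PeriodicStatusReporter:
    """Hypothetical helper: calls `callback` with a fresh message every `interval`."""

    def __init__(
        self,
        get_message: Callable[[], str],
        callback: Callable[[str], Awaitable[None]],
        interval: timedelta = timedelta(seconds=10),
    ) -> None:
        self._get_message = get_message
        self._callback = callback
        self._interval = interval
        self._task: asyncio.Task | None = None

    async def _run(self) -> None:
        # Report immediately, then sleep for the configured interval, forever.
        while True:
            await self._callback(self._get_message())
            await asyncio.sleep(self._interval.total_seconds())

    async def __aenter__(self) -> "PeriodicStatusReporter":
        self._task = asyncio.create_task(self._run())
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        # Stop the background loop and wait until the cancellation has finished.
        if self._task is not None:
            self._task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await self._task


async def demo() -> list[str]:
    received: list[str] = []
    handled = 0

    async def store_status(message: str) -> None:
        # Stand-in for "set the crawler status via storage client".
        received.append(message)

    def build_message() -> str:
        return f"Handled {handled} requests"

    async with PeriodicStatusReporter(
        build_message, store_status, interval=timedelta(milliseconds=10)
    ):
        for _ in range(3):
            handled += 1
            await asyncio.sleep(0.03)  # simulated crawling work
    return received


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A property or decorator could register the callback equally well. The constructor parameter is used here only because it keeps the sketch short.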
This change makes https://github.com/apify/apify-sdk-python/blob/162ce1080d024fe2cf399534e8f960a584524232/tests/unit/actor/test_actor_memory_storage_e2e.py#L54 pass again. The PR is a draft; it exists mostly so that I don't lose or forget this.
- Currently, there is only a dummy version of `Snapshotter._snapshot_client()` without a real measurement.
- Once `StorageClient` is implemented, use it there to measure the real values.
- Check TypeScript...
- Fetching requests from `RequestQueue` is sometimes very slow and can get stuck for a while.
- I turned on logging and reproduced the issue with the following code: ```python...
https://github.com/apify/crawlee-python/blob/896501edb44f801409fec95cb3e5f2bcfcb4188d/src/crawlee/beautifulsoup_crawler/beautifulsoup_crawler.py#L86 can be used as a reference.