crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...
- For now, let's use a data file containing fingerprints (or at minimum user agents) from the Apify fingerprint dataset.
- Use the init script from the [fingerprint suite](https://github.com/apify/fingerprint-suite/blob/master/packages/fingerprint-injector/src/utils.js).
- ...
There are useful configuration options for `json.dump()` which I'd like to pass through `await crawler.export_data("export.json")`, but I see no way to do that:

- `ensure_ascii` - as someone living in a...
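For context, this is what those `json.dump()` options do in plain stdlib Python, independent of Crawlee:

```python
import json

data = {"city": "Brno", "note": "háčky a čárky"}

# Default behavior: non-ASCII characters are escaped to \uXXXX sequences,
# which makes the exported file hard to read for non-English data.
escaped = json.dumps(data)

# With ensure_ascii=False the original characters are kept; indent makes
# the output human-readable.
readable = json.dumps(data, ensure_ascii=False, indent=2)
print(readable)
```

Being able to forward keyword arguments like these to the underlying `json.dump()` call is exactly what the issue asks for.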
Example

```python
async def main() -> None:
    crawler = HttpCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        ...
```
Consider this sample program:

```python
import asyncio

from crawlee.configuration import Configuration
from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext


async def default_handler(context: ParselCrawlingContext) -> None:
    for category in context.selector.xpath(
        '//div[@class="side_categories"]//ul/li/ul/li/a'
    ):
        await context.push_data({"category": ...
```
- We should create a new documentation guide about how to avoid getting blocked.
- Inspiration: https://crawlee.dev/docs/guides/avoid-blocking
- This should be done once the fingerprint-related issues are done (#401, #402).
- Implement "max crawl depth" / "crawling depth limit".
- See https://github.com/apify/crawlee-python/discussions/441
- The depth information should be stored in the `Request` (`user_data` -> `crawlee_data`).
- https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L457-L463 could serve as inspiration - we should consider also making this a method of the kv-store resource client and implementing it separately for memory storage and for platform...
- Modify the `extended_unique_key` computation to include a set of predefined HTTP headers, alongside the existing normalized URL and payload.
- Only include headers from the whitelist.
- Identify which...
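A rough sketch of what such a computation could look like: the whitelist contents and the `compute_extended_unique_key` helper below are assumptions for illustration, not Crawlee's actual implementation.

```python
import hashlib

# Hypothetical whitelist; which headers belong here is exactly the open
# question in the last bullet above.
HEADER_WHITELIST = {"accept", "accept-language", "authorization"}


def compute_extended_unique_key(
    url: str, method: str, payload: bytes, headers: dict[str, str]
) -> str:
    # Keep only whitelisted headers, lowercased and sorted so that header
    # casing and ordering do not produce different keys.
    normalized_headers = "|".join(
        f"{name.lower()}:{value.strip()}"
        for name, value in sorted(headers.items(), key=lambda kv: kv[0].lower())
        if name.lower() in HEADER_WHITELIST
    )
    payload_hash = hashlib.sha256(payload).hexdigest()
    return f"{method}|{url}|{payload_hash}|{normalized_headers}"
```

With this shape, two requests that differ only in a non-whitelisted header (e.g. a tracing header) still deduplicate to the same key, while a different `Authorization` header yields a distinct one.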
### Description

- We had `data` and `payload` fields on the `Request` model.
- `payload` was not being provided to the HTTP clients, only the `data` field.
- ~In this...
Currently, we format the changelog to have fully qualified links to GH issues/PRs/users, but the GH release notes don't understand this properly and no longer render the user icons or...
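One possible workaround, sketched as a post-processing step over the generated changelog text; the regexes and the `shorten_changelog_links` function are hypothetical, not part of the existing release tooling.

```python
import re

# Collapse fully qualified GitHub links back to the short forms that
# release notes auto-link and render with icons: #123 and @user.
ISSUE_LINK = re.compile(r"\[#(\d+)\]\(https://github\.com/[^)]+\)")
USER_LINK = re.compile(r"\[@([\w-]+)\]\(https://github\.com/[^)]+\)")


def shorten_changelog_links(text: str) -> str:
    text = ISSUE_LINK.sub(r"#\1", text)
    return USER_LINK.sub(r"@\1", text)


line = (
    "Fix export ([#500](https://github.com/apify/crawlee-python/pull/500)) "
    "by [@janbuchar](https://github.com/janbuchar)"
)
print(shorten_changelog_links(line))
# → Fix export (#500) by @janbuchar
```

GitHub then expands these short references itself when rendering the release notes, so the user avatars and issue previews come back.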