crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...

Results 138 crawlee-python issues

- For now, let's use a data file containing fingerprints (or at minimum user agents) from the Apify fingerprint dataset.
- Use the init script from the [fingerprint suite](https://github.com/apify/fingerprint-suite/blob/master/packages/fingerprint-injector/src/utils.js).
- ...

enhancement
t-tooling

`json.dump()` has useful configuration options that I'd like to pass through `await crawler.export_data("export.json")`, but I see no way to do that:
- `ensure_ascii` - as someone living in a...

enhancement
t-tooling
hacktoberfest
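
For context on the option named in the issue above, here is a minimal sketch of what `ensure_ascii` changes in the standard library's `json` module. It does not show any Crawlee API (how `export_data` might forward such options is exactly what the issue asks about), and the `item` dictionary is a made-up example record.

```python
import json

# Hypothetical scraped item containing non-ASCII text (not from the issue).
item = {"title": "Příliš žluťoučký kůň"}

# Default behaviour: non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are written as-is,
# which keeps the exported file human-readable.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; only the on-disk representation differs.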

Example

```python
async def main() -> None:
    crawler = HttpCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None: ...
```

bug
t-tooling

Consider this sample program:

```python
import asyncio

from crawlee.configuration import Configuration
from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext


async def default_handler(context: ParselCrawlingContext) -> None:
    for category in context.selector.xpath(
        '//div[@class="side_categories"]//ul/li/ul/li/a'
    ):
        await context.push_data({"category":...
```

bug
t-tooling

- We should create a new documentation guide about how to not get blocked.
- Inspiration: https://crawlee.dev/docs/guides/avoid-blocking
- This should be done once fingerprint-related issues are done (#401, #402).

documentation
t-tooling
hacktoberfest

- Implement "max crawl depth" / "crawling depth limit"
- See https://github.com/apify/crawlee-python/discussions/441
- The depth information should be stored in the `Request` (`user_data` -> `crawlee_data`)

enhancement
t-tooling
hacktoberfest
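
The depth-limit idea in the issue above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Request` class, `get_depth`, `crawl`, and the `LINKS` graph are all hypothetical stand-ins, not Crawlee's actual model or API. The only detail taken from the issue is that the depth travels with the request under `user_data` -> `crawlee_data`.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical stand-in for a request; not Crawlee's actual model."""

    url: str
    user_data: dict = field(default_factory=dict)


def get_depth(request: Request) -> int:
    # Depth is carried in user_data -> crawlee_data, as the issue suggests.
    return request.user_data.get("crawlee_data", {}).get("depth", 0)


def crawl(start_url: str, links: dict, max_depth: int) -> list:
    """Breadth-first walk over a link graph that stops enqueueing at max_depth."""
    queue = deque([Request(start_url, {"crawlee_data": {"depth": 0}})])
    seen = {start_url}
    visited = []
    while queue:
        request = queue.popleft()
        depth = get_depth(request)
        visited.append((request.url, depth))
        if depth >= max_depth:
            continue  # Do not follow links found on pages at the depth limit.
        for url in links.get(request.url, []):
            if url not in seen:
                seen.add(url)
                child = Request(url, {"crawlee_data": {"depth": depth + 1}})
                queue.append(child)
    return visited


# Hypothetical link graph standing in for real pages.
LINKS = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
result = crawl("a", LINKS, max_depth=2)
print(result)
```

With `max_depth=2`, page `e` (which would be at depth 3) is never enqueued, while `d` at depth 2 is still visited.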

- https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L457-L463 could serve as inspiration
- we should consider also making this a method of the kv-store resource client and implementing it separately for memory storage and for platform...

enhancement
t-tooling
hacktoberfest

- Modify the `extended_unique_key` computation to include a set of predefined HTTP headers, alongside the existing normalized URL and payload.
- Only include headers from the whitelist.
- Identify which...

enhancement
t-tooling
hacktoberfest
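
One way the header-aware key from the issue above might look is sketched below. Everything here is an assumption for illustration: the function name `compute_extended_unique_key`, the `HEADER_WHITELIST` contents, the key layout, and the use of truncated SHA-256 digests are not taken from Crawlee's code, and the issue itself leaves open which headers belong on the whitelist.

```python
import hashlib
from typing import Optional

# Hypothetical whitelist; the issue leaves the actual header set undecided.
HEADER_WHITELIST = ("accept", "accept-language", "authorization")


def compute_extended_unique_key(
    normalized_url: str,
    payload: bytes = b"",
    headers: Optional[dict] = None,
) -> str:
    """Build a key from the URL, the payload, and whitelisted headers only."""
    headers = headers or {}
    # Header names are case-insensitive, so compare them lowercased.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    # Sort so that header order does not influence the key.
    selected = sorted(
        (name, value) for name, value in lowered.items() if name in HEADER_WHITELIST
    )
    header_blob = "\n".join(f"{name}:{value}" for name, value in selected).encode()
    payload_hash = hashlib.sha256(payload).hexdigest()[:8]
    headers_hash = hashlib.sha256(header_blob).hexdigest()[:8]
    return f"{payload_hash}|{headers_hash}|{normalized_url}"


base = compute_extended_unique_key("https://example.com/", headers={"Accept": "text/html"})
same = compute_extended_unique_key(
    "https://example.com/", headers={"accept": "text/html", "X-Trace-Id": "123"}
)
other = compute_extended_unique_key(
    "https://example.com/", headers={"Accept": "application/json"}
)
print(base == same, base == other)
```

In this sketch, headers outside the whitelist (such as the made-up `X-Trace-Id`) do not affect deduplication, while a different value for a whitelisted header yields a distinct key.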

### Description

- We had `data` and `payload` fields on the `Request` model.
- `payload` was not being provided to the HTTP clients, only the `data` field.
- ~In this...

t-tooling
tested

Currently, we format the changelog to have fully qualified links to GH issues/PRs/users, but the GH release notes don't understand this properly and no longer render the user icons or...

t-tooling
hacktoberfest
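
As a rough illustration of the changelog problem described above, the sketch below rewrites fully qualified Markdown links to issues, PRs, and users back into short `#123` / `@user` references. The link shapes matched by the regular expressions, the function name `shorten_github_links`, and the sample line are all assumptions; the issue text does not state the exact changelog format or the intended fix.

```python
import re

# Assumed link shapes: [#123](https://github.com/<owner>/<repo>/issues|pull/123)
# and [@user](https://github.com/user). The real changelog format may differ.
ISSUE_LINK = re.compile(
    r"\[#(\d+)\]\(https://github\.com/[^/\s]+/[^/\s]+/(?:issues|pull)/\1\)"
)
USER_LINK = re.compile(r"\[@([A-Za-z0-9-]+)\]\(https://github\.com/\1\)")


def shorten_github_links(changelog: str) -> str:
    """Replace fully qualified issue/PR/user links with short references."""
    changelog = ISSUE_LINK.sub(r"#\1", changelog)
    return USER_LINK.sub(r"@\1", changelog)


# Hypothetical changelog line; the PR number and user name are made up.
sample = (
    "Fix export ([#123](https://github.com/apify/crawlee-python/pull/123)) "
    "by [@octocat](https://github.com/octocat)"
)
shortened = shorten_github_links(sample)
print(shortened)
```

Links that do not match these assumed shapes are left untouched.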