crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...
- For now, let's use a data file containing fingerprints (or at minimum user agents) from the Apify fingerprint dataset.
- Use the init script from the [fingerprint suite](https://github.com/apify/fingerprint-suite/blob/master/packages/fingerprint-injector/src/utils.js).
- ...
There are useful configuration options for `json.dump()` which I'd like to pass through `await crawler.export_data("export.json")`, but I see no way to do that:

- `ensure_ascii` - as someone living in a...
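For context, this is what those `json.dump()` options do in plain stdlib Python, independent of Crawlee:

```python
import json

data = {"city": "Brno", "note": "háčky a čárky"}

# Default behavior: non-ASCII characters are escaped to \uXXXX sequences,
# which makes the exported file hard to read for non-English data.
escaped = json.dumps(data)

# With ensure_ascii=False the original characters are kept; indent makes
# the output human-readable.
readable = json.dumps(data, ensure_ascii=False, indent=2)
print(readable)
```

Being able to forward keyword arguments like these to the underlying `json.dump()` call is exactly what the issue asks for.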
Example

```python
async def main() -> None:
    crawler = HttpCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        ...
```
Consider this sample program:

```python
import asyncio

from crawlee.configuration import Configuration
from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext


async def default_handler(context: ParselCrawlingContext) -> None:
    for category in context.selector.xpath(
        '//div[@class="side_categories"]//ul/li/ul/li/a'
    ):
        await context.push_data({"category": ...
```
- We should create a new documentation guide about how to avoid getting blocked.
- Inspiration: https://crawlee.dev/docs/guides/avoid-blocking
- This should be done once the fingerprint-related issues are done (#401, #402).
- Implement "max crawl depth" / "crawling depth limit".
- See https://github.com/apify/crawlee-python/discussions/441
- The depth information should be stored in the `Request` (`user_data` -> `crawlee_data`).
- https://github.com/apify/crawlee/blob/master/packages/core/src/storages/key_value_store.ts#L457-L463 could serve as inspiration - we should consider also making this a method of the kv-store resource client and implementing it separately for memory storage and for platform...
- Modify the `extended_unique_key` computation to include a set of predefined HTTP headers, alongside the existing normalized URL and payload.
- Only include headers from the whitelist.
- Identify which...
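A rough sketch of what such a computation could look like: the whitelist contents and the `compute_extended_unique_key` helper below are assumptions for illustration, not Crawlee's actual implementation.

```python
import hashlib

# Hypothetical whitelist; which headers belong here is exactly the open
# question in the last bullet above.
HEADER_WHITELIST = {"accept", "accept-language", "authorization"}


def compute_extended_unique_key(
    url: str, method: str, payload: bytes, headers: dict[str, str]
) -> str:
    # Keep only whitelisted headers, lowercased and sorted so that header
    # casing and ordering do not produce different keys.
    normalized_headers = "|".join(
        f"{name.lower()}:{value.strip()}"
        for name, value in sorted(headers.items(), key=lambda kv: kv[0].lower())
        if name.lower() in HEADER_WHITELIST
    )
    payload_hash = hashlib.sha256(payload).hexdigest()
    return f"{method}|{url}|{payload_hash}|{normalized_headers}"
```

With this shape, two requests that differ only in a non-whitelisted header (e.g. a tracing header) still deduplicate to the same key, while a different `Authorization` header yields a distinct one.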
### Description

- We had `data` and `payload` fields on the `Request` model.
- `payload` was not being provided to the HTTP clients, only the `data` field.
- ~In this...
Currently, we format the changelog to have fully qualified links to GH issues/PRs/users, but the GH release notes don't understand this properly and no longer render the user icons or...
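One possible workaround, sketched as a post-processing step over the generated changelog text; the regexes and the `shorten_changelog_links` function are hypothetical, not part of the existing release tooling.

```python
import re

# Collapse fully qualified GitHub links back to the short forms that
# release notes auto-link and render with icons: #123 and @user.
ISSUE_LINK = re.compile(r"\[#(\d+)\]\(https://github\.com/[^)]+\)")
USER_LINK = re.compile(r"\[@([\w-]+)\]\(https://github\.com/[^)]+\)")


def shorten_changelog_links(text: str) -> str:
    text = ISSUE_LINK.sub(r"#\1", text)
    return USER_LINK.sub(r"@\1", text)


line = (
    "Fix export ([#500](https://github.com/apify/crawlee-python/pull/500)) "
    "by [@janbuchar](https://github.com/janbuchar)"
)
print(shorten_changelog_links(line))
# → Fix export (#500) by @janbuchar
```

GitHub then expands these short references itself when rendering the release notes, so the user avatars and issue previews come back.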