crawlee-python issues

Make memory-related parameters of `Snapshotter` configurable via `Configuration`

I believe we could establish a relationship between the `Snapshotter` parameters related to memory usage (`max_memory_size`, `max_used_memory_ratio`, `reserve_memory_ratio`) and the similar configuration attributes (`memory_mbytes`, `available_memory_ratio`).
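One possible shape for that relationship, as a minimal sketch only: an explicit `memory_mbytes` wins, and otherwise the limit is derived from `available_memory_ratio` and the total system memory. `MemorySettings` and `resolve_max_memory_bytes` are hypothetical names used for illustration; they are not part of Crawlee, and the precedence rule is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemorySettings:
    """Hypothetical stand-in for the memory-related configuration attributes."""

    available_memory_ratio: float
    memory_mbytes: Optional[int] = None


def resolve_max_memory_bytes(settings: MemorySettings, total_system_memory_bytes: int) -> int:
    """Derive a memory limit in bytes from the configuration (assumed precedence)."""
    # Assumption: an explicit absolute limit overrides the ratio-based one.
    if settings.memory_mbytes is not None:
        return settings.memory_mbytes * 1024 * 1024
    # Otherwise take the configured fraction of the total system memory.
    return int(total_system_memory_bytes * settings.available_memory_ratio)
```

The resulting value could then serve as the default for `max_memory_size` whenever the caller does not pass one explicitly.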

janbuchar

enhancement

t-tooling

Unify `crawlee.memory_storage_client.request_queue_client` with JS counterpart

- The most notable missing feature is limiting the number of requests stored in memory.
- There's also locking.
- #94 is closely related.

janbuchar

t-tooling

debt

Feature Request: Add More Options to Crawlee CLI for Crawler Creation

At the moment, the Crawlee CLI provides two options for creating a crawler:

1. Beautiful Soup
2. Playwright

It would be great if we could add more...

siddiqkaithodu

enhancement

t-tooling

Accept patch/minor/major as the release type

https://github.com/orhun/git-cliff/pull/744 introduced this functionality. Let's use it after it's released.

janbuchar

t-tooling

PlaywrightCrawler sometimes hangs after reaching max_requests_per_crawl

See https://github.com/janbuchar/crawlee-python-demo

janbuchar

bug

t-tooling

Evaluate the efficiency of opening new Playwright tabs versus windows

1

Experiment with [PlaywrightBrowserController](https://github.com/apify/crawlee-python/blob/master/src/crawlee/browsers/playwright_browser_controller.py) to determine whether opening new Playwright pages in tabs performs better than opening them in separate windows (the current state).

vdusek

t-tooling

solutioning

Add fingerprints for HTTPX client

1

- Implement a fingerprint generator for generating real-world HTTP headers.
- Integrate it into [HttpxHttpClient](https://github.com/apify/crawlee-python/blob/master/src/crawlee/http_clients/httpx.py).
- [fingerprint-generator](https://github.com/apify/fingerprint-suite/tree/master/packages/fingerprint-generator) in TS could serve as an inspiration.

vdusek

enhancement

t-tooling

Add fingerprint injector for Playwright crawler

Check the fingerprint injector in the TS version - [fingerprint-injector](https://github.com/apify/fingerprint-suite/tree/master/packages/fingerprint-injector), and implement a similar one for `PlaywrightCrawler` in Python.

vdusek

enhancement

t-tooling

How can I disable the cache completely?

5

I am trying to write a simple function to crawl a website and I don't want crawlee to cache anything (each time I call this function it will do everything...

1hachem

bug

t-tooling

Reconsider crawler inheritance

5

Currently, we have the following inheritance chains:

- `BasicCrawler` -> `HttpCrawler`
- `BasicCrawler` -> `BeautifulSoupCrawler`
- `BasicCrawler` -> `PlaywrightCrawler`
- `BasicCrawler` -> `ParselCrawler` (#348)

This is an intentional difference...

janbuchar

t-tooling

solutioning
