crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...

Results 138 crawlee-python issues

I believe we could establish some kind of relationship between the `Snapshotter` parameters related to memory usage (`max_memory_size`, `max_used_memory_ratio`, `reserve_memory_ratio`) and the similar configuration attributes (`memory_mbytes`, `available_memory_ratio`).

enhancement
t-tooling

- the most notable missing feature is limiting the number of requests stored in memory
- also there's locking
- #94 is closely related

t-tooling
debt

At the moment, the Crawlee CLI provides two options for creating a crawler:
1. Beautiful Soup
2. Playwright
It would be great if we could add more...

enhancement
t-tooling

https://github.com/orhun/git-cliff/pull/744 introduced this functionality. Let's use it after it's released.

t-tooling

See https://github.com/janbuchar/crawlee-python-demo

bug
t-tooling

Try to experiment with [PlaywrightBrowserController](https://github.com/apify/crawlee-python/blob/master/src/crawlee/browsers/playwright_browser_controller.py) to determine whether opening new Playwright pages in tabs offers better performance compared to opening them in separate windows (current state).

t-tooling
solutioning

- Implement a fingerprint generator for generating real-world HTTP headers.
- Integrate it into [HttpxHttpClient](https://github.com/apify/crawlee-python/blob/master/src/crawlee/http_clients/httpx.py).
- [fingerprint-generator](https://github.com/apify/fingerprint-suite/tree/master/packages/fingerprint-generator) in TS could serve as an inspiration.

enhancement
t-tooling

Check the fingerprint injector in the TS version - [fingerprint-injector](https://github.com/apify/fingerprint-suite/tree/master/packages/fingerprint-injector), and implement a similar one for `PlaywrightCrawler` in Python.

enhancement
t-tooling

I am trying to write a simple function to crawl a website and I don't want crawlee to cache anything (each time I call this function it will do everything...

bug
t-tooling

Currently, we have the following inheritance chains:
- `BasicCrawler` -> `HttpCrawler`
- `BasicCrawler` -> `BeautifulSoupCrawler`
- `BasicCrawler` -> `PlaywrightCrawler`
- `BasicCrawler` -> `ParselCrawler` (#348)
This is an intentional difference...

t-tooling
solutioning