crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...
I believe we could establish some kind of a relationship between `Snapshotter` parameters related to memory usage (`max_memory_size`, `max_used_memory_ratio`, `reserve_memory_ratio`) and similar configuration attributes (`memory_mbytes`, `available_memory_ratio`)
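One way such a relationship might look, purely as an illustration: the snapshotter's memory ceiling could be derived from the configuration values when it is not set explicitly. The helper name `resolve_max_memory_bytes` and the `total_system_memory_bytes` argument are hypothetical and not part of Crawlee; only `memory_mbytes` and `available_memory_ratio` come from the issue text, and the precedence rule (explicit limit wins over the ratio) is an assumption.

```python
from __future__ import annotations


def resolve_max_memory_bytes(
    memory_mbytes: int | None,
    available_memory_ratio: float,
    total_system_memory_bytes: int,
) -> int:
    """Hypothetical helper: derive a memory ceiling (in bytes) from configuration.

    Assumed precedence: an explicit `memory_mbytes` wins; otherwise the ceiling
    is a fraction of total system memory given by `available_memory_ratio`.
    """
    if not 0 < available_memory_ratio <= 1:
        raise ValueError('available_memory_ratio must be in the interval (0, 1]')

    if memory_mbytes is not None:
        # Explicit limit configured: convert megabytes to bytes.
        return memory_mbytes * 1024 * 1024

    # No explicit limit: fall back to a share of the system memory.
    return int(total_system_memory_bytes * available_memory_ratio)
```

With a rule like this, a snapshotter parameter such as `max_memory_size` could default to the resolved value instead of being configured separately.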
- The most notable missing feature is limiting the number of requests stored in memory
- also there's locking
- #94 is closely related
At the moment, the Crawlee CLI provides two options for creating a crawler:
1. Beautiful Soup
2. Playwright

It would be great if we could add more...
https://github.com/orhun/git-cliff/pull/744 introduced this functionality. Let's use it once it's released.
See https://github.com/janbuchar/crawlee-python-demo
Try to experiment with [PlaywrightBrowserController](https://github.com/apify/crawlee-python/blob/master/src/crawlee/browsers/playwright_browser_controller.py) to determine whether opening new Playwright pages in tabs offers better performance compared to opening them in separate windows (current state).
- Implement a fingerprint generator for generating real-world HTTP headers.
- Integrate it into [HttpxHttpClient](https://github.com/apify/crawlee-python/blob/master/src/crawlee/http_clients/httpx.py).
- [fingerprint-generator](https://github.com/apify/fingerprint-suite/tree/master/packages/fingerprint-generator) in TS could serve as an inspiration.
Check the fingerprint injector in the TS version - [fingerprint-injector](https://github.com/apify/fingerprint-suite/tree/master/packages/fingerprint-injector), and implement a similar one for `PlaywrightCrawler` in Python.
I am trying to write a simple function to crawl a website and I don't want crawlee to cache anything (each time I call this function it will do everything...
Currently, we have the following inheritance chains:
- `BasicCrawler` -> `HttpCrawler`
- `BasicCrawler` -> `BeautifulSoupCrawler`
- `BasicCrawler` -> `PlaywrightCrawler`
- `BasicCrawler` -> `ParselCrawler` (#348)

This is an intentional difference...
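For readers who prefer code, the chains listed above can be pictured with empty stand-in classes. This is only a sketch of the hierarchy's shape: the class names come from the issue, but the bodies are placeholders and do not reflect the real Crawlee implementations.

```python
class BasicCrawler:
    """Stand-in for the shared base class (placeholder body, not the real one)."""


class HttpCrawler(BasicCrawler):
    """Stand-in: derives directly from BasicCrawler."""


class BeautifulSoupCrawler(BasicCrawler):
    """Stand-in: derives directly from BasicCrawler, not from HttpCrawler."""


class PlaywrightCrawler(BasicCrawler):
    """Stand-in: derives directly from BasicCrawler."""


class ParselCrawler(BasicCrawler):
    """Stand-in: derives directly from BasicCrawler (see #348)."""


# Every specialised crawler has BasicCrawler as its immediate parent.
DIRECT_PARENTS = {
    cls.__name__: cls.__bases__[0].__name__
    for cls in (HttpCrawler, BeautifulSoupCrawler, PlaywrightCrawler, ParselCrawler)
}
```

In this shape, the HTML-parsing crawlers are siblings of `HttpCrawler` rather than subclasses of it.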