crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and o...
### Which package is the feature request for? If unsure which one to select, leave blank

@crawlee/core

### Feature

Currently, there are only very limited methods for a user to...
- There are multiple opportunities throughout Crawlee to improve type safety and optionally introduce runtime validation:
  - Dataset items (`CrawlingContext.pushData`)
  - Key-value store content
  - `Request.userData`
  - Request routing labels...
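As a rough illustration of what typed dataset items with optional runtime validation could look like, here is a minimal standalone sketch. It does not use Crawlee's API: `TypedDataset`, `Validator`, `Product`, and `isProduct` are hypothetical names introduced only for this example, and the in-memory array stands in for real dataset storage.

```typescript
// Hypothetical sketch: none of these names are part of Crawlee's API.

// A validator is a type guard that narrows an unknown value to T.
type Validator<T> = (value: unknown) => value is T;

// Example item shape, assumed for illustration only.
interface Product {
    url: string;
    price: number;
}

// Hand-written runtime check matching the Product interface.
const isProduct: Validator<Product> = (value): value is Product => {
    if (typeof value !== 'object' || value === null) return false;
    const record = value as Record<string, unknown>;
    return typeof record.url === 'string' && typeof record.price === 'number';
};

// Dataset wrapper: the generic parameter gives compile-time safety,
// the optional validator adds a runtime check on every push.
class TypedDataset<T> {
    private readonly items: T[] = [];
    private readonly validate?: Validator<T>;

    constructor(validate?: Validator<T>) {
        this.validate = validate;
    }

    pushData(item: T): void {
        if (this.validate && !this.validate(item)) {
            throw new TypeError('Dataset item failed runtime validation');
        }
        this.items.push(item);
    }

    getItems(): readonly T[] {
        return this.items;
    }
}
```

With this shape, validation stays opt-in: constructing `new TypedDataset<Product>()` without a validator keeps only the compile-time check, while passing `isProduct` also rejects malformed items at runtime.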
Resource management is currently done in multiple places (`BrowserPool`, `SessionPool`, `ProxyConfiguration`...), which leads to complexity and potential resource conflicts. Typical issue:

```typescript
const crawler = new PlaywrightCrawler({
    proxyConfiguration: new ProxyConfiguration({...
```
- there is a lot of data related to a single invocation of the `run()` method in the class
  - `stats`
  - `autoscaledPool`
  - `running`
  - `crawlingContexts` (might make sense...
- The property essentially duplicates what is already present in `RequestList` and `RequestQueue` - this brings no benefit and leads to confusion
- expose an `extractLinks` helper for additional flexibility
- get rid of the `requestQueue` argument
- get rid of `pseudoUrls`
- depends on #2479
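To make the idea of a standalone `extractLinks` helper more concrete, here is one possible shape. The source names the helper but does not define it, so the signature, the `ExtractLinksOptions` type, and the `filter` predicate below are assumptions made for this sketch. It also uses a simple regular expression instead of a real HTML parser, which is enough for illustration but not for production use.

```typescript
// Hypothetical signature: the source names `extractLinks` but does not define it.
interface ExtractLinksOptions {
    html: string;
    baseUrl: string;
    // Optional predicate, assumed here as a plain replacement for pattern-based filters.
    filter?: (url: URL) => boolean;
}

function extractLinks({ html, baseUrl, filter }: ExtractLinksOptions): string[] {
    // Matches href values in <a> tags, in double or single quotes.
    const hrefPattern = /<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)')/gi;
    const found = new Set<string>();
    let match: RegExpExecArray | null;

    while ((match = hrefPattern.exec(html)) !== null) {
        const href = match[1] ?? match[2];
        if (href === undefined) continue;

        let url: URL;
        try {
            // Resolve relative links against the page URL.
            url = new URL(href, baseUrl);
        } catch {
            continue; // skip hrefs that cannot be parsed as URLs
        }

        // Keep only crawlable schemes, then apply the caller's filter.
        if (url.protocol !== 'http:' && url.protocol !== 'https:') continue;
        if (filter && !filter(url)) continue;

        found.add(url.href); // the Set removes duplicates
    }

    return Array.from(found);
}
```

Because the helper only returns URLs, it needs no `requestQueue` argument: the caller decides whether to enqueue the result, log it, or filter it further.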