crawlee

Crawlee: a web scraping and browser automation library for Node.js to build reliable crawlers, in JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and o...

Results 265 crawlee issues

Allows keeping the crawler running even when the queue is empty. Use `crawler.teardown()` to stop it. Closes #1436

Benchmarks (100 `crawlee.dev` URLs):
- `HttpCrawler`: 1.5s
- `CheerioCrawler`: 4s
- `LinkeDOMCrawler`: 8s
Why not JSDOM: https://github.com/jsdom/jsdom/issues/2005 (in other words, it's very slow).
Why not happy-dom: no idea if it's possible JS...

The PR adds the error to the crawling context for the error handlers. Since you can throw *anything*, the correct type for errors in the error handlers is `unknown` instead...
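Because anything can be thrown, handler code has to narrow an `unknown` value before reading properties from it. A minimal sketch of that narrowing in plain TypeScript follows; `describeError` is a hypothetical helper for illustration only and is not part of Crawlee's API.

```typescript
// Hypothetical helper (not a Crawlee API): turn any thrown value into a message.
function describeError(error: unknown): string {
    // Real Error instances carry a name and a message.
    if (error instanceof Error) return `${error.name}: ${error.message}`;
    // Strings are sometimes thrown directly.
    if (typeof error === 'string') return error;
    // Anything else (numbers, null, plain objects) falls back to String().
    return String(error);
}

// Example of use inside a catch block, where the value may be of any type.
function safeRun(task: () => void): string {
    try {
        task();
        return 'ok';
    } catch (error) {
        return describeError(error);
    }
}
```

With this shape, an error handler never assumes `.message` exists; it only reads it after the `instanceof` check has succeeded.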

**Describe the feature** Sometimes we want to keep the crawler alive even if the request queue is empty. The current workaround is to override the `isFinishedFunction` of `AutoscaledPool`.

feature
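The idea behind the workaround above can be illustrated without Crawlee itself. The sketch below uses a toy, synchronous `TinyPool` (a hypothetical stand-in; the real `AutoscaledPool` is asynchronous and considerably more involved) to show how replacing the "is finished" check keeps a pool alive while its queue is empty.

```typescript
type TickResult = 'processed' | 'idle' | 'finished';

// Toy stand-in for a task pool. NOT the real AutoscaledPool.
class TinyPool {
    private readonly queue: string[] = [];
    private readonly isFinished: () => boolean;
    readonly handled: string[] = [];

    // By default an empty queue means the pool is finished.
    constructor(isFinished: () => boolean = () => true) {
        this.isFinished = isFinished;
    }

    enqueue(url: string): void {
        this.queue.push(url);
    }

    // One scheduling step: process an item, or consult the finished check.
    tick(): TickResult {
        const next = this.queue.shift();
        if (next !== undefined) {
            this.handled.push(next);
            return 'processed';
        }
        return this.isFinished() ? 'finished' : 'idle';
    }
}

// Override the check: the pool only finishes once a stop is requested.
let stopRequested = false;
const keepAlivePool = new TinyPool(() => stopRequested);
keepAlivePool.enqueue('https://crawlee.dev');
```

Here an empty queue yields `'idle'` rather than `'finished'` until `stopRequested` is set, which mirrors the role an explicit stop call plays for a crawler that is kept alive.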

[![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
| [ws](https://togithub.com/websockets/ws) | [`^7.5.9` -> `^8.0.0`](https://renovatebot.com/diffs/npm/ws/7.5.9/8.8.1) | [![age](https://badges.renovateapi.com/packages/npm/ws/8.8.1/age-slim)](https://docs.renovatebot.com/merge-confidence/)...

[![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
| [puppeteer](https://togithub.com/puppeteer/puppeteer) | [`>= 9.x ` `16.1.0`](https://renovatebot.com/diffs/npm/puppeteer/14.4.1/16.1.0) |...

``./apify_storage/key_value_stores`` is a generated directory. I think the path of ``INPUT.json`` should be ``./INPUT.json``.

feature

**Describe the bug** When the first async storage method is invoked, the default storages are automatically purged. If there are more racing promises (e.g. via `Promise.all`) doing this, there is a...

bug
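One common way to avoid this kind of race is to memoize the promise of the one-time operation, so that concurrent callers share a single run. The sketch below illustrates that general pattern only; `purgeDefaultStorages` and `purgeOnce` are hypothetical names, and this is not necessarily how Crawlee addresses the issue.

```typescript
// Counts how many times the (simulated) purge actually starts.
let purgeCalls = 0;

// Hypothetical stand-in for the real purge; the await simulates async I/O.
async function purgeDefaultStorages(): Promise<void> {
    purgeCalls += 1;
    await Promise.resolve();
}

// Cached promise shared by every caller.
let purgePromise: Promise<void> | undefined;

// Starts the purge on the first call; later calls reuse the same promise.
function purgeOnce(): Promise<void> {
    if (purgePromise === undefined) {
        purgePromise = purgeDefaultStorages();
    }
    return purgePromise;
}

// Racing callers (e.g. via Promise.all) all wait on one purge.
async function openStoragesConcurrently(): Promise<void> {
    await Promise.all([purgeOnce(), purgeOnce(), purgeOnce()]);
}
```

Because the promise is stored synchronously before any `await`, a second caller arriving in the same tick already sees it and cannot trigger a second purge.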

**Describe the feature** Add a list of error messages, with a count of how many times each occurred, to the `Statistics` object. Let's start with storing 100 messages, trimming to the first...

feature
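A rough sketch of what such a counter could look like is shown below. `ErrorMessageCounter` is a hypothetical class, not the actual `Statistics` implementation. The 100-message cap comes from the request above; the trim length of 200 characters and the choice to ignore new distinct messages once the cap is reached are assumptions made for illustration.

```typescript
// Hypothetical sketch, not Crawlee's Statistics class.
class ErrorMessageCounter {
    private readonly counts = new Map<string, number>();
    private readonly maxMessages: number;
    private readonly maxLength: number;

    // maxMessages = 100 follows the request; maxLength = 200 is an assumption.
    constructor(maxMessages = 100, maxLength = 200) {
        this.maxMessages = maxMessages;
        this.maxLength = maxLength;
    }

    add(error: unknown): void {
        const raw = error instanceof Error ? error.message : String(error);
        // Trim so that long messages do not bloat the statistics.
        const key = raw.slice(0, this.maxLength);
        const seen = this.counts.get(key);
        if (seen !== undefined) {
            this.counts.set(key, seen + 1);
            return;
        }
        // Assumption: once the cap is reached, new distinct messages are ignored.
        if (this.counts.size >= this.maxMessages) return;
        this.counts.set(key, 1);
    }

    // Most frequent messages first.
    top(n: number): Array<[string, number]> {
        return Array.from(this.counts.entries())
            .sort((a, b) => b[1] - a[1])
            .slice(0, n);
    }
}
```

Keying on the trimmed message means two long messages that share the same prefix are counted together, which is a trade-off of this particular sketch.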

Several open issues relate to the way we handle HTTP status codes in crawlers.
- `CheerioCrawler` throws an exception when it encounters a 500+ status code...

Epic
t-tooling