Jindřich Bär

Results: 39 issues authored by Jindřich Bär

The `parseSitemap` helper function does quite a lot of crawling internally. Currently, it's hardcoded to use `got-scraping` for all HTTP requests to pull the sitemap files. We're planning to phase...
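
For illustration, a minimal sketch of one way the hardcoding could be lifted, assuming a hypothetical `SitemapHttpClient` interface and `httpClient` option that are not part of the current `parseSitemap` API:

```typescript
// Hypothetical sketch only: neither `SitemapHttpClient` nor the `httpClient`
// option exist in Crawlee today. The idea is to let callers inject the request
// implementation instead of relying on the hardcoded got-scraping calls.
interface SitemapHttpClient {
    get(url: string, headers?: Record<string, string>): Promise<{ statusCode: number; body: string }>;
}

// A caller-supplied implementation could wrap fetch, got-scraping, or anything else.
const fetchClient: SitemapHttpClient = {
    async get(url, headers) {
        const res = await fetch(url, { headers });
        return { statusCode: res.status, body: await res.text() };
    },
};

// Hypothetical usage; today parseSitemap performs its HTTP requests internally:
// await parseSitemap([{ type: 'url', url: 'https://example.com/sitemap.xml' }], { httpClient: fetchClient });
```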

feature
t-tooling

The `cheerio-stop-resume-ts` E2E test ([link](https://github.com/apify/crawlee/blob/3cd85abebc1335155518717fad81a2d544a54f9d/test/e2e/cheerio-stop-resume-ts/actor/main.ts)) seems to stall randomly when restarting an Actor that was stopped with `.stop()`. https://github.com/apify/crawlee/blob/3cd85abebc1335155518717fad81a2d544a54f9d/test/e2e/cheerio-stop-resume-ts/actor/main.ts#L27-L30 This happens intermittently during Apify Platform E2E runs and has been happening...
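
For context, a hedged sketch of the stop-and-resume pattern the test exercises; the threshold and URL are illustrative, not the actual test code:

```typescript
import { CheerioCrawler } from 'crawlee';

// Illustrative sketch of the stop/resume flow; the real E2E actor differs.
let processed = 0;

const crawler = new CheerioCrawler({
    async requestHandler({ enqueueLinks }) {
        processed += 1;
        await enqueueLinks();
        // Stop the crawl part-way through; the request queue keeps its state.
        if (processed >= 10) crawler.stop();
    },
});

await crawler.run(['https://crawlee.dev']);

// A second run() on the same (persisted) queue should pick up where the first
// one stopped. The flakiness described here is that this resume sometimes stalls.
await crawler.run();
```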

t-tooling

Some sitemaps (e.g. https://docs.superjoin.ai/sitemap.xml) return 404 for the `Accept` HTTP header value that `utils/sitemap` currently sends. https://github.com/apify/crawlee/blob/615c8f9f691fab70d15be84c2ccff29daab4e55e/packages/utils/src/internals/sitemap.ts#L261 Investigate why this value was required in the first place (maybe it actually wasn't). If...
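
A quick way to narrow this down is to compare responses across a few `Accept` values; the values below are examples, not necessarily the exact header `utils/sitemap` sends today:

```typescript
// Sketch: probe the sitemap endpoint with different Accept headers and compare
// the status codes to see which values trigger the 404.
const url = 'https://docs.superjoin.ai/sitemap.xml';

for (const accept of ['text/xml, application/xml', 'application/xml', 'text/plain', '*/*']) {
    const res = await fetch(url, { headers: { accept } });
    console.log(`${accept} -> ${res.status}`);
}
```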

t-tooling

Since in the SDK we want to switch from the custom implementation of `getPublicUrl` to calling the client implementations of those methods (see https://github.com/apify/apify-sdk-js/issues/433), we should add those methods to the local...
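
A rough sketch of what such a method could look like on a local key-value store client; the class name, directory layout, and returned URL shape are all assumptions, not the actual local storage implementation:

```typescript
import { join } from 'node:path';
import { pathToFileURL } from 'node:url';

// Hypothetical local counterpart to the platform client's public record URL.
class LocalKeyValueStoreClient {
    constructor(private readonly storeDir: string) {}

    // On the platform this is a public https:// URL; locally the closest
    // equivalent is a pointer to the record on disk.
    getPublicUrl(key: string): string {
        return pathToFileURL(join(this.storeDir, key)).href;
    }
}

const client = new LocalKeyValueStoreClient('./storage/key_value_stores/default');
console.log(client.getPublicUrl('INPUT.json'));
```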

t-tooling

When `maxRequestsPerCrawl` is used, the crawler doesn't enqueue over this limit, which saves RQ writes. The limit doesn't account for RQ deduplication, though:

- `maxRequestsPerCrawl` is e.g. `10`
- The first...
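
For illustration, a minimal configuration in which the two mechanisms interact; the URL and limit are made up and the truncated scenario above is not reproduced verbatim:

```typescript
import { CheerioCrawler } from 'crawlee';

// The enqueue cap derived from maxRequestsPerCrawl does not know about
// request-queue deduplication, so duplicate URLs can consume the budget
// without adding new work to the queue (the scenario described above).
const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 10,
    async requestHandler({ enqueueLinks }) {
        // Pages that repeatedly link to the same URLs can hit the cap early,
        // even though far fewer than 10 unique requests were ever enqueued.
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```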

t-tooling

If the parameter / return type is wrapped in utility types (adding / omitting / excluding some properties or types), the Docusaurus plugin resolves the utility types, which results in...
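
For illustration, the kind of signature that triggers this; the types are made up:

```typescript
// The declared signature references CrawlerOptions through a utility type.
interface CrawlerOptions {
    maxConcurrency: number;
    maxRequestsPerCrawl: number;
    requestHandler: () => Promise<void>;
}

// Ideally the docs would render the parameter as `Omit<CrawlerOptions, 'requestHandler'>`,
// keeping the link back to CrawlerOptions...
function configure(options: Omit<CrawlerOptions, 'requestHandler'>): void {
    console.log(Object.keys(options));
}

// ...but when the plugin resolves the utility type, the reader instead sees the
// fully expanded shape `{ maxConcurrency: number; maxRequestsPerCrawl: number }`.
```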

t-tooling

Resource management is currently done in multiple places (`BrowserPool`, `SessionPool`, `ProxyConfiguration`, ...), which leads to complexity and potential resource conflicts. Typical issue:

```typescript
const crawler = new PlaywrightCrawler({
    proxyConfiguration: new ProxyConfiguration({...
```
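
For illustration, a sketch of the kind of setup where the separately managed resources meet; this is the general shape of the problem, not the issue's exact example:

```typescript
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Proxies, sessions, and browsers are each configured and managed by a
// different component (ProxyConfiguration, SessionPool, BrowserPool), which is
// the fragmentation described above.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy-1.example.com:8000', 'http://proxy-2.example.com:8000'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    useSessionPool: true,
    sessionPoolOptions: { maxPoolSize: 20 },
    browserPoolOptions: { retireBrowserAfterPageCount: 50 },
    async requestHandler({ page }) {
        // Each of the pools applies its own rotation and retirement rules to
        // the resources used by this single page.
        await page.title();
    },
});

await crawler.run(['https://crawlee.dev']);
```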

t-tooling

The Node.js process using a `napi-rs`-built package hangs indefinitely on Windows when it should exit. At this stage, the process doesn't respond to any signals (Ctrl+C etc.) and has to be...

bug

The `preLaunchHook` in Web Scraper initializes `DevToolsServer`: https://github.com/apify/apify-sdk-js/blob/48e4c51241b45ca1e526e0ad45edef2127616650/packages/actor-scraper/web-scraper/src/internals/crawler_setup.ts#L318-L326 In case the page / browser crashes (see the highlighted section in the screenshot below), the dev tools server doesn't exit correctly and...
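
A hedged sketch of the general shape: a server started from a launch hook with no teardown tied to the browser's lifetime. The hook wiring and the commented-out `DevToolsServer` call are simplified assumptions, not the actual `crawler_setup.ts` code:

```typescript
import { PlaywrightCrawler } from 'crawlee';

// The server is started once when a browser launches, but nothing stops it if
// the page or browser later crashes, so the process can keep hanging on it.
let devToolsServerStarted = false;

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        preLaunchHooks: [
            async () => {
                if (devToolsServerStarted) return;
                devToolsServerStarted = true;
                // await new DevToolsServer({ /* ... */ }).start(); // assumed API
            },
        ],
        // One possible direction: pair the start with a teardown (e.g. in
        // postPageCloseHooks or when the browser controller is retired) so a
        // crashed browser does not leave the server running.
    },
    async requestHandler() {},
});
```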

t-tooling