Enable switching HTTP client in `parseSitemap`

Open barjin opened this issue 6 months ago • 1 comments

The parseSitemap helper function does quite a lot of crawling internally. Currently, it's hardcoded to use got-scraping for all HTTP requests to pull the sitemap files. We're planning to phase out got-scraping with Crawlee v4.

It would make sense for parseSitemap to accept an httpClient option, as the crawler instances do.

Motivation

Impit is a more customizable HTTP client than got-scraping.

Ideal solution or implementation, and any additional constraints

This should be fairly simple: add one parameter and call HttpClient.stream instead of got-scraping.stream.
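
A rough sketch of what this could look like, not the actual Crawlee implementation. The names `StreamingHttpClient`, `SitemapFetchOptions`, `fetchClient`, and `fetchSitemapText` are hypothetical, and the real `HttpClient` interface may have a different `stream` signature. The fallback here uses the global `fetch` (Node 18+) purely as a stand-in for whatever default client the helper ships with.

```typescript
// Hypothetical minimal client shape; Crawlee's actual HttpClient interface may differ.
interface StreamingHttpClient {
    stream(url: string): Promise<AsyncIterable<Uint8Array>>;
}

// Hypothetical options object: the one new parameter proposed in this issue.
interface SitemapFetchOptions {
    httpClient?: StreamingHttpClient;
}

// Fallback client built on the global fetch (Node 18+).
// It only stands in for the currently hardcoded client.
const fetchClient: StreamingHttpClient = {
    async stream(url) {
        const response = await fetch(url);
        if (!response.ok || !response.body) {
            throw new Error(`Failed to fetch ${url}: HTTP ${response.status}`);
        }
        // In Node, the response body is async-iterable; the cast covers older typings.
        return response.body as unknown as AsyncIterable<Uint8Array>;
    },
};

// Downloads one sitemap file through the caller's client, or the fallback if none is given.
async function fetchSitemapText(url: string, options: SitemapFetchOptions = {}): Promise<string> {
    const client = options.httpClient ?? fetchClient;
    const decoder = new TextDecoder();
    let text = '';
    for await (const chunk of await client.stream(url)) {
        // stream: true keeps multi-byte characters intact across chunk boundaries.
        text += decoder.decode(chunk, { stream: true });
    }
    return text + decoder.decode();
}
```

With this shape, existing callers keep working unchanged, while a caller that wants a different client passes it as `{ httpClient }`. The helper never needs to know which concrete client it received.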

Alternative solutions or implementations

No response

Other context

No response

Jun 27 '25 11:06 barjin

parseSitemap lives in @crawlee/utils, HttpClient in @crawlee/core. @crawlee/utils likely shouldn't depend on the core package. We'll likely have to extract HttpClient into a separate package (or utils?).

Jun 30 '25 08:06 barjin

Closed by #3306

Dec 18 '25 14:12 barjin