
Allow passing in a custom request adapter to HttpCrawler and BasicCrawler

Open foxt451 opened this issue 1 year ago • 0 comments

Which package is the feature request for? If unsure which one to select, leave blank

@crawlee/http (HttpCrawler), @crawlee/basic-crawler (BasicCrawler)

Feature

Add an option to provide a custom HTTP adapter (client) to HttpCrawler and BasicCrawler, and use it in the _requestAsBrowser and sendRequest functions.

Motivation

Currently, HttpCrawler and BasicCrawler are hard-wired to use the gotScraping instance imported from the got-scraping library. This makes it really inconvenient to extend the default browser-mimicking behaviour, for example if you'd like to alter the TLS hooks provided by that library, or if you'd like to use a different library altogether, such as axios. You can modify gotOptions in a preNavigationHook, but the got-scraping Got instance will still be used (and only after the hook), so you can't do something as convenient as:

import { gotScraping } from "got-scraping"
const newInstance = gotScraping.extend({
....
})

// pass your new instance to the crawler
...

And, obviously, you can't switch request libraries to your liking.

Ideal solution or implementation, and any additional constraints

Add something like an httpAdapter property to BasicCrawlerOptions, default-initialized to gotScraping, and assign it to a public, modifiable field in the constructor. Then use this field in _requestAsBrowser (the only place where HttpCrawler currently uses gotScraping to make requests) and in sendRequest (BasicCrawler). The object passed as httpAdapter would have to adhere to a common interface that will likely, but not necessarily, resemble Got's interface. It should cover only the part that crawlee actually uses, so that it stays minimal and easy to implement ad hoc.

For more customization, httpAdapter could instead live in HttpCrawlerOptions, with a separate sendRequestAdapter added to BasicCrawlerOptions; HttpCrawler would then default sendRequestAdapter to httpAdapter, while still allowing both to be customized. It would also make sense to let callers of sendRequest override the adapter on every call.
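A rough sketch of how the first variant could look, just to make the idea concrete. Everything here is hypothetical: HttpAdapter, AdapterRequest, AdapterResponse and SketchCrawler are illustrative names, not existing crawlee APIs, and the default adapter is an offline stand-in for the gotScraping-backed one.

```typescript
// Hypothetical minimal interface; none of these names exist in crawlee today.
interface AdapterRequest {
    url: string;
    method?: string;
    headers?: Record<string, string>;
}

interface AdapterResponse {
    statusCode: number;
    headers: Record<string, string>;
    body: string;
}

interface HttpAdapter {
    sendRequest(request: AdapterRequest): Promise<AdapterResponse>;
}

// Stand-in for a gotScraping-backed default. It returns a canned response
// so that the sketch runs without network access.
const defaultAdapter: HttpAdapter = {
    async sendRequest(request) {
        return { statusCode: 200, headers: {}, body: `default:${request.url}` };
    },
};

interface SketchCrawlerOptions {
    httpAdapter?: HttpAdapter;
}

// Simplified stand-in for BasicCrawler; only the adapter wiring is shown.
class SketchCrawler {
    // Public and modifiable, as proposed above.
    public httpAdapter: HttpAdapter;

    constructor(options: SketchCrawlerOptions = {}) {
        this.httpAdapter = options.httpAdapter ?? defaultAdapter;
    }

    // A per-call override takes precedence over the crawler-level adapter.
    async sendRequest(
        request: AdapterRequest,
        adapterOverride?: HttpAdapter,
    ): Promise<AdapterResponse> {
        const adapter = adapterOverride ?? this.httpAdapter;
        return adapter.sendRequest(request);
    }
}
```

With something like this, an axios-based or extended got-scraping client would only need a thin wrapper implementing sendRequest, and could be passed either in the constructor options or as the second argument of an individual sendRequest call.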

Alternative solutions or implementations

No response

Other context

No response

foxt451 avatar Aug 25 '23 13:08 foxt451