feat: allow using other HTTP clients
- closes #2659
See https://gist.github.com/janbuchar/3a4724927de2c3a0bb16c46bb5940236 for an example curl-impersonate client.
The following got-scraping options were left out (they will still work, but they're not part of the new interface):
- `decompress`
- `resolveBodyOnly`
- `allowGetBody`
- `dnsLookup`
- `dnsCache`
- `dnsLookupIpVersion`
- `retry`
- `hooks`
- `parseJson`
- `stringifyJson`
- `request`
- `cache`
- `cacheOptions`
- `http2`
- `https`
- `agent`
- `localAddress`
- `createConnection`
- `pagination`
- `setHost`
- `maxHeaderSize`
- `methodRewriting`
- `enableUnixSockets`
- `context`
So if I understand this correctly, we cannot get rid of `got-scraping` just yet because of the breaking change in `BasicCrawlingContext.sendRequest`, right?
Does it mean we have to model the `BaseHttpClient` to follow the `got-scraping` interfaces, though? It would make more sense to me to make `BaseHttpClient` independent (or have it follow some well-known API like `fetch`) and only care about the `got-scraping` interface where it matters, i.e. in `sendRequest`.
In `sendRequest`, we would then translate from the `got-scraping` interface to the independent fetch API (and, in the case of `GotScrapingClient`, later from the fetch API back to the `got-scraping` calls).
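For illustration, a minimal sketch of that translation layer, covering only a trivial subset of options; `GotOptionsSubset` and `toFetchRequest` are hypothetical names, not anything from this PR:

```typescript
// Hypothetical sketch: translate a small subset of got-scraping options
// into a fetch-style Request before handing it to the pluggable client.
interface GotOptionsSubset {
    url: string;
    method?: string;
    headers?: Record<string, string>;
    body?: string;
}

function toFetchRequest(options: GotOptionsSubset): Request {
    // Request is the standard fetch API class (global in Node 18+).
    return new Request(options.url, {
        method: options.method ?? 'GET',
        headers: options.headers,
        body: options.body,
    });
}
```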
Also, a completely unrelated thought / nit, but what file and directory casing convention are we following in Crawlee? It's already messy in the current codebase, but seeing snake_case files together with kebab-case in this PR made me think about it once again 😅
> So if I understand this correctly, we cannot get rid of `got-scraping` just yet because of the breaking change in `BasicCrawlingContext.sendRequest`, right?
Well, yeah, since `got` is part of our public API, we cannot completely remove it :shrug:. We'll see what our default HTTP client for 4.0 will be.
> Does it mean we have to model the `BaseHttpClient` to follow the `got-scraping` interfaces, though? It would make more sense to me to make `BaseHttpClient` independent (or have it follow some well-known API like `fetch`) and only care about the `got-scraping` interface where it matters, i.e. in `sendRequest`.
Not really, but I'd prefer to implicitly support got stuff that won't be part of the HTTP client interface (e.g., `parseJson` or `cacheOptions`), and this way we can do it kinda easily with index signatures on the HTTP request/response. It would be doable if we used a fetch-like interface, but a bit more difficult.
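To make the index-signature idea concrete, here is an illustrative shape (the names are mine, not the PR's):

```typescript
// Illustrative only: the index signature lets implementations accept
// client-specific options (like got's parseJson or cacheOptions) without
// the core interface having to know about each of them.
interface HttpRequestLike {
    url: string;
    method?: string;
    headers?: Record<string, string | string[] | undefined>;
    // Anything else passes through untyped and is interpreted (or ignored)
    // by the concrete client implementation.
    [clientSpecificOption: string]: unknown;
}
```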
> Also, a completely unrelated thought / nit, but what file and directory casing convention are we following in Crawlee? It's already messy in the current codebase, but seeing snake_case files together with kebab-case in this PR made me think about it once again 😅
Well, I didn't notice any convention, so I just look at the nearest files and make a best effort.
Historically, things were snake case, but I believe nobody from the current team is in favor of that (I never liked it). I would go with kebab case for anything new, and maybe unify in v4 (or sooner, I don't think it has any effect downstream, since we don't allow deep imports anywhere).
Alright, I just went through it; it seems like reasonable stuff to me 👍🏽 I dislike the idea of `responseType` on `Request` from `got`, but the generic typing looks good, so I would happily keep it until the API change.
Two ideas:
- The file naming (`kebab-case`) from the review above
- Some tests, perhaps? Those IMO don't have to be as extensive as the `curl-impersonate` gist, but showing that this all works (and how it works) would be nice.
Also, it would be cool if we used this for network request mocking in new tests — that is, if the minimal HTTP client setup is quicker than setting up a mock server.
> Alright, I just went through it; it seems like reasonable stuff to me 👍🏽 I dislike the idea of `responseType` on `Request` from `got`, but the generic typing looks good, so I would happily keep it until the API change.
I'm still in favor of removing it right now (https://github.com/apify/crawlee/pull/2661#discussion_r1787906158).
> - The file naming (`kebab-case`) from the review above
Sure thing, kebabs all the way :tada:
> - Some tests, perhaps? Those IMO don't have to be as extensive as the `curl-impersonate` gist, but showing that this all works (and how it works) would be nice.
I'd argue that since this really is just a refactor, new tests shouldn't be necessary.
> Also, it would be cool if we used this for network request mocking in new tests — that is, if the minimal HTTP client setup is quicker than setting up a mock server.
Yeah, for unit tests, it'd definitely be nice if we could just mock this service out. Would that qualify as a test of this PR as well for you? :slightly_smiling_face:
> I'd argue that since this really is just a refactor, new tests shouldn't be necessary.
This is a feature, not a refactor. We are adding a new public option that enables new behavior.
But I agree that we don't need many tests, given that things should be well covered by the existing ones. I would still like to see at least one e2e test using the `curl-impersonate` client.
> new tests shouldn't be necessary
Sure, I'd just like to see the "minimal HTTP client example", but you're probably right; we don't need to add more tests because of that.
> ...could just mock this service out. Would that qualify as a test of this PR as well for you?
See above - I'd really just want to see how nifty the usage is. There is no need to replace it in all the tests, but e.g. one `HttpCrawler` test where we just pass an `httpClient` that always returns "Hello world!" and check that everything works as expected would be nice.
Nevermind, I'm dense. The current `HttpCrawler` tests obviously already use these changes, so everything works as expected 🤦🏽‍♂️ Still, one little test with a custom `httpClient` maybe?
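A rough sketch of the kind of test meant here, assuming the client interface boils down to something like `sendRequest(request) => response`; the exact methods the crawler calls (e.g. a streaming variant) may differ, hence the loose cast:

```typescript
import { HttpCrawler } from '@crawlee/http';

// Stub client that answers every request with "Hello world!".
// Cast to `any` because this sketch does not implement the full interface.
const helloClient = {
    async sendRequest() {
        return {
            statusCode: 200,
            headers: { 'content-type': 'text/plain' },
            body: Buffer.from('Hello world!'),
        };
    },
} as any;

// vitest-style test
test('HttpCrawler works with a custom httpClient', async () => {
    const bodies: string[] = [];
    const crawler = new HttpCrawler({
        httpClient: helloClient,
        requestHandler: async ({ body }) => {
            bodies.push(body.toString());
        },
    });
    await crawler.run(['https://example.com']);
    expect(bodies).toEqual(['Hello world!']);
});
```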
> This is a feature, not a refactor. We are adding a new public option that enables new behavior.
This also reminds me we need some docs around it. It would be nice to have a guide on how to create the `fetch` HTTP client.
> This is a feature, not a refactor. We are adding a new public option that enables new behavior.
>
> This also reminds me we need some docs around it. It would be nice to have a guide on how to create the `fetch` HTTP client.
Absolutely. I just have this hunch against advertising this interface - we will probably break it in just a couple of months.
Well, we will break it with a breaking release, that's fine. We could also state that this is an experimental API, subject to change. Either way, we need to document this.
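For reference, a minimal sketch of what such a `fetch`-based client guide could show, assuming the interface is roughly `{ sendRequest(request): Promise<response> }`; the real `BaseHttpClient` from this PR may require more (streaming, redirect handling, proxy support), so treat this as illustrative:

```typescript
// Minimal fetch-backed client sketch. The request/response shapes here are
// assumptions for illustration, not the exact types from the PR.
class FetchClient {
    async sendRequest(request: {
        url: string;
        method?: string;
        headers?: Record<string, string>;
        body?: string;
    }) {
        // Global fetch is available in Node 18+.
        const response = await fetch(request.url, {
            method: request.method ?? 'GET',
            headers: request.headers,
            body: request.body,
        });

        return {
            url: response.url,
            statusCode: response.status,
            headers: Object.fromEntries(response.headers.entries()),
            body: Buffer.from(await response.arrayBuffer()),
        };
    }
}
```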
@barjin @B4nan I added a basic test and a guide (I'll pull out the code samples later). Can you check it out? The request type will sadly have to stay this way.
Then how can I send a stream request from the router handler using `sendRequest`? `crawlingContext` only exposes a `sendRequest` property created by `createSendRequest`.
> Then how can I send a stream request from the router handler using `sendRequest`? `crawlingContext` only exposes a `sendRequest` property created by `createSendRequest`.
That's the point, you don't have to change anything in your `sendRequest` calls. You just set https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions#httpClient to an alternative implementation of `BaseHttpClient` and you're done.
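A hedged usage sketch, where `MyCurlImpersonateClient` is a hypothetical stand-in for any `BaseHttpClient` implementation (e.g. the one from the linked gist); the `sendRequest` call keeps its got-scraping-style signature, only the transport underneath changes:

```typescript
import { BasicCrawler } from 'crawlee';
// Hypothetical import; any BaseHttpClient implementation works here.
import { MyCurlImpersonateClient } from './my-curl-impersonate-client';

const crawler = new BasicCrawler({
    // Swap the transport without touching the handler code.
    httpClient: new MyCurlImpersonateClient(),
    requestHandler: async ({ sendRequest, log }) => {
        // Same got-scraping-style call as before; it fetches the
        // context's current request by default.
        const { body } = await sendRequest();
        log.info(body);
    },
});

await crawler.run(['https://example.com']);
```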