feat: allow using other HTTP clients
- closes #2659
See https://gist.github.com/janbuchar/3a4724927de2c3a0bb16c46bb5940236 for an example curl-impersonate client.
The following got-scraping options were left out (they will still work, but they're not part of the new interface):
- `decompress`
- `resolveBodyOnly`
- `allowGetBody`
- `dnsLookup`
- `dnsCache`
- `dnsLookupIpVersion`
- `retry`
- `hooks`
- `parseJson`
- `stringifyJson`
- `request`
- `cache`
- `cacheOptions`
- `http2`
- `https`
- `agent`
- `localAddress`
- `createConnection`
- `pagination`
- `setHost`
- `maxHeaderSize`
- `methodRewriting`
- `enableUnixSockets`
- `context`
So if I understand this correctly, we cannot get rid of `got-scraping` just yet because of the breaking change in `BasicCrawlingContext.sendRequest`, right?
Does it mean we have to model the `BaseHttpClient` to follow the `got-scraping` interfaces, though? It would make more sense to me to make `BaseHttpClient` independent (or have it follow some well-known API like `fetch`) and only care about the `got-scraping` interface where it matters, i.e. in `sendRequest`.
In `sendRequest`, we would then translate from the `got-scraping` interface to the independent fetch API (and, in the case of `GotScrapingClient`, later from the fetch API back to the `got-scraping` calls).
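For illustration, a minimal sketch of that translation layer, covering only a trivial subset of options; `GotOptionsSubset` and `toFetchRequest` are hypothetical names, not anything from this PR:

```typescript
// Hypothetical sketch: translate a small subset of got-scraping options
// into a fetch-style Request before handing it to the pluggable client.
interface GotOptionsSubset {
    url: string;
    method?: string;
    headers?: Record<string, string>;
    body?: string;
}

function toFetchRequest(options: GotOptionsSubset): Request {
    // Request is the standard fetch API class (global in Node 18+).
    return new Request(options.url, {
        method: options.method ?? 'GET',
        headers: options.headers,
        body: options.body,
    });
}
```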
Also, a completely unrelated thought / nit, but what file and directory casing convention are we following in Crawlee? It's already messy in the current codebase, but seeing snake_case files together with kebab-case in this PR made me think about it once again 😅
> So if I understand this correctly, we cannot get rid of `got-scraping` just yet because of the breaking change in `BasicCrawlingContext.sendRequest`, right?
Well, yeah, since `got` is part of our public API, we cannot completely remove it :shrug:. We'll see what our default HTTP client for 4.0 will be.
> Does it mean we have to model the `BaseHttpClient` to follow the `got-scraping` interfaces, though? It would make more sense to me to make `BaseHttpClient` independent (or have it follow some well-known API like `fetch`) and only care about the `got-scraping` interface where it matters, i.e. in `sendRequest`.
Not really, but I'd prefer to implicitly support got stuff that won't be part of the HTTP client interface (e.g., `parseJson` or `cacheOptions`), and this way we can do it kinda easily with index signatures on the HTTP request/response. It would be doable if we used a fetch-like interface, but a bit more difficult.
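To make the index-signature idea concrete, here is an illustrative shape (the names are mine, not the PR's):

```typescript
// Illustrative only: the index signature lets implementations accept
// client-specific options (like got's parseJson or cacheOptions) without
// the core interface having to know about each of them.
interface HttpRequestLike {
    url: string;
    method?: string;
    headers?: Record<string, string | string[] | undefined>;
    // Anything else passes through untyped and is interpreted (or ignored)
    // by the concrete client implementation.
    [clientSpecificOption: string]: unknown;
}
```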
> Also, a completely unrelated thought / nit, but what file and directory casing convention are we following in Crawlee? It's already messy in the current codebase, but seeing snake_case files together with kebab-case in this PR made me think about it once again 😅
Well, I didn't notice any convention, so I just look at the nearest files and make a best effort.
Historically, things were snake case, but I believe nobody from the current team is in favor of that (I never liked it). I would go with kebab case for anything new, and maybe unify in v4 (or sooner, I don't think it has any effect downstream, since we don't allow deep imports anywhere).
Alright, I just went through it; it seems like reasonable stuff to me 👍🏽 I dislike the idea of `responseType` on `Request` from `got`, but the generic typing looks good, so I would happily keep it until the API change.
Two ideas:
- The file naming (`kebab-case`) from the review above
- Some tests, perhaps? Those IMO don't have to be as extensive as the `curl-impersonate` gist, but showing that this all works (and how it works) would be nice.
Also, it would be cool if we used this for network request mocking in new tests — that is, if the minimal HTTP client setup is quicker than setting up a mock server.
> Alright, I just went through it; it seems like reasonable stuff to me 👍🏽 I dislike the idea of `responseType` on `Request` from `got`, but the generic typing looks good, so I would happily keep it until the API change.
I'm still in favor of removing it right now (https://github.com/apify/crawlee/pull/2661#discussion_r1787906158).
> - The file naming (`kebab-case`) from the review above
Sure thing, kebabs all the way :tada:
> - Some tests, perhaps? Those IMO don't have to be as extensive as the `curl-impersonate` gist, but showing that this all works (and how it works) would be nice.
I'd argue that since this really is just a refactor, new tests shouldn't be necessary.
> Also, it would be cool if we used this for network request mocking in new tests — that is, if the minimal HTTP client setup is quicker than setting up a mock server.
Yeah, for unit tests, it'd definitely be nice if we could just mock this service out. Would that qualify as a test of this PR as well for you? :slightly_smiling_face:
> I'd argue that since this really is just a refactor, new tests shouldn't be necessary.
This is a feature, not a refactor. We are adding a new public option that enables new behavior.
But I agree that we don't need many tests, given that things should be well covered by the existing ones. I would still like to see at least one e2e test using the `curl-impersonate` client.
> new tests shouldn't be necessary
Sure, I'd just like to see the "minimal HTTP client example", but you're probably right; we don't need to add more tests because of that.
> ...could just mock this service out. Would that qualify as a test of this PR as well for you?
See above - I'd really just want to see how nifty the usage is. There is no need to replace it in all the tests, but e.g. one `HttpCrawler` test where we just pass an `httpClient` that always returns "Hello world!" and check that everything works as expected would be nice.
Nevermind, I'm dense. The current `HttpCrawler` tests obviously already use these changes, so everything works as expected 🤦🏽‍♂️ Still, one little test with a custom `httpClient` maybe?
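A rough sketch of the kind of test meant here, assuming the client interface boils down to something like `sendRequest(request) => response`; the exact methods the crawler calls (e.g. a streaming variant) may differ, hence the loose cast:

```typescript
import { HttpCrawler } from '@crawlee/http';

// Stub client that answers every request with "Hello world!".
// Cast to `any` because this sketch does not implement the full interface.
const helloClient = {
    async sendRequest() {
        return {
            statusCode: 200,
            headers: { 'content-type': 'text/plain' },
            body: Buffer.from('Hello world!'),
        };
    },
} as any;

// vitest-style test
test('HttpCrawler works with a custom httpClient', async () => {
    const bodies: string[] = [];
    const crawler = new HttpCrawler({
        httpClient: helloClient,
        requestHandler: async ({ body }) => {
            bodies.push(body.toString());
        },
    });
    await crawler.run(['https://example.com']);
    expect(bodies).toEqual(['Hello world!']);
});
```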
> This is a feature, not a refactor. We are adding a new public option that enables new behavior.
This also reminds me we need some docs around it. It would be nice to have a guide on how to create the `fetch` HTTP client.
> This is a feature, not a refactor. We are adding a new public option that enables new behavior.
>
> This also reminds me we need some docs around it. It would be nice to have a guide on how to create the `fetch` HTTP client.
Absolutely. I just have this hunch against advertising this interface - we will probably break it in just a couple of months.
Well, we will break it with a breaking release, that's fine. We could also state that this is an experimental API, subject to change. Either way, we need to document this.
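For reference, a minimal sketch of what such a `fetch`-based client guide could show, assuming the interface is roughly `{ sendRequest(request): Promise<response> }`; the real `BaseHttpClient` from this PR may require more (streaming, redirect handling, proxy support), so treat this as illustrative:

```typescript
// Minimal fetch-backed client sketch. The request/response shapes here are
// assumptions for illustration, not the exact types from the PR.
class FetchClient {
    async sendRequest(request: {
        url: string;
        method?: string;
        headers?: Record<string, string>;
        body?: string;
    }) {
        // Global fetch is available in Node 18+.
        const response = await fetch(request.url, {
            method: request.method ?? 'GET',
            headers: request.headers,
            body: request.body,
        });

        return {
            url: response.url,
            statusCode: response.status,
            headers: Object.fromEntries(response.headers.entries()),
            body: Buffer.from(await response.arrayBuffer()),
        };
    }
}
```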
@barjin @B4nan I added a basic test and a guide (I'll pull out the code samples later). Can you check it out? The request type will sadly have to stay this way.
Then how can I send a stream request from the router handler using `sendRequest`? `crawlingContext` only exposes a `sendRequest` property created by `createSendRequest`.
> Then how can I send a stream request from the router handler using `sendRequest`? `crawlingContext` only exposes a `sendRequest` property created by `createSendRequest`.
That's the point, you don't have to change anything in your `sendRequest` calls. You just set https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions#httpClient to an alternative implementation of `BaseHttpClient` and you're done.
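A hedged usage sketch, where `MyCurlImpersonateClient` is a hypothetical stand-in for any `BaseHttpClient` implementation (e.g. the one from the linked gist); the `sendRequest` call keeps its got-scraping-style signature, only the transport underneath changes:

```typescript
import { BasicCrawler } from 'crawlee';
// Hypothetical import; any BaseHttpClient implementation works here.
import { MyCurlImpersonateClient } from './my-curl-impersonate-client';

const crawler = new BasicCrawler({
    // Swap the transport without touching the handler code.
    httpClient: new MyCurlImpersonateClient(),
    requestHandler: async ({ sendRequest, log }) => {
        // Same got-scraping-style call as before; it fetches the
        // context's current request by default.
        const { body } = await sendRequest();
        log.info(body);
    },
});

await crawler.run(['https://example.com']);
```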