
Use custom request library

Open alexkreidler opened this issue 1 year ago • 1 comment

Reqwest is a great library, but getting around certain anti-bot/CDN software requires a library that impersonates web browsers' TLS fingerprints, such as https://github.com/4JX/reqwest-impersonate or https://github.com/penumbra-x/rquest. Both libraries also generate the user-agent and other headers.

It would be great if we could use a custom library to make requests. Since both libraries expose the same Response struct from reqwest, it should hopefully be easy to swap the implementations, but this probably depends on how much other reqwest-specific code there is in Spider. It may be worth creating a few traits for client, request, and response interfaces that multiple libraries can implement.
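As a rough illustration of the trait idea, here is a minimal sketch using only the standard library. Every name in it (`HttpClient`, `HttpResponse`, `SimpleRequest`, `StubClient`, `fetch_status`) is hypothetical and not part of Spider, reqwest, reqwest-impersonate, or rquest. It is synchronous for brevity; a real version would most likely be async.

```rust
use std::collections::HashMap;

/// Hypothetical response interface; not an existing Spider or reqwest type.
pub trait HttpResponse {
    fn status(&self) -> u16;
    fn body(&self) -> &[u8];
}

/// Hypothetical backend-agnostic request description.
pub struct SimpleRequest {
    pub url: String,
    pub headers: HashMap<String, String>,
}

/// Hypothetical client interface that each request library would implement.
pub trait HttpClient {
    type Response: HttpResponse;
    type Error: std::fmt::Debug;

    fn execute(&self, request: &SimpleRequest) -> Result<Self::Response, Self::Error>;
}

/// Stand-in response used only to show the shape of an implementation.
pub struct StubResponse {
    status: u16,
    body: Vec<u8>,
}

impl HttpResponse for StubResponse {
    fn status(&self) -> u16 {
        self.status
    }

    fn body(&self) -> &[u8] {
        &self.body
    }
}

/// Stand-in backend that performs no network I/O.
pub struct StubClient;

impl HttpClient for StubClient {
    type Response = StubResponse;
    type Error = String;

    fn execute(&self, request: &SimpleRequest) -> Result<StubResponse, String> {
        if request.url.is_empty() {
            return Err("empty URL".to_string());
        }
        Ok(StubResponse {
            status: 200,
            body: format!("stub body for {}", request.url).into_bytes(),
        })
    }
}

/// Crawler-side code depends only on the traits, not on a concrete library.
pub fn fetch_status<C: HttpClient>(client: &C, url: &str) -> Result<u16, C::Error> {
    let request = SimpleRequest {
        url: url.to_string(),
        headers: HashMap::new(),
    };
    Ok(client.execute(&request)?.status())
}
```

With this shape, swapping in a different backend would mean writing one more `impl HttpClient` block, while callers such as `fetch_status` stay unchanged.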

alexkreidler, Oct 22 '24 20:10

Hi @alexkreidler, I like this idea. We already handle some of the custom headers and user agents in the crate. It would be great to build a new custom request library to isolate these changes even further. One area where this could help is proxies: the original request crate does not perform any rotation, even if you pass in multiple proxies. With the new client, we can add custom logic to handle the edge cases of web crawling. It would also help with one-off requests for single pages, since right now much of the logic is baked into the core of the crate without much room to customize easily.
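To make the proxy point concrete, here is a minimal round-robin sketch of what rotation in a custom client might look like. The `ProxyRotator` type and its methods are hypothetical and do not exist in Spider or reqwest. A real implementation would probably also need health checks and retry handling, which are left out here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical round-robin proxy selector; not an existing Spider type.
pub struct ProxyRotator {
    proxies: Vec<String>,
    next: AtomicUsize,
}

impl ProxyRotator {
    /// Returns `None` when the list is empty, since there is nothing to rotate.
    pub fn new(proxies: Vec<String>) -> Option<Self> {
        if proxies.is_empty() {
            None
        } else {
            Some(Self {
                proxies,
                next: AtomicUsize::new(0),
            })
        }
    }

    /// Picks the next proxy in order, wrapping around at the end of the list.
    /// The atomic counter lets concurrent crawl tasks share one rotator.
    pub fn next_proxy(&self) -> &str {
        let index = self.next.fetch_add(1, Ordering::Relaxed) % self.proxies.len();
        &self.proxies[index]
    }
}
```

A client wrapper could call `next_proxy()` before each request and route that request through the returned address.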

j-mendez, Oct 22 '24 21:10

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot], Nov 22 '24 02:11

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot], Nov 27 '24 02:11