spider
Use custom request library
Reqwest is a great library, but to get around certain anti-bot/CDN software, it is necessary to use a library that impersonates a web browser's TLS fingerprint, like https://github.com/4JX/reqwest-impersonate or https://github.com/penumbra-x/rquest. Both libraries also generate the user-agent and other headers.
It would be great if we could use a custom library to make requests. Since both libraries expose the same Response struct from reqwest, it should hopefully be easy to swap the implementations, but this probably depends on how much other reqwest-specific code there is in Spider. It may be worth creating a few traits for client, request, and response interfaces that multiple libraries can implement.
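To make the trait idea concrete, here is a minimal sketch of what such interfaces might look like. Every name in it (`HttpClient`, `HttpResponse`, `MockClient`, `CannedResponse`, `fetch_body`) is hypothetical and not part of Spider, reqwest, reqwest-impersonate, or rquest. It is also synchronous and uses only the standard library to stay self-contained; a real version would almost certainly be async and would wrap the actual client types.

```rust
use std::collections::HashMap;

/// Hypothetical response interface: the minimum a crawler needs from any backend.
pub trait HttpResponse {
    fn status(&self) -> u16;
    fn body(&self) -> &str;
}

/// Hypothetical client interface that an adapter for each request library
/// could implement.
pub trait HttpClient {
    type Response: HttpResponse;
    type Error: std::fmt::Debug;

    fn get(&self, url: &str) -> Result<Self::Response, Self::Error>;
}

/// In-memory stand-in for a real backend, used only for illustration.
pub struct CannedResponse {
    status: u16,
    body: String,
}

impl HttpResponse for CannedResponse {
    fn status(&self) -> u16 {
        self.status
    }

    fn body(&self) -> &str {
        &self.body
    }
}

pub struct MockClient {
    pages: HashMap<String, String>,
}

impl MockClient {
    pub fn new(pages: &[(&str, &str)]) -> Self {
        let pages = pages
            .iter()
            .map(|(url, body)| (url.to_string(), body.to_string()))
            .collect();
        Self { pages }
    }
}

impl HttpClient for MockClient {
    type Response = CannedResponse;
    type Error = String;

    fn get(&self, url: &str) -> Result<Self::Response, Self::Error> {
        // Known URLs return 200 with the stored body, everything else a 404.
        match self.pages.get(url) {
            Some(body) => Ok(CannedResponse { status: 200, body: body.clone() }),
            None => Ok(CannedResponse { status: 404, body: String::new() }),
        }
    }
}

/// Crawler-side code depends only on the traits, so the backend can be swapped.
pub fn fetch_body<C: HttpClient>(client: &C, url: &str) -> Result<Option<String>, C::Error> {
    let resp = client.get(url)?;
    Ok(if resp.status() == 200 {
        Some(resp.body().to_string())
    } else {
        None
    })
}
```

With this shape, `fetch_body` never names a concrete library, so swapping the backend would mean writing one adapter per library rather than touching the crawling logic. How well that works in practice depends on how much reqwest-specific code Spider already has.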
Hi @alexkreidler, I like this idea. We already do some of the custom header and user-agent handling in the crate. It would be great to make a new custom request library to isolate these changes even more. One of the reasons this could help is with proxies: the original reqwest crate does not perform any rotation, even if you pass in multiple proxies. With the new client we can add custom logic to handle all of the edge cases for web crawling. This will also help with one-off requests for single pages, since right now a lot of the logic is baked into the core of the crate without much room to customize easily.
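As an illustration of the proxy rotation mentioned above, a round-robin selector could look roughly like the sketch below. `ProxyRotator` and `next_proxy` are hypothetical names, not existing Spider or reqwest APIs, and the sketch only picks which proxy URL to use next; wiring the choice into a real client is left out.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical round-robin proxy selector that a custom client could consult
/// before each request.
pub struct ProxyRotator {
    proxies: Vec<String>,
    next: AtomicUsize,
}

impl ProxyRotator {
    pub fn new<I, S>(proxies: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        Self {
            proxies: proxies.into_iter().map(Into::into).collect(),
            next: AtomicUsize::new(0),
        }
    }

    /// Returns the next proxy URL in rotation, or `None` if no proxies are configured.
    pub fn next_proxy(&self) -> Option<&str> {
        if self.proxies.is_empty() {
            return None;
        }
        // The atomic counter lets concurrent crawl tasks share one rotator
        // through a shared reference.
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.proxies.len();
        Some(self.proxies[i].as_str())
    }
}
```

Plain round-robin is the simplest policy. A crawler-specific client might instead want to skip proxies that recently failed or were blocked, which this sketch does not attempt.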
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.