crawlee-python

Implement fingerprinting

Open vdusek opened this issue 1 year ago • 3 comments

Coordinate with @barjin before implementing anything.

There is a possibility of developing a dedicated fingerprinting library (in Rust?). In that case, we would only need a thin wrapper in the Python tooling (and the same in JavaScript).

vdusek avatar Mar 26 '24 15:03 vdusek

Just to make things clear - there are two different initiatives regarding "stealth" scraping:

  • fingerprint-suite (GitHub): a set of libraries for generating realistic HTTP header sets (making sure the User-Agent matches the OS, etc.) and injecting them into browsers / HTTP clients.

This already works, and there is little to no work left on the JS side (aside from maintenance). Unfortunately, it is all written in JavaScript, so it would have to be completely rewritten in Python if we want to do the same thing.

  • An HTTP client in Rust - this should be an alternative to requests (in Python) and axios/fetch/... (in JavaScript). The standard HTTP clients in most languages exhibit easily recognizable behavior (using certain TLS ciphers, sending specific headers, etc.) that we cannot change. Therefore - Rust (you can play with the TLS stack more there).

^--- This, we don't have anywhere.
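To illustrate the first bullet, here is a minimal Python sketch of generating a header set whose User-Agent and platform hint agree with each other. It is not how fingerprint-suite works internally; the `PROFILES` table and the `generate_headers` function are hypothetical names invented for this example, and a real library would derive its profiles from observed browser traffic rather than a hand-written list.

```python
import random

# Hypothetical, hand-written profiles (an assumption for illustration only).
# Each profile keeps the User-Agent and the platform hint consistent.
PROFILES = [
    {
        "os": "windows",
        "user_agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
        ),
        "platform_hint": '"Windows"',
    },
    {
        "os": "macos",
        "user_agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
        ),
        "platform_hint": '"macOS"',
    },
    {
        "os": "linux",
        "user_agent": (
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
        ),
        "platform_hint": '"Linux"',
    },
]


def generate_headers(os_name=None, rng=None):
    """Pick one profile and return headers that all come from that profile."""
    rng = rng or random.Random()
    candidates = [p for p in PROFILES if os_name is None or p["os"] == os_name]
    if not candidates:
        raise ValueError(f"no profile for OS {os_name!r}")
    profile = rng.choice(candidates)
    return {
        "User-Agent": profile["user_agent"],
        "Sec-CH-UA-Platform": profile["platform_hint"],
        "Accept-Language": "en-US,en;q=0.9",
    }


if __name__ == "__main__":
    for name, value in generate_headers("macos").items():
        print(f"{name}: {value}")
```

The point of the design is that all headers are drawn from a single profile, so a macOS User-Agent is never paired with a Windows platform hint.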

barjin avatar Mar 27 '24 10:03 barjin

Maturin could be useful.

Build and publish crates with pyo3, cffi and uniffi bindings as well as rust binaries as python packages

vdusek avatar Apr 14 '24 07:04 vdusek

Hey

This may be useful to you.

There is a project, https://github.com/FlorianREGAZ/Python-Tls-Client, which is a Python wrapper around the Go library https://github.com/bogdanfinn/tls-client.

The disadvantage of Python-Tls-Client is that it has no async support, so it is not suitable for crawlee-python at this stage.

But you might consider implementing your own wrapper for https://github.com/bogdanfinn/tls-client.
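For context on the async limitation, one common stopgap is to run a blocking client call in a worker thread so that it does not stall the event loop. The sketch below shows only that general pattern with the standard library; `blocking_get` is a hypothetical stand-in that performs no network I/O and is not the API of either project above.

```python
import asyncio
import time


def blocking_get(url: str) -> str:
    """Hypothetical stand-in for a synchronous HTTP call (no real network I/O)."""
    time.sleep(0.05)  # simulate a blocking request
    return f"response for {url}"


async def async_get(url: str) -> str:
    # asyncio.to_thread (Python 3.9+) runs the blocking function in a worker
    # thread, so the event loop stays free while the call is in progress.
    return await asyncio.to_thread(blocking_get, url)


async def fetch_all(urls):
    # gather() returns the results in the same order as the input URLs.
    return await asyncio.gather(*(async_get(u) for u in urls))


if __name__ == "__main__":
    pages = asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"]))
    for page in pages:
        print(page)
```

This keeps the event loop responsive, but each in-flight request still occupies a thread, so it is a workaround rather than a replacement for a natively asynchronous client.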

Mantisus avatar Jun 25 '24 04:06 Mantisus

Closing this one, as its content has been split into several smaller issues: #292, #401, #402.

vdusek avatar Aug 30 '24 12:08 vdusek