
got-scraping inefficient against Cloudflare

Open Cooya opened this issue 2 years ago • 49 comments

Recently I have encountered some changes in Cloudflare's anti-bot protection. With got-scraping, I am now unable to send requests to websites protected by Cloudflare. I have to use Puppeteer to get through.

It is mentioned as well in this comment.

Any idea how Cloudflare can be so good at detecting the TLS configuration generated by got-scraping?

Cooya avatar Apr 05 '22 09:04 Cooya

Are you using proxies? If you aren't, you are probably hitting the rate limiter, which returns a JS challenge that must be run in a real browser.

szmarczak avatar Apr 05 '22 16:04 szmarczak

Yes I am using datacenter and residential proxies. None of them work.

I think Cloudflare has reached a point where it now sends a JS challenge to every client that does not have a common JA3 fingerprint, which would explain why got-scraping is ineffective. This article confirms that hypothesis.

Unfortunately, since Firefox and Chrome use their own SSL libraries (with different ciphers), it is impossible in Node.js to mimic the JA3 fingerprints of Firefox and Chrome.
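
For context, here is a minimal sketch of how a JA3 fingerprint is derived, which shows why the cipher and extension order matters. The ClientHello values below are hypothetical and only illustrate the format; a real fingerprint would be computed from a captured handshake.

```javascript
const crypto = require('crypto');

// Build the JA3 string from already-parsed ClientHello fields.
// Each field is a list of decimal numbers; values inside a field are
// joined with "-", and the five fields are joined with ",".
function ja3String({ version, ciphers, extensions, curves, pointFormats }) {
    return [
        String(version),
        ciphers.join('-'),
        extensions.join('-'),
        curves.join('-'),
        pointFormats.join('-'),
    ].join(',');
}

// The JA3 fingerprint is the MD5 hex digest of that string.
function ja3Hash(fields) {
    return crypto.createHash('md5').update(ja3String(fields)).digest('hex');
}

// Hypothetical ClientHello values, for illustration only.
const hello = {
    version: 771, // 0x0303, the TLS 1.2 value of the ClientHello version field
    ciphers: [4865, 4866, 4867, 49195, 49199],
    extensions: [0, 23, 65281, 10, 11, 35, 16],
    curves: [29, 23, 24],
    pointFormats: [0],
};

console.log(ja3String(hello));
console.log(ja3Hash(hello));
```

Because the hash covers the exact order of ciphers and extensions, two clients offering the same ciphers in a different order produce different fingerprints.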

Cooya avatar Apr 06 '22 12:04 Cooya

I have the same issue. Testing with Postman, I saw that the order of the headers is important: we should make sure the Host header comes first. got-scraping changes the order of the headers, so I'm debugging to see whether that is the issue. I don't think this is related to JA3 fingerprinting, because we can also send POST and GET requests to Cloudflare websites using plain curl with the correct cookies.

yuriolive avatar Apr 13 '22 20:04 yuriolive

Do you have an example domain? I cannot reproduce this yet

szmarczak avatar Apr 13 '22 20:04 szmarczak

got-scraping changes the order of the headers,

Yes, it reorders them so the order is the same as the one browsers use.
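
A minimal sketch of that kind of reordering is shown below. It is not got-scraping's actual implementation; `REFERENCE_ORDER` and `orderHeaders` are hypothetical names, and the order itself is an assumption, since real browsers differ by vendor and version.

```javascript
// Hypothetical reference order, for illustration only.
const REFERENCE_ORDER = [
    'host',
    'cookie',
    'user-agent',
    'accept',
    'accept-language',
    'accept-encoding',
];

// Return a new object whose keys follow the reference order.
// Headers missing from the reference keep their relative order at the end
// (Array.prototype.sort is stable in modern Node.js).
function orderHeaders(headers, referenceOrder = REFERENCE_ORDER) {
    const rank = (name) => {
        const index = referenceOrder.indexOf(name.toLowerCase());
        return index === -1 ? referenceOrder.length : index;
    };
    const sortedEntries = Object.entries(headers)
        .sort(([a], [b]) => rank(a) - rank(b));
    return Object.fromEntries(sortedEntries);
}

const ordered = orderHeaders({
    'Accept': 'text/html',
    'X-Custom': '1',
    'User-Agent': 'Mozilla/5.0',
    'Host': 'www.g2.com',
});
console.log(Object.keys(ordered));
```

This relies on JavaScript objects preserving insertion order for string keys, which is what makes a plain object usable as an ordered header list here.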

szmarczak avatar Apr 13 '22 20:04 szmarczak

Do you have an example domain? I cannot reproduce this yet

Yes, you can try https://www.g2.com

yuriolive avatar Apr 13 '22 20:04 yuriolive

I think sortHeaders is hard-coded on this line: https://github.com/apify/got-scraping/blob/07ea3b43f06aa05e64857d60208b9522c1193e45/src/agent/transform-headers-agent.ts#L88

yuriolive avatar Apr 13 '22 20:04 yuriolive

Yes, you can try https://www.g2.com

Thanks, I was finally able to reproduce this. However, requests pass and fail seemingly at random. Fixing this now.

szmarczak avatar Apr 13 '22 21:04 szmarczak

Cloudflare protection does some crazy things. If you have the cf_clearance cookie, the order of the headers doesn't matter, but if you only have the __cf_bm cookie, the order matters. Sometimes the page sets only __cf_bm, and other times it sets both. You can check here if you want to see more about Cloudflare cookies. We also have to make sure we use the same IP and User-Agent that we used to get the cookies.
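
The observation above can be sketched as a small helper that inspects `Set-Cookie` values. The function names are hypothetical, and the `headerOrderLikelyMatters` flag only encodes what was observed in this thread, not documented Cloudflare behavior.

```javascript
// Extract cookie names from an array of Set-Cookie header values.
function cookieNames(setCookieHeaders) {
    return setCookieHeaders.map((header) => header.split(';')[0].split('=')[0].trim());
}

// Report which Cloudflare cookies were set on a response.
function describeCloudflareCookies(setCookieHeaders) {
    const names = new Set(cookieNames(setCookieHeaders));
    const hasClearance = names.has('cf_clearance');
    const hasBotManagement = names.has('__cf_bm');
    return {
        hasClearance,
        hasBotManagement,
        // Assumption taken from this thread: with only __cf_bm, header order matters.
        headerOrderLikelyMatters: hasBotManagement && !hasClearance,
    };
}

// Placeholder cookie values, for illustration only.
console.log(describeCloudflareCookies(['__cf_bm=abc; path=/; HttpOnly']));
console.log(describeCloudflareCookies([
    '__cf_bm=abc; path=/; HttpOnly',
    'cf_clearance=def; path=/; Secure',
]));
```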

yuriolive avatar Apr 13 '22 21:04 yuriolive

I couldn't make it work with Chrome values. They're using their own implementation of SSL so it may be impossible to fix in Node. However, it seems Firefox works very nicely:

Node 17 required

const http2 = require('http2');

const session = http2.connect('https://www.g2.com', {
    ciphers: [
        // Firefox v91
        'TLS_AES_128_GCM_SHA256',
        'TLS_CHACHA20_POLY1305_SHA256',
        'TLS_AES_256_GCM_SHA384',
        'ECDHE-ECDSA-AES128-GCM-SHA256',
        'ECDHE-RSA-AES128-GCM-SHA256',
        'ECDHE-ECDSA-CHACHA20-POLY1305',
        'ECDHE-RSA-CHACHA20-POLY1305',
        'ECDHE-ECDSA-AES256-GCM-SHA384',
        'ECDHE-RSA-AES256-GCM-SHA384',
        // Legacy:
        'ECDHE-ECDSA-AES256-SHA',
        'ECDHE-ECDSA-AES128-SHA',
        'ECDHE-RSA-AES128-SHA',
        'ECDHE-RSA-AES256-SHA',
        'AES128-GCM-SHA256',
        'AES256-GCM-SHA384',
        'AES128-SHA',
        'AES256-SHA',
    ].join(':'),
    ecdhCurve: [
        'X25519',
        'prime256v1',
        'secp384r1',
        'secp521r1',
        'ffdhe2048',
        'ffdhe3072',
    ].join(':'),
    signatureAlgorithms: [
        'ecdsa_secp256r1_sha256',
        'ecdsa_secp384r1_sha384',
        'ecdsa_secp521r1_sha512',
        'rsa_pss_rsae_sha256',
        'rsa_pss_rsae_sha384',
        'rsa_pss_rsae_sha512',
        'rsa_pkcs1_sha256',
        'rsa_pkcs1_sha384',
        'rsa_pkcs1_sha512',
        'ecdsa_sha1',
        'rsa_pkcs1_sha1',
    ].join(':'),
    minVersion: 'TLSv1.2',
    maxVersion: 'TLSv1.3',
    alpnProtocols: ['h2', 'http/1.1'],
    servername: 'www.g2.com',
});

const req = session.request({
    'Host': 'www.g2.com',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
}, {endStream: false});

req.on('response', headers => {
    console.log(headers[':status']);
});

req.resume();
req.end();
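
Since these TLS options depend on the OpenSSL build bundled with Node.js, it may be worth checking which cipher names the local runtime recognizes before blaming Cloudflare. The sketch below uses `tls.getCiphers()`; `checkCiphers` is a hypothetical helper, and the bogus cipher name is there only to show the unsupported branch.

```javascript
const tls = require('tls');

// Split a list of OpenSSL-style cipher names into those the local
// Node.js build reports as supported and those it does not.
function checkCiphers(cipherNames) {
    // tls.getCiphers() returns lower-case names.
    const available = new Set(tls.getCiphers());
    const supported = [];
    const unsupported = [];
    for (const name of cipherNames) {
        (available.has(name.toLowerCase()) ? supported : unsupported).push(name);
    }
    return { supported, unsupported };
}

const result = checkCiphers([
    'TLS_AES_128_GCM_SHA256',
    'ECDHE-ECDSA-CHACHA20-POLY1305',
    'AES128-SHA',
    'NOT-A-REAL-CIPHER', // deliberately bogus
]);
console.log(result);
```

If a name from the browser profile lands in `unsupported`, the resulting ClientHello may differ from the intended fingerprint.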

szmarczak avatar Apr 14 '22 00:04 szmarczak

I don't think this is related to the JA3 fingerprint. I get the cookies from my browser, which has a different fingerprint than Postman, yet I can still make the request in Postman as long as I send at least the Host, Cookie and User-Agent headers, in that order. Postman normally puts the Host header at the end, so you have to override it. Still, matching the JA3 fingerprint is a good idea; other bot protections such as DataDome and Akamai probably use it. I saw another repo that tries to simulate JA3 using Go: https://github.com/zedd3v/mytls. They have an extensive list of hashes here: https://github.com/zedd3v/mytls/blob/master/ja3.json.
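
To make the Postman experiment concrete, here is a sketch that serializes an HTTP/1.1 request head from an ordered list of pairs, so the wire order is explicit. `buildRequestHead` is a hypothetical helper and the cookie value is a placeholder.

```javascript
// Serialize an HTTP/1.1 request head from an ordered list of [name, value] pairs.
// An array is used instead of an object so the order is explicit.
function buildRequestHead(method, path, orderedHeaders) {
    const lines = [`${method} ${path} HTTP/1.1`];
    for (const [name, value] of orderedHeaders) {
        lines.push(`${name}: ${value}`);
    }
    // A blank line terminates the header section.
    return lines.join('\r\n') + '\r\n\r\n';
}

const head = buildRequestHead('GET', '/', [
    ['Host', 'www.g2.com'],
    ['Cookie', '__cf_bm=PLACEHOLDER'], // placeholder, copy the real cookie from the browser
    ['User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0'],
]);
console.log(head);
```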

yuriolive avatar Apr 14 '22 11:04 yuriolive

Managed to do Chrome:

const http2 = require('http2');

const session = http2.connect('https://www.g2.com', {
    ciphers: [
        // Chrome v92
        'TLS_AES_128_GCM_SHA256',
        'TLS_AES_256_GCM_SHA384',
        'TLS_CHACHA20_POLY1305_SHA256',
        'ECDHE-ECDSA-AES128-GCM-SHA256',
        'ECDHE-RSA-AES128-GCM-SHA256',
        'ECDHE-ECDSA-AES256-GCM-SHA384',
        'ECDHE-RSA-AES256-GCM-SHA384',
        'ECDHE-ECDSA-CHACHA20-POLY1305',
        'ECDHE-RSA-CHACHA20-POLY1305',
        // Legacy:
        'ECDHE-RSA-AES128-SHA',
        'ECDHE-RSA-AES256-SHA',
        'AES128-GCM-SHA256',
        'AES256-GCM-SHA384',
        'AES128-SHA',
        'AES256-SHA',
    ].join(':'),
    ecdhCurve: [
        'X25519',
        'prime256v1',
        'secp384r1',
    ].join(':'),
    signatureAlgorithms: [
        'ecdsa_secp256r1_sha256',
        'rsa_pss_rsae_sha256',
        'rsa_pkcs1_sha256',
        'ecdsa_secp384r1_sha384',
        'rsa_pss_rsae_sha384',
        'rsa_pkcs1_sha384',
        'rsa_pss_rsae_sha512',
        'rsa_pkcs1_sha512',
    ].join(':'),
    minVersion: 'TLSv1',
    maxVersion: 'TLSv1.3',
    alpnProtocols: ['h2', 'http/1.1'],
    servername: 'www.g2.com',
});

const req = session.request({
	":method": "GET",
	":authority": "www.g2.com",
	":scheme": "https",
	":path": "/",
	"sec-ch-ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"100\"",
	"sec-ch-ua-mobile": "?0",
	"sec-ch-ua-platform": "\"Linux\"",
	"upgrade-insecure-requests": "1",
	"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
	"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
	"sec-fetch-site": "none",
	"sec-fetch-mode": "navigate",
	"sec-fetch-user": "?1",
	"sec-fetch-dest": "document",
	"accept-encoding": "gzip, deflate, br",
	"accept-language": "en-US,en;q=0.9"
}, {endStream: false});

req.on('response', headers => {
    console.log(headers[':status']);
});

req.resume();
req.end();

szmarczak avatar Apr 17 '22 16:04 szmarczak

And... it stopped working for some reason

szmarczak avatar Apr 17 '22 16:04 szmarczak

And it works again... weird stuff is going on 🤔

szmarczak avatar Apr 17 '22 17:04 szmarczak

And it works again... weird stuff is going on 🤔

I think the g2 website is having some downtime: https://status.g2.com/. I think the GetApp website (https://www.getapp.com/) also uses Cloudflare Bot Protection.

yuriolive avatar Apr 18 '22 20:04 yuriolive

Can you try [email protected] and run your code with all dependencies updated? Make sure header-generator uses the beta.

szmarczak avatar Apr 24 '22 22:04 szmarczak

Can you try [email protected] and run your code with all dependencies updated? Make sure header-generator uses the beta.

Still not working. I get a different error now: Access denied Error code 1020. I will take a closer look later tonight.
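
For telling these outcomes apart in code, a rough heuristic might look like the sketch below. The function name is hypothetical, and the markers it checks (the `server: cloudflare` header, the `cf-mitigated` header, the "Just a moment" page title, the 1020 text) are assumptions about what a blocked response looks like, not a documented contract.

```javascript
// Heuristically classify a response as a Cloudflare block, challenge, or pass.
// `headers` is expected to use lower-case names, as Node.js provides them.
function classifyCloudflareResponse({ statusCode, headers = {}, body = '' }) {
    const server = String(headers.server || '').toLowerCase();
    if (!server.includes('cloudflare')) {
        return 'not-cloudflare';
    }
    // Firewall rule block, as reported in this thread.
    if (/error code:? 1020/i.test(body)) {
        return 'access-denied-1020';
    }
    const mitigated = String(headers['cf-mitigated'] || '').toLowerCase();
    const challengeStatus = statusCode === 403 || statusCode === 503;
    if (mitigated === 'challenge' || (challengeStatus && /just a moment/i.test(body))) {
        return 'challenge';
    }
    if (statusCode === 403) {
        return 'blocked';
    }
    return 'ok';
}

console.log(classifyCloudflareResponse({
    statusCode: 403,
    headers: { server: 'cloudflare' },
    body: 'Access denied Error code 1020',
}));
```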

yuriolive avatar Apr 25 '22 15:04 yuriolive

Can you try Firefox? You'd need to pass

    headerGeneratorOptions: {
        browsers: [
            "firefox",
        ],
    },

in the options.
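
Put together, the options might be built as in the sketch below. `buildFirefoxOptions` is a hypothetical helper; the optional `proxyUrl` value is an assumption for illustration, and the actual request is left in comments because it needs got-scraping installed and network access.

```javascript
// Build got-scraping options that ask header-generator for Firefox headers.
// Kept as a separate function so the options can be inspected without sending a request.
function buildFirefoxOptions(url, proxyUrl) {
    const options = {
        url,
        headerGeneratorOptions: {
            browsers: ['firefox'],
        },
    };
    if (proxyUrl) {
        options.proxyUrl = proxyUrl; // optional, only set when a proxy is used
    }
    return options;
}

const options = buildFirefoxOptions('https://www.g2.com');
console.log(options);

// With got-scraping installed, the request itself would look roughly like this
// (newer releases may require `import` instead of `require`):
// const { gotScraping } = require('got-scraping');
// gotScraping(options).then((response) => console.log(response.statusCode));
```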

szmarczak avatar May 08 '22 09:05 szmarczak

Can you try Firefox? You'd need to pass

    headerGeneratorOptions: {
        browsers: [
            "firefox",
        ],
    },

in the options.

It is now working, even without JS rendering to get the cookies. I think the TLS fingerprint of the Chrome version I was using in the application was different, and that was probably causing the issue.

yuriolive avatar May 10 '22 20:05 yuriolive

If you didn't explicitly specify a Chrome version, it should just work out of the box. So to recap: does it work with the Firefox fingerprint but not with the Chrome fingerprint?

szmarczak avatar May 11 '22 23:05 szmarczak

@szmarczak It stopped working again, although it was working locally. I still don't know what Cloudflare uses for detection; I think it is more than just TLS and headers.

yuriolive avatar May 12 '22 17:05 yuriolive

We still could get detected. The current fingerprint is not a 1:1 match, but a very close one.

For example, Chrome uses BoringSSL while Node.js uses OpenSSL. Our TLS fingerprint has improved recently, but I think we've reached the limit of what the native tls module can do.

However I believe this could be worked around via NAPI.

The headers are what we can definitely keep improving. Sometimes header-generator generates a fingerprint matching old browsers, and that needs fixing. It is also missing the sec-ch-ua-platform header.

There's also a chance that they're fingerprinting the HTTP/2 session and/or stream settings, but that's very unlikely.
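
For reference, Node.js exposes the HTTP/2 session settings that such a fingerprint would look at. The sketch below serializes a SETTINGS payload with `http2.getPackedSettings()`; the values are illustrative assumptions, not a verified browser profile.

```javascript
const http2 = require('http2');

// Illustrative SETTINGS values; treat them as assumptions.
const settings = {
    headerTableSize: 65536,
    enablePush: false,
    initialWindowSize: 6291456,
    maxHeaderListSize: 262144,
};

// Serialize the settings as they would appear in a SETTINGS frame payload,
// then parse them back to confirm the round trip.
const packed = http2.getPackedSettings(settings);
const unpacked = http2.getUnpackedSettings(packed);
console.log(packed.toString('hex'));
console.log(unpacked);

// The same object can be passed when opening a session:
// const session = http2.connect('https://www.g2.com', { settings });
```

Only the values are configurable this way, so this alone would not make the session indistinguishable from a browser's.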

Another reason it can pass locally is that the local IP address has a higher trust score, so Cloudflare is more forgiving.

I'll keep testing and will give an update tomorrow.

szmarczak avatar May 12 '22 19:05 szmarczak

I've tested the two websites mentioned above (with proxy on) and couldn't reproduce the issue. Can you post the options used with got-scraping? Have you used cookies?

I only got a Cloudflare challenge when I visited g2 with a real browser (got-scraping did just fine, no block 🤔). Interestingly, on Firefox I got a JS challenge, while on Chrome I was hit with an hCaptcha.

Edit: Changing IP didn't help when using real browsers.

Edit 2: I changed my UA to Windows and the block was gone.

szmarczak avatar May 14 '22 16:05 szmarczak

@Cooya do you still experience blocks with the newest version?

szmarczak avatar May 14 '22 16:05 szmarczak

As I said previously, got-scraping is ineffective in my case (https://gensdeconfiance.com). Changing the UA will not fix anything, as Cloudflare relies on the JA3 signature (on this website, anyway).

I am now using a Go server to send my requests, which works much better.

Cooya avatar May 17 '22 11:05 Cooya

As I said previously

So you haven't tried the new version?

szmarczak avatar May 17 '22 15:05 szmarczak

I don't have any luck on duelbits.com, for instance: I get a 403 error code, but with Puppeteer the request goes through.

l10r avatar May 21 '22 22:05 l10r

Thanks for feedback @l10r, looking into it.

szmarczak avatar May 22 '22 16:05 szmarczak

Thanks for feedback @l10r, looking into it.

I was testing with http2 using only the ciphers and other TLS options mentioned above. After testing with gotScraping, it works just fine and I get the response. Sorry, I forgot to update my progress. Thanks a lot!

l10r avatar May 22 '22 16:05 l10r

https://github.com/lwthiker/curl-impersonate ^ This could help you a lot, @szmarczak, as it seems to have a high success rate.

l10r avatar May 26 '22 12:05 l10r