
On the same website, Playwright succeeded but PlaywrightCrawler failed with error: page.goto: net::ERR_EMPTY_RESPONSE

jeff1998-git opened this issue 7 months ago • 5 comments

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/playwright (PlaywrightCrawler)

Issue description

Run the code. Console output (the plain Playwright call succeeds):

INFO PlaywrightCrawler: Starting the crawler.
use Playwright message: 湖北省中医院

Crawlee fails:

WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.goto: net::ERR_EMPTY_RESPONSE at https://www.hbhtcm.com/ Call log:

  • navigating to "https://www.hbhtcm.com/", waiting until "load" {"id":"dcHa1anqjn3kF91","url":"https://www.hbhtcm.com","retryCount":1}

ERROR PlaywrightCrawler: Request failed and reached maximum retries. page.goto: net::ERR_EMPTY_RESPONSE at https://www.hbhtcm.com/ Call log:

  • navigating to "https://www.hbhtcm.com/", waiting until "load"

    at gotoExtended (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@crawlee\playwright\internals\utils\playwright-utils.js:165:17)
    at PlaywrightCrawler._navigationHandler (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@crawlee\playwright\internals\playwright-crawler.js:117:52)
    at PlaywrightCrawler._handleNavigation (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@crawlee\browser\internals\browser-crawler.js:331:52)
    at async PlaywrightCrawler._runRequestHandler (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@crawlee\browser\internals\browser-crawler.js:260:13)
    at async PlaywrightCrawler._runRequestHandler (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@crawlee\playwright\internals\playwright-crawler.js:114:9)
    at async wrap (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@apify\timeout\cjs\index.cjs:54:21)
    {"id":"dcHa1anqjn3kF91","url":"https://www.hbhtcm.com","method":"GET","uniqueKey":"https://www.hbhtcm.com"}

So: on the same website, Playwright succeeds but PlaywrightCrawler fails with page.goto: net::ERR_EMPTY_RESPONSE, which is very strange. Puppeteer also loads this site successfully. And if I start Fiddler and proxy through 127.0.0.1:8888, PlaywrightCrawler succeeds again and the captured traffic looks normal. This is a very strange phenomenon, and I would like to know the reason.

Code sample

import { PlaywrightCrawler } from 'crawlee';
import * as playwright from 'playwright';

const url = 'https://www.hbhtcm.com';    // fails
// const url = 'https://www.baidu.com';  // succeeds

// 1. use crawlee
const crawleeTest = new PlaywrightCrawler({

    useSessionPool: false,
    navigationTimeoutSecs: 120,
    requestHandlerTimeoutSecs: 300,
    maxRequestsPerCrawl: 50,

    launchContext: {
        // `launcher` belongs on launchContext (not inside launchOptions); plain playwright.chromium is used here
        launcher: playwright.chromium,
        launchOptions: {
            // Other playwright options work as usual
            headless: true,
            channel: 'chrome',
            args: [
                '--disable-http2', // disable HTTP/2, fall back to HTTP/1.1

                '--no-sandbox',
                '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
                '--disable-blink-features=AutomationControlled',
                "--ignore-certificate-errors",
                "--ignore-certificate-errors-spki-list",

            ]
        },
    },

    // ------------------------------------------
    preNavigationHooks: [
        async ({ page }) => {
            // set headers
            await page.setExtraHTTPHeaders({
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0',
                'sec-ch-ua': '"Microsoft Edge";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
                'sec-ch-ua-mobile': '?0',
                'sec-ch-ua-platform': '"Windows"',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
                'Accept-Encoding': 'gzip, deflate, br, zstd',
                'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
                'Cache-Control': 'max-age=0',
                'Connection': 'keep-alive',
                'Sec-Fetch-Dest': 'document',
                'Sec-Fetch-Mode': 'navigate',
                'Sec-Fetch-Site': 'none',
                'Sec-Fetch-User': '?1',
                'Upgrade-Insecure-Requests': '1',
            });

            // set Cookie
            // await page.context().addCookies([
            //     { name: 'view', value: '1748198195', domain: 'www.hbhtcm.com', path: '/' },
            //     { name: 'PHPSESSID', value: 'aa5m36c3b87m38ofumijqilbvm', domain: 'www.hbhtcm.com', path: '/' },
            // ]);
        },
    ],
    // --------------------------------------------

    async requestHandler({ request, page, enqueueLinks, log }) {
        log.info(`Processing ${request.url}...`);
        const title = await page.title();
        console.log("use PlaywrightCrawler message:", title);
    },

    // This function is called if the page processing failed more than maxRequestRetries+1 times.
    failedRequestHandler({ request, log }) {
        log.info(`Request ${request.url} failed too many times.`);
    },
});

await crawleeTest.addRequests([url]);

// ==================================================================
// 2. use Playwright 
const playwrightTest = async () => {

    const browser = await playwright.chromium.launch({
        headless: true,
    });

    const page = await browser.newPage();
    await page.goto(url);
    console.log('use Playwright message:', await page.title());
    await browser.close();
};

// Running results:
await Promise.all([crawleeTest.run(), playwrightTest()]);

Package version

[email protected] D:\crawleeRunVersion\notice_project\crawlee1.1.2 ├── @crawlee/[email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] └── [email protected]

Node.js version

Node v22.12.0

Operating system

windows

Apify platform

  • [ ] Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

jeff1998-git · May 28 '25 15:05

Hello, and thank you for your interest in this project!

It seems that the network proxying layer in Crawlee (proxy-chain) exhibits different behaviour than a regular browser while loading the page (https://www.hbhtcm.com).

import { chromium } from 'playwright';
import { Server as ProxyChainServer } from 'proxy-chain';

const server = new ProxyChainServer({
    port: 0,
});
await server.listen();

const url = 'https://www.hbhtcm.com';

const browser = await chromium.launch({
    headless: false,
    proxy: {                                       // comment me out to make this "work"
        server: `http://127.0.0.1:${server.port}`, // comment me out to make this "work"
    }                                              // comment me out to make this "work"
});

const page = await browser.newPage();
await page.goto(url); // Fails with `page.goto: net::ERR_EMPTY_RESPONSE at https://www.hbhtcm.com/`
await browser.close();

By the way, locally I didn't manage to load the page even without the proxy (vanilla Playwright); the page.goto call timed out for me. Can you confirm that your Playwright instance can load the page contents (without Crawlee)?

This might be just those two libraries behaving differently on an empty HTTP response - while proxy-chain fails immediately, the browser might try to wait for the response for longer. If this is the case, it's IMO wontfix, as both behaviours seem reasonable.

barjin · May 29 '25 10:05

Hello, thank you for your reply. I reproduced your code: without the proxy it succeeds, and with the proxy it fails. However, if I start the Fiddler application first and change the proxy port to 8888 (Fiddler's working port), the proxied code runs successfully.

With the PlaywrightCrawler module, access succeeds when Fiddler is used as the proxy; without the Fiddler proxy, it fails.

I don't know where the problem lies

const playwright = require('playwright');
const { Server: ProxyChainServer } = require('proxy-chain');

(async () => {

// create the proxy server
const server = new ProxyChainServer({
    port: 0,
    prepareRequestFunction: ({ request }) => {
        return {
            upstreamProxyUrl: null,
        };
    },
});

await server.listen();
console.log(`proxy running,port: ${server.port}`);

const url = 'https://www.hbhtcm.com';

try {
    // 1、 run browser and use proxy
    const browser = await playwright.chromium.launch({
        headless: false,
        proxy: {
            server: `http://127.0.0.1:${server.port}`,
        },
    });

    const page = await browser.newPage();

    console.log(`page accessing: ${url}`);

    await page.goto(url, { timeout: 60000 });  // Fails with `page.goto: net::ERR_EMPTY_RESPONSE at https://www.hbhtcm.com/`

    console.log('loading page is success...');

    await page.waitForTimeout(5000);

    await browser.close();
} catch (error) {
    console.error(`Use proxy Navigation failed: ${error.message}`);

    // 2、try again --- run browser and do not use proxy 
    // Attempt to access without using a proxy and determine if it is a proxy issue
    try {
        console.log('try no proxy...');
        const directBrowser = await playwright.chromium.launch({ headless: false });
        const directPage = await directBrowser.newPage();
        await directPage.goto(url, { timeout: 60000 });
        console.log('Do not use proxy server...is success...');   // output: Do not use a proxy server...is success...
        await directBrowser.close();
    } catch (directError) {
        console.error(`Do not use proxy server..is fail...Too.: ${directError.message}`);
        console.error('What is the problem....');
    }
} finally {

    await server.close();
    console.log('proxy server close');
}

})();

jeff1998-git · May 29 '25 14:05

Again, using the same code as above, I tested more websites. Results:

// const url = 'https://www.google.com';     // use proxy is fail    && no proxy is success  ||  Fiddler port->8888 success
// const url = 'https://www.youtube.com/';   // use proxy is fail    && no proxy is success  ||  Fiddler port->8888 success
// const url = 'https://www.baidu.com/';     // use proxy is success && no proxy is success  ||  Fiddler port->8888 success
// const url = 'https://www.163.com/'        // use proxy is success && no proxy is success  ||  Fiddler port->8888 success
// const url = 'https://www.bing.com/';      // use proxy is success && no proxy is success  ||  Fiddler port->8888 success
// const url = 'https://www.hbhtcm.com';     // use proxy is fail    && no proxy is success  ||  Fiddler port->8888 success

// Fiddler listens on port 8888
// (the Fiddler application is listening on port 8888)
const server = new ProxyChainServer({
    port: 8888,
    prepareRequestFunction: ({ request }) => {
        return {
            upstreamProxyUrl: null,
        };
    },
});

With plain Playwright, Fiddler can capture the data packets, but when the proxy-chain module is used, Fiddler cannot capture them. When the PlaywrightCrawler module is used, Fiddler cannot capture the packets either. I suspect the PlaywrightCrawler module wraps proxy-chain, which is why Fiddler cannot see PlaywrightCrawler's traffic.

It can be inferred that this kind of proxy misses something when processing certain responses, causing page.goto: net::ERR_EMPTY_RESPONSE. However, when the proxy port is pointed at Fiddler's working port 8888 and the traffic passes through Fiddler, Fiddler fills in and handles whatever was missing, and everything works normally again.

Although I don't know exactly what processing the proxy module performs, these tests show that whatever proxy-chain fails to handle correctly becomes normal again once the traffic goes through Fiddler.

It may also be due to differences in the network. If PlaywrightCrawler's data packets could be captured through an intermediary, it should be possible to discover something.

jeff1998-git · May 29 '25 17:05

Thank you for your extensive experiments!

I can access all of the servers using proxy-chain, except for www.hbhtcm.com.

However, I looked more into this and found the likely culprit.

The proxy < === > target server socket created in proxy-chain utilizes the "happy eyeballs" algorithm for IP family autodetection (see the autoSelectFamily param here). This means that Node gets all the possible A and AAAA DNS records for the given hostname and starts trying to connect to the addresses one by one. If the connection is not established in autoSelectFamilyAttemptTimeout milliseconds, Node will try the next available address.
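To illustrate the mechanism, here is a minimal sketch using Node's net module directly (not proxy-chain itself); the option names are Node's own, and the target host is the one from this issue:

import net from 'node:net';

// "Happy eyeballs": Node resolves all A/AAAA records for the host and tries
// the addresses one by one, giving each attempt 250 ms by default.
const socket = net.connect({
    host: 'www.hbhtcm.com',
    port: 443,
    autoSelectFamily: true,              // enable the IP family autodetection
    autoSelectFamilyAttemptTimeout: 250, // per-address attempt timeout (the default)
});

socket.on('connect', () => {
    console.log('connected to', socket.remoteAddress);
    socket.end();
});
socket.on('error', (err) => console.error('connection failed:', err.message));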

You can see this if you set the NODE_DEBUG=net environment variable for your Node process with the proxy server:

(Screenshot: NODE_DEBUG=net output from the proxy process, showing the connection attempts.)

You can see that the IPv6 address is completely unreachable for me, likely due to some geographical blocking. The IPv4 address seems to be reachable, but the connection times out in 250 ms (default value for autoSelectFamilyAttemptTimeout).

If you use proxy-chain as an intermediary proxy server (by e.g. setting the upstreamProxyUrl to your Fiddler instance), Node only redirects the stream of data to the next proxy (Fiddler), and it doesn't directly connect to the target server. Because Node knows the other proxy server by IP address, it doesn't utilize the autoSelectFamily algorithm, and the connection works.
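For illustration, a rough sketch of that intermediary setup with proxy-chain, assuming Fiddler is listening locally on 127.0.0.1:8888:

import { Server as ProxyChainServer } from 'proxy-chain';

// Forward every request to an upstream proxy (Fiddler) instead of dialing the
// target server directly, so Node never runs the family autodetection against
// the target host.
const server = new ProxyChainServer({
    port: 0,
    prepareRequestFunction: () => ({
        upstreamProxyUrl: 'http://127.0.0.1:8888', // Fiddler's listening port (assumption)
    }),
});
await server.listen();
console.log(`intermediary proxy listening on port ${server.port}`);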

If you can access these servers without proxy-chain locally, I would suggest experimenting with Node's --network-family-autoselection-attempt-timeout option. By launching your process as e.g. node --network-family-autoselection-attempt-timeout=10000 script.js, you allow Node.JS to take more time (10000 milliseconds) while connecting to the server, which might align your setup with the actual browser's behaviour.

Alternatively, you can turn this behaviour off by using the --no-network-family-autoselection Node option.
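If changing the Node launch command is inconvenient, the same defaults can, as far as I know, also be adjusted programmatically before the proxy server is created; a rough sketch:

import net from 'node:net';

// Give each address-family attempt up to 10 seconds instead of the 250 ms default...
net.setDefaultAutoSelectFamilyAttemptTimeout(10000);

// ...or opt out of the family autoselection entirely.
net.setDefaultAutoSelectFamily(false);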

Please try these suggested fixes and let me know if any of them helped. Cheers!

barjin · May 30 '25 09:05

Using the initial test code, I added network debug listening:

With proxy-chain:

[NetworkDebug]begin...
ProxyServer[10197]: Listening...
proxy running,port: 10197
ProxyServer[10197]: 0 | !!! Handling GET http://clients2.google.com/time/1/current?cup2key=9:ZlmC7nz0VXWABlwaWlQYo6vzvVxUEomLrNfDv0-215o&cup2hreq=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 HTTP/1.1
ProxyServer[10197]: 0 | Using forward()
[REQUEST] GET http://undefinedundefined
ProxyServer[10197]: 1 | !!! Handling CONNECT redirector.gvt1.com:443 HTTP/1.1
ProxyServer[10197]: 1 | Using direct() => redirector.gvt1.com:443
NET 1: createConnection redirector.gvt1.com:443
NET 1: connect: find host redirector.gvt1.com
NET 1: connect: dns options { family: undefined, hints: 0 }
NET 1: connect: autodetecting
NET 1: _read - n 16384 isConnecting? true hasHandle? true
NET 1: _read wait for connection
NET 1: dns lookup result: 2401:3800:4002:804::1001
NET 1: afterConnect
NET 1: connect: attempting to connect to redirector.gvt1.com:443 (addressType: IPv4)
NET 1: _read - n 16384 isConnecting? false hasHandle? true
NET 1: _read - n 16384 isConnecting? false hasHandle? true
NET 1: _read - n 16384 isConnecting? false hasHandle? true
NET 1: _read - n 16384 isConnecting? false hasHandle? true
NET 1: _read - n 16384 isConnecting? false hasHandle? true
NET 1: _read - n 16384 isConnecting? false hasHandle? true
ProxyServer[10197]: 2 | !!! Handling CONNECT accounts.google.com:443 HTTP/1.1
ProxyServer[10197]: 2 | Using direct() => accounts.google.com:443
NET 2: createConnection accounts.google.com:443
NET 2: connect: find host accounts.google.com
NET 2: connect: dns options { family: undefined, hints: 0 }
NET 2: connect: autodetecting
NET 2: _read - n 16384 isConnecting? true hasHandle? true
NET 2: _read wait for connection
NET 2: dns lookup result: 46.82.174.69
page accessing: https://www.hbhtcm.com
ProxyServer[10197]: 3 | !!! Handling CONNECT www.hbhtcm.com:443 HTTP/1.1
ProxyServer[10197]: 3 | Using direct() => www.hbhtcm.com:443
NET 3: createConnection www.hbhtcm.com:443
NET 3: connect: find host www.hbhtcm.com
NET 3: connect: dns options { family: undefined, hints: 0 }
NET 3: connect: autodetecting
NET 3: _read - n 16384 isConnecting? true hasHandle? true
NET 3: _read wait for connection
[PLAYWRIGHT REQUEST] https://www.hbhtcm.com/
NET 3: dns lookup result: 240e:668:b03::3
NET 1: _read - n 16384 isConnecting? false hasHandle? true
NET 1: _read - n 16384 isConnecting? false hasHandle? true
ProxyServer[10197]: 4 | !!! Handling CONNECT r9---sn-ni57dn76.gvt1-cn.com:443 HTTP/1.1
ProxyServer[10197]: 4 | Using direct() => r9---sn-ni57dn76.gvt1-cn.com:443
NET 4: createConnection r9---sn-ni57dn76.gvt1-cn.com:443
NET 4: connect: find host r9---sn-ni57dn76.gvt1-cn.com
NET 4: connect: dns options { family: undefined, hints: 0 }
NET 4: connect: autodetecting
NET 4: _read - n 16384 isConnecting? true hasHandle? true
NET 4: _read wait for connection
NET 4: dns lookup result: 2401:3800:4002:8::1b
NET 4: afterConnect
NET 4: connect: attempting to connect to r9---sn-ni57dn76.gvt1-cn.com:443 (addressType: IPv4)
NET 4: _read - n 16384 isConnecting? false hasHandle? true
NET 4: _read - n 16384 isConnecting? false hasHandle? true
NET 4: _read - n 16384 isConnecting? false hasHandle? true
NET 4: _read - n 16384 isConnecting? false hasHandle? true
ProxyServer[10197]: 5 | !!! Handling CONNECT dl.google.com:443 HTTP/1.1
ProxyServer[10197]: 5 | Using direct() => dl.google.com:443
NET 5: createConnection dl.google.com:443
NET 5: connect: find host dl.google.com
NET 5: connect: dns options { family: undefined, hints: 0 }
NET 5: connect: autodetecting
NET 5: _read - n 16384 isConnecting? true hasHandle? true
NET 5: _read wait for connection
NET 5: dns lookup result: 120.253.253.97
NET 4: onclose
NET 5: afterConnect
NET 5: connect: attempting to connect to dl.google.com:443 (addressType: IPv4)
NET 5: _read - n 16384 isConnecting? false hasHandle? true
NET 5: _read - n 16384 isConnecting? false hasHandle? true
NET 5: _read - n 16384 isConnecting? false hasHandle? true
NET 5: _read - n 16384 isConnecting? false hasHandle? true
NET 5: _read - n 16384 isConnecting? false hasHandle? true
NET 5: _read - n 16384 isConnecting? false hasHandle? true
NET 2: onerror connect ETIMEDOUT 46.82.174.69:443
ProxyServer[10197]: 2 | Direct Destination Socket Error: Error: connect ETIMEDOUT 46.82.174.69:443
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1615:16)
    at TCPConnectWrap.callbackTrampoline (node:internal/async_hooks:130:17)
NET 2: onclose
NET 3: onerror connect ETIMEDOUT 240e:668:b03::3:443
ProxyServer[10197]: 3 | Direct Destination Socket Error: Error: connect ETIMEDOUT 240e:668:b03::3:443
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1615:16)
    at TCPConnectWrap.callbackTrampoline (node:internal/async_hooks:130:17)
NET 3: onclose
Request failed: https://www.hbhtcm.com/ { errorText: 'net::ERR_EMPTY_RESPONSE' }
Use proxy Navigation failed: page.goto: net::ERR_EMPTY_RESPONSE at https://www.hbhtcm.com/ Call log:

  • navigating to "https://www.hbhtcm.com/", waiting until "load"

Without proxy-chain:

ProxyServer[10197]: 6 | !!! Handling CONNECT accounts.google.com:443 HTTP/1.1
ProxyServer[10197]: 6 | Using direct() => accounts.google.com:443
NET 6: createConnection accounts.google.com:443
NET 6: connect: find host accounts.google.com
NET 6: connect: dns options { family: undefined, hints: 0 }
NET 6: connect: autodetecting
NET 6: _read - n 16384 isConnecting? true hasHandle? true
NET 6: _read wait for connection
NET 6: dns lookup result: 46.82.174.69
[DIRECT RESPONSE] 200 https://www.hbhtcm.com/ from 221.232.157.76:443
[DIRECT RESPONSE] 200 https://www.hbhtcm.com/theme/default/css/jquery.mmenu.all.css from 221.232.157.76:443
[DIRECT RESPONSE] 200 https://www.hbhtcm.com/theme/default/css/swiper-3.4.0.min.css from 221.232.157.76:443
[DIRECT RESPONSE] 200 https://www.hbhtcm.com/theme/default/css/common.css from 221.232.157.76:443
[DIRECT RESPONSE] 200 https://www.hbhtcm.com/theme/default/js/jquery-1.11.3.min.js from 221.232.157.76:443
[DIRECT RESPONSE] 200 https://www.hbhtcm.com/theme/default/js/swiper-3.4.0.jquery.min.js from 221.232.157.76:443
[DIRECT RESPONSE] 200 https://www.hbhtcm.com/theme/default/js/jquery.mmenu.min.all.js from 221.232.157.76:443
.....
[NetworkDebug]end...

After testing, the target website hbhtcm.com is only reachable over IPv4 and does not support IPv6, which is a common situation in practice.

But the test results show that in the proxy-chain connection only the IPv6 address was tried, and there was no correct fallback to IPv4. In the proxy-chain connection, "happy eyeballs" did not take effect.
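For reference, which address families the hostname resolves to can be checked with Node's dns module; a small sketch (the hostname is the one from this issue):

import { lookup } from 'node:dns/promises';

// List every A (IPv4) and AAAA (IPv6) record returned for the host, to see
// whether an unreachable IPv6 address is offered alongside the IPv4 one.
const records = await lookup('www.hbhtcm.com', { all: true });
for (const { address, family } of records) {
    console.log(`IPv${family}: ${address}`);
}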

In the proxy-chain source code I see the parameter ipFamily?: number, but setting ipFamily: 4 in my tests did not take effect.

Forcing IPv4 globally in Node does not match the actual situation either.

The IPv6 -> IPv4 fallback is not resolved properly in proxy-chain, which is why using Fiddler's listening port 8888 as the proxy in the earlier tests worked properly.

Completely solving this problem would require reworking the proxy-chain connection handling, and I currently don't have much time to deal with it.

Crawlee is affected by the same problem, and the same workaround applies there.

The fastest way to work around the problem for now is to add another proxy module as an intermediary to handle the IPv6-to-IPv4 fallback; see the sketch below. This is currently the simplest method.
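For example, a minimal sketch of that workaround with Crawlee's ProxyConfiguration, assuming an intermediary proxy such as Fiddler listening locally on 127.0.0.1:8888:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Route all crawler traffic through a local intermediary proxy (e.g. Fiddler),
// which then connects to the target server and handles the address selection.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://127.0.0.1:8888'], // the intermediary's address is an assumption
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ request, page, log }) {
        log.info(`Processing ${request.url}...`);
        console.log('title:', await page.title());
    },
});

await crawler.run(['https://www.hbhtcm.com']);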

I hope this problem can be solved as soon as possible. Thanks!

jeff1998-git · May 31 '25 16:05