
Form POST request returns error page using BasicCrawler, but works when using `node-fetch`

Open · Hamza5 opened this issue on Jul 22, 2024 · 2 comments

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/basic (BasicCrawler)

Issue description

I am trying to scrape the page that results from a simple form submission (the form actually has no fields, so the POST body is empty) using a POST request.

I submitted this form using Postman, and it worked perfectly. I tried to run it using the node-fetch library, and it also worked perfectly.

However, when I tried to do the same using BasicCrawler, I got an error page from the website (HTTP 200 status, but the page content reports an error). I attach both versions of the code below: BasicCrawler and node-fetch.

You can compare the lengths of the two responses to see the difference: the error page is 13,340 characters long, while the correct page is 849,470 characters.

Code sample

// BasicCrawler

import { BasicCrawler } from "crawlee";

const crawler = new BasicCrawler({
    async requestHandler({ sendRequest, log }) {
        const { body } = await sendRequest({
            method: 'POST',
            url: 'https://www.idealo.de/hp/prg/bargains',
            headers: {
                'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8',
                'Accept-Language': 'en-US,en;q=0.5',
                'Accept-Encoding': 'gzip, deflate, br, zstd',
                'Content-Type': 'application/x-www-form-urlencoded',
                'Content-Length': '0',
                'Origin': 'https://www.idealo.de',
                'Connection': 'keep-alive',
                'Referer': 'https://www.idealo.de/',
                'Upgrade-Insecure-Requests': '1',
                'Sec-Fetch-Dest': 'document',
                'Sec-Fetch-Mode': 'navigate',
                'Sec-Fetch-Site': 'same-origin',
                'Sec-Fetch-User': '?1',
                'Priority': 'u=0, i',
                'TE': 'trailers'
            },
            body: ''
        });
        log.info(`Body length: ${body.length}`);
    }
});

await crawler.run(['https://www.idealo.de/hp/prg/bargains']);

// fetch (code exported from Postman)

import fetch, { Headers } from "node-fetch";

const myHeaders = new Headers();
myHeaders.append("User-Agent", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0");
myHeaders.append("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8");
myHeaders.append("Accept-Language", "en-US,en;q=0.5");
myHeaders.append("Accept-Encoding", "gzip, deflate, br, zstd");
myHeaders.append("Content-Type", "application/x-www-form-urlencoded");
myHeaders.append("Content-Length", "0");
myHeaders.append("Origin", "https://www.idealo.de");
myHeaders.append("Connection", "keep-alive");
myHeaders.append("Referer", "https://www.idealo.de/");
myHeaders.append("Upgrade-Insecure-Requests", "1");
myHeaders.append("Sec-Fetch-Dest", "document");
myHeaders.append("Sec-Fetch-Mode", "navigate");
myHeaders.append("Sec-Fetch-Site", "same-origin");
myHeaders.append("Sec-Fetch-User", "?1");
myHeaders.append("Priority", "u=0, i");
myHeaders.append("TE", "trailers");

const requestOptions = {
    method: "POST",
    headers: myHeaders,
    redirect: "follow"
};

fetch("https://www.idealo.de/hp/prg/bargains", requestOptions)
    .then((response) => response.text())
    .then((result) => console.log(result.length))
    .catch((error) => console.error(error));

Package version

3.11.0

Node.js version

v20.15.1

Operating system

Ubuntu 22.04

Apify platform

  • [ ] Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

Hamza5 · Jul 22, 2024

Hello, and thank you for your interest in this project!

This happens because Crawlee resends the redirected request with the same method (POST in this case), whereas node-fetch (and undici) change the method to GET.

Crawlee:

> POST https://www.idealo.de/hp/prg/bargains
< 301 Moved Permanently
< Location: https://www.idealo.de/preisvergleich/MainSearchProductCategory/100oE0oJ4.html

> POST https://www.idealo.de/preisvergleich/MainSearchProductCategory/100oE0oJ4.html
< 404 Not Found

fetch:

> POST https://www.idealo.de/hp/prg/bargains
< 301 Moved Permanently
< Location: https://www.idealo.de/preisvergleich/MainSearchProductCategory/100oE0oJ4.html

> GET https://www.idealo.de/preisvergleich/MainSearchProductCategory/100oE0oJ4.html
< 301 Moved Permanently
...
...
< 200 OK

The specification is ambiguous here: it allows both switching the method to GET and reusing the original method for the redirected request. Since most major user agents (e.g. Chrome and Firefox) switch the method on redirect, in my opinion Crawlee should support this behavior as well.
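In the meantime, a possible workaround is to disable automatic redirect following and issue the follow-up GET yourself. A minimal sketch, assuming `sendRequest` forwards got options such as `followRedirect` (untested against this particular site):

// Workaround sketch: follow the first redirect manually so the second request is a GET

import { BasicCrawler } from "crawlee";

const crawler = new BasicCrawler({
    async requestHandler({ request, sendRequest, log }) {
        // Send the POST, but do not let the HTTP client follow redirects,
        // so we control the method of the follow-up request ourselves.
        const response = await sendRequest({
            method: 'POST',
            url: request.url,
            followRedirect: false,
        });

        if (response.statusCode >= 300 && response.statusCode < 400 && response.headers.location) {
            // Resolve the Location header against the original URL and re-issue
            // the request as GET, mirroring what browsers do for 301/302 after a POST.
            const redirectedUrl = new URL(response.headers.location, request.url).href;
            const { body } = await sendRequest({ method: 'GET', url: redirectedUrl });
            log.info(`Body length: ${body.length}`);
        }
    },
});

await crawler.run(['https://www.idealo.de/hp/prg/bargains']);

Any further redirects after the first GET are followed automatically, and since the method is already GET at that point, they behave the same as in the fetch trace above. got also exposes a `methodRewriting` option that, when set to `true`, rewrites the method to GET on 301/302 responses; if `sendRequest` forwards it, passing `methodRewriting: true` might be enough on its own.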

barjin · Nov 20, 2025

Thanks for the detailed answer, @barjin. I understand.

IMO, given that Crawlee is a web scraping library that should mimic browser behavior, and since you mentioned that major browsers switch the HTTP method to GET, Crawlee should follow that pattern regardless of what the specification says.

Hamza5 · Nov 20, 2025