Form POST request returns error page using BasicCrawler, but works when using `node-fetch`
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/basic (BasicCrawler)
Issue description
I am trying to scrape the page that results from a simple form submission (the form actually has no fields, so the request body is empty) using a POST request.
I submitted this form using Postman, and it worked perfectly. I tried to run it using the node-fetch library, and it also worked perfectly.
However, when I tried to do the same using BasicCrawler, I got an error page from the website (with HTTP 200 status, but the content says there is an error). I attach both versions of the code: one using fetch and one using BasicCrawler.
You can compare the lengths of the two responses to see the difference: the error page is 13,340 characters long, while the correct page is 849,470 characters.
Code sample
// BasicCrawler
import { BasicCrawler } from "crawlee";

const crawler = new BasicCrawler({
    async requestHandler({ sendRequest, log }) {
        const { body } = await sendRequest({
            method: 'POST',
            url: 'https://www.idealo.de/hp/prg/bargains',
            headers: {
                'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8',
                'Accept-Language': 'en-US,en;q=0.5',
                'Accept-Encoding': 'gzip, deflate, br, zstd',
                'Content-Type': 'application/x-www-form-urlencoded',
                'Content-Length': '0',
                'Origin': 'https://www.idealo.de',
                'Connection': 'keep-alive',
                'Referer': 'https://www.idealo.de/',
                'Upgrade-Insecure-Requests': '1',
                'Sec-Fetch-Dest': 'document',
                'Sec-Fetch-Mode': 'navigate',
                'Sec-Fetch-Site': 'same-origin',
                'Sec-Fetch-User': '?1',
                'Priority': 'u=0, i',
                'TE': 'trailers',
            },
            body: '',
        });
        log.info(`${body.length}`);
    },
});

await crawler.run(['https://www.idealo.de/hp/prg/bargains']);
// fetch (code exported from Postman; Headers import added for node-fetch v3)
import fetch, { Headers } from "node-fetch";

const myHeaders = new Headers();
myHeaders.append("User-Agent", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0");
myHeaders.append("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8");
myHeaders.append("Accept-Language", "en-US,en;q=0.5");
myHeaders.append("Accept-Encoding", "gzip, deflate, br, zstd");
myHeaders.append("Content-Type", "application/x-www-form-urlencoded");
myHeaders.append("Content-Length", "0");
myHeaders.append("Origin", "https://www.idealo.de");
myHeaders.append("Connection", "keep-alive");
myHeaders.append("Referer", "https://www.idealo.de/");
myHeaders.append("Upgrade-Insecure-Requests", "1");
myHeaders.append("Sec-Fetch-Dest", "document");
myHeaders.append("Sec-Fetch-Mode", "navigate");
myHeaders.append("Sec-Fetch-Site", "same-origin");
myHeaders.append("Sec-Fetch-User", "?1");
myHeaders.append("Priority", "u=0, i");
myHeaders.append("TE", "trailers");

const requestOptions = {
    method: "POST",
    headers: myHeaders,
    redirect: "follow",
};

fetch("https://www.idealo.de/hp/prg/bargains", requestOptions)
    .then((response) => response.text())
    .then((result) => console.log(result.length))
    .catch((error) => console.error(error));
Package version
3.11.0
Node.js version
v20.15.1
Operating system
Ubuntu 22.04
Apify platform
- [ ] Tick me if you encountered this issue on the Apify platform
I have tested this on the next release
No response
Other context
No response
Hello, and thank you for your interest in this project!
This happens because Crawlee resends the redirected request with the same method (POST in this case), while clients like node-fetch (or undici) change the method to GET.
Crawlee:
> POST https://www.idealo.de/hp/prg/bargains
< 301 Moved Permanently
< Location: https://www.idealo.de/preisvergleich/MainSearchProductCategory/100oE0oJ4.html
> POST https://www.idealo.de/preisvergleich/MainSearchProductCategory/100oE0oJ4.html
< 404 Not Found
fetch:
> POST https://www.idealo.de/hp/prg/bargains
< 301 Moved Permanently
< Location: https://www.idealo.de/preisvergleich/MainSearchProductCategory/100oE0oJ4.html
> GET https://www.idealo.de/preisvergleich/MainSearchProductCategory/100oE0oJ4.html
< 301 Moved Permanently
...
...
< 200 OK
The specification is ambiguous here: it allows both switching the method to GET and keeping the same method for the redirected request. Since most major user agents (e.g. Chrome and Firefox as well) switch the method on redirect, in my opinion we should support this behavior in Crawlee too.
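The rewrite is easy to observe in isolation, without hitting idealo.de at all. The sketch below (assuming Node 18+ with its built-in fetch) starts a throwaway local server that answers a POST with a 301 and echoes the method of whatever request arrives at the redirect target:

```javascript
// Self-contained demo (Node 18+, built-in fetch): a local server answers
// POST /start with a 301 and echoes the request method at /target.
import { createServer } from "node:http";
import { once } from "node:events";

const server = createServer((req, res) => {
    if (req.url === "/start") {
        res.writeHead(301, { Location: "/target" });
        res.end();
    } else {
        // Echo the HTTP method the redirected request arrived with.
        res.end(req.method);
    }
});
server.listen(0);
await once(server, "listening");
const { port } = server.address();

const res = await fetch(`http://127.0.0.1:${port}/start`, { method: "POST" });
const redirectedMethod = await res.text();
console.log(redirectedMethod); // "GET" (fetch rewrote POST to GET on the 301)
server.close();
server.closeAllConnections?.(); // let the process exit promptly
```

Running this prints "GET": fetch followed the 301 but re-issued the request with a rewritten method, which is exactly the behavior Crawlee currently does not reproduce.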
Thanks for the detailed answer, @barjin. I understand.
IMO, given that Crawlee is a web scraping library that should mimic browser behavior, and since you mentioned that major browsers switch the HTTP method to GET, Crawlee should follow that pattern regardless of what the specification allows.
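Until Crawlee changes its redirect handling, one possible workaround (a sketch of the general pattern, not Crawlee's documented API) is to disable automatic redirect following and re-issue the redirected request as a GET yourself; in Crawlee this might mean passing an option such as got's followRedirect: false through sendRequest, though whether that option is forwarded is an assumption here. The snippet below demonstrates the pattern with Node built-ins against a throwaway local server, using node:http for the first hop because it never follows redirects:

```javascript
// Manual-redirect workaround sketch (Node 18+ for built-in fetch),
// demonstrated against a throwaway local server instead of the real site.
import { createServer, request } from "node:http";
import { once } from "node:events";

const server = createServer((req, res) => {
    if (req.url === "/form") {
        res.writeHead(301, { Location: "/result" });
        res.end();
    } else {
        res.end(`${req.method} ${req.url}`);
    }
});
server.listen(0);
await once(server, "listening");
const base = `http://127.0.0.1:${server.address().port}`;

// 1. Send the POST with a client that never follows redirects (node:http),
//    and read the Location header of the 301 ourselves.
const post = request(`${base}/form`, { method: "POST" });
post.end();
const [redirect] = await once(post, "response");
redirect.resume(); // discard the (empty) redirect body
const location = new URL(redirect.headers.location, base);

// 2. Re-issue the redirected request as a GET, mimicking browser behavior.
const res = await fetch(location);
const result = await res.text();
console.log(result); // "GET /result"
server.close();
server.closeAllConnections?.(); // let the process exit promptly
```

The same two-step idea applies to any HTTP client that lets you turn off automatic redirects (fetch's redirect: "manual", got's followRedirect: false): inspect the Location header, then navigate to it with a fresh GET.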