crawlee
RequestList messes up URLs containing characters like ' or * when populated with requestsFromUrl
Which package is this bug report for? If unsure which one to select, leave blank
None
Issue description
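RequestList seems to mangle URLs that contain characters like ' or * when it is populated via requestsFromUrl. It uses a regex to extract the URLs from the fetched file, but real-world URLs do not always follow the spec, so the regex may fail to capture such URLs correctly. The same URL passed directly via url is not affected.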
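To illustrate the kind of failure I mean, here is a minimal sketch. The regex below is a made-up, deliberately strict pattern and is NOT the one Crawlee actually uses; ILLUSTRATIVE_URL_REGEX and extractWithRegex are hypothetical names. It only shows how a character class that excludes ' and * cuts a URL short.

```javascript
// Hypothetical pattern for illustration only, not Crawlee's real regex.
// It stops matching at whitespace, quotes, or an asterisk.
const ILLUSTRATIVE_URL_REGEX = /https?:\/\/[^\s'"*]+/g;

// Hypothetical helper: returns every substring the pattern matches.
function extractWithRegex(text) {
    return text.match(ILLUSTRATIVE_URL_REGEX) ?? [];
}

const sample = "https://www.zillow.com/homedetails/141-O'Canoe-Pl-Hampton-VA-23661/74398007_zpid/";
const extracted = extractWithRegex(sample);

// The match ends right before the apostrophe in "O'Canoe".
console.log(extracted); // [ 'https://www.zillow.com/homedetails/141-O' ]
```

Whatever the actual pattern is, the symptom would look similar: the URL that ends up in the list is shorter than, or different from, the one in the file.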
Code sample
import { RequestList } from '@crawlee/core';

const startUrls1 = await RequestList.open('startUrls1', [
    {
        "requestsFromUrl": "https://pastebin.com/raw/VHLFnh2h"
    }
]);

// this is just like above, but directly without an external file
const startUrls2 = await RequestList.open('startUrls2', [
    {
        "url": "https://www.zillow.com/homedetails/141-O'Canoe-Pl-Hampton-VA-23661/74398007_zpid/"
    }
]);

console.log(startUrls1.requests);
console.log(startUrls2.requests);
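A possible workaround until this is fixed, sketched under the assumption that the remote file holds exactly one URL per line: split the text on newlines instead of using a regex, and pass the results as plain url sources, which worked for startUrls2 above. urlsFromLines and isParsableUrl are hypothetical helpers, not part of Crawlee, and the example.com URL is made up.

```javascript
// Hypothetical helper: true if the WHATWG URL parser accepts the string.
function isParsableUrl(candidate) {
    try {
        new URL(candidate);
        return true;
    } catch {
        return false;
    }
}

// Hypothetical helper: treats every non-empty line as one URL and keeps it verbatim.
// Assumes one URL per line; lines the URL parser rejects are skipped.
function urlsFromLines(text) {
    return text
        .split(/\r?\n/)
        .map((line) => line.trim())
        .filter((line) => line.length > 0 && isParsableUrl(line))
        .map((url) => ({ url }));
}

// Stand-in for the downloaded file contents (the second URL is invented).
const fileContents = [
    "https://www.zillow.com/homedetails/141-O'Canoe-Pl-Hampton-VA-23661/74398007_zpid/",
    "",
    "https://example.com/search?q=a*b",
].join("\n");

const sources = urlsFromLines(fileContents);
console.log(sources);
```

The resulting array has the same shape as the sources in the second RequestList.open call, so it could be passed there in place of a requestsFromUrl entry. This only sidesteps the extraction step; it does not fix it.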
Package version
v3.12.0
Node.js version
20.18.1
Operating system
No response
Apify platform
- [ ] Tick me if you encountered this issue on the Apify platform
I have tested this on the next release
No response
Other context
No response
Please keep the link to the Slack thread in such reports so we have the additional context too.
https://apify.slack.com/archives/C0L33UM7Z/p1732817024848529