
RequestList messes up URLs containing characters like ' or * when populated with requestsFromUrl

Open • webrdaniel opened this issue 1 year ago • 1 comment

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

RequestList appears to mangle URLs containing characters such as ' or * when it is populated via requestsFromUrl. It uses a regex to extract the URLs from the fetched file, but real-world URLs do not always follow the spec strictly, so the regex may not match them in full.
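To illustrate the kind of failure described above, here is a minimal sketch of how a regex with a restrictive character allow-list can cut a URL short at an apostrophe. The pattern `NAIVE_URL_REGEX` and the helper `extractUrlsNaively` are hypothetical and written only for this illustration; they are not the regex Crawlee actually uses.

```javascript
// Hypothetical pattern for illustration only; this is NOT Crawlee's actual regex.
// It stops matching at the first character outside its allow-list, such as ' or *.
const NAIVE_URL_REGEX = /https?:\/\/[\w\-.\/?=&%#]+/g;

// Hypothetical helper: returns every substring the naive pattern matches.
function extractUrlsNaively(text) {
    return text.match(NAIVE_URL_REGEX) ?? [];
}

const sample = "https://www.zillow.com/homedetails/141-O'Canoe-Pl-Hampton-VA-23661/74398007_zpid/";
const extracted = extractUrlsNaively(sample);

// The match ends right before the apostrophe, so the rest of the path is lost.
console.log(extracted);
```

Whether the real pattern fails in exactly this way is an assumption; the sketch only shows why regex-based extraction is sensitive to characters that are uncommon in URLs.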

Code sample

import { RequestList } from '@crawlee/core';

const startUrls1 = await RequestList.open('startUrls1', [
    {
        "requestsFromUrl": "https://pastebin.com/raw/VHLFnh2h"
    }
]);
// Same as above, but the URL is passed directly instead of being loaded from an external file
const startUrls2 = await RequestList.open('startUrls2', [
    {
        "url": "https://www.zillow.com/homedetails/141-O'Canoe-Pl-Hampton-VA-23661/74398007_zpid/"
    }
]);
console.log(startUrls1.requests);
console.log(startUrls2.requests);
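One possible workaround, sketched under the assumption that the remote file holds exactly one URL per line, is to parse the file contents yourself and pass the results as plain `url` entries, the form the sample above contrasts with requestsFromUrl. The helper `parseUrlLines` and the inline `fileContents` string are hypothetical; in practice the text would come from downloading the file.

```javascript
// Hypothetical helper, not part of Crawlee: treats each non-empty line as one URL
// and keeps the original string untouched instead of extracting it with a regex.
function parseUrlLines(text) {
    return text
        .split(/\r?\n/)
        .map((line) => line.trim())
        .filter((line) => line.length > 0)
        .filter((line) => {
            try {
                new URL(line); // throws on strings that cannot be parsed as a URL
                return true;
            } catch {
                return false;
            }
        })
        .map((url) => ({ url }));
}

// Stand-in for the downloaded file contents (assumed format: one URL per line).
const fileContents = [
    "https://www.zillow.com/homedetails/141-O'Canoe-Pl-Hampton-VA-23661/74398007_zpid/",
    '',
    'not a url',
    'https://example.com/search?q=a*b',
].join('\n');

const sources = parseUrlLines(fileContents);
console.log(sources);
```

The resulting array has the same shape as the second `RequestList.open` call in the sample, so it could be passed in place of the `requestsFromUrl` entry. This sidesteps the extraction step rather than fixing it.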

Package version

v3.12.0

Node.js version

20.18.1

Operating system

No response

Apify platform

  • [ ] Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

webrdaniel • Nov 29 '24 07:11

Please include the link to the Slack thread in reports like this so that we have the additional context too.

https://apify.slack.com/archives/C0L33UM7Z/p1732817024848529

B4nan • Nov 29 '24 07:11