Duplicate Requests Being Processed
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/core
Issue description
I noticed two things related to the guarantee that each URL is processed only once. First, when I provide NYU's site as the starting URL, I get the following:
INFO PlaywrightCrawler: Starting the crawl
INFO PlaywrightCrawler: NYU {"url":"https://www.nyu.edu/"}
INFO PlaywrightCrawler: NYU {"url":"https://www.nyu.edu/"}
I also get two items written to the dataset.
Second, I'm noticing that a hash fragment affects the uniqueness of a URL. How can I strip a URL down to its barest form before it goes through the uniqueKey computation? I tried to use transformRequestFunction but ended up with 16k duplicates, so I'm looking for the official solution.
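For what it's worth, one sketch of that normalization uses the transformRequestFunction option of enqueueLinks and overrides each request's uniqueKey instead of rewriting its url; the specific rules below (dropping the fragment, folding http into https) are assumptions, not an official recommendation:

await enqueueLinks({
    transformRequestFunction: (request) => {
        // Normalize only the dedup key, not the URL the crawler visits.
        const url = new URL(request.url)
        url.hash = ''            // drop the #fragment
        url.protocol = 'https:'  // fold http and https together
        request.uniqueKey = url.href
        return request
    },
})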
Code sample
import { PlaywrightCrawler } from 'crawlee'
import { router } from './router.js' // the router file below

const crawler = new PlaywrightCrawler({
    requestHandler: router,
})
await crawler.run(["https://www.nyu.edu"])

/// router file
import { createPlaywrightRouter, Dataset } from 'crawlee'

export const router = createPlaywrightRouter()
router.addDefaultHandler(async ({ request, enqueueLinks, page, log }) => {
    const title = await page.title()
    // process
    log.info(`${title}`, { url: request.loadedUrl })
    await Dataset.pushData({
        url: request.loadedUrl,
        title,
        page: await page.content(), // `doc` was undefined in the original sample
    })
    await enqueueLinks()
})
Package version
3.2.2
Node.js version
v18.12
Operating system
M1 Mac
Apify platform
- [ ] Tick me if you encountered this issue on the Apify platform
I have tested this on the next release
No response
Other context
No response
~Not reproducible; I doubt such a bug would be reproducible without anything special - our devs who use crawlee on a daily basis would have told us already...~
Actually, it is reproducible.
The problem here is that the page contains an http link to itself. You are logging loadedUrl, which is what you get after redirects, but if you log just url, you can see what is happening:
INFO PlaywrightCrawler: Starting the crawl
INFO PlaywrightCrawler: NYU {"url":"https://www.nyu.edu/","loadedUrl":"https://www.nyu.edu/"}
INFO PlaywrightCrawler: NYU {"url":"http://www.nyu.edu/","loadedUrl":"https://www.nyu.edu/"}
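For reference, surfacing both fields only requires logging request.url alongside request.loadedUrl in the handler:

// Log both the enqueued URL and the final URL after redirects.
log.info(`${title}`, { url: request.url, loadedUrl: request.loadedUrl })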
cc @vladfrangu, this needs to be handled automatically somehow. Any idea where?
Oof, this is a tough one.
> any idea where
It's either at the request queue level or the enqueueLinks level, but it's tough, since the URLs technically do not match between the initial URL and the found one 😅
I guess we should be treating URLs as identical regardless of scheme (which sounds more like a request queue change)? Thoughts?
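A rough sketch of the kind of normalization such a request queue change might apply when computing a request's uniqueKey (the helper below is hypothetical, not part of the crawlee API):

function normalizedUniqueKey(rawUrl: string): string {
    const url = new URL(rawUrl)
    url.protocol = 'https:' // fold http and https together
    url.hash = ''           // fragments never reach the server anyway
    return url.href
}

// Both spellings of the NYU homepage collapse to the same key:
normalizedUniqueKey('http://www.nyu.edu/')  // => 'https://www.nyu.edu/'
normalizedUniqueKey('https://www.nyu.edu/') // => 'https://www.nyu.edu/'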
@B4nan That makes sense, thank you for figuring it out!
@vladfrangu I think you're on point with a great first step. Better normalization of the URL would definitely help in most cases.
However, in the event of a real redirect, this wouldn't help. Maybe at some point there could be an opt-in feature that runs post-redirect checks and exits early.
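A rough sketch of what such an opt-in post-redirect check could look like in user land today (hypothetical; an in-memory Set would not survive crawler restarts or migrations, so a real feature would belong in the request queue):

const seenLoadedUrls = new Set<string>()

router.addDefaultHandler(async ({ request, log }) => {
    // Dedup on the final URL after redirects, not the enqueued URL.
    const finalUrl = request.loadedUrl ?? request.url
    if (seenLoadedUrls.has(finalUrl)) {
        log.debug(`Skipping duplicate after redirect: ${finalUrl}`)
        return // exit early
    }
    seenLoadedUrls.add(finalUrl)
    // ...process the page as usual
})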
@vladfrangu @jackHedaya do you have any update on this issue? We are running into the same issue. Thanks a lot!