
Duplicate Requests Being Processed

Open jackHedaya opened this issue 2 years ago • 5 comments

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/core

Issue description

I noticed two things with regard to URLs being processed only once. First, when I provide NYU's site as the starting URL, I get the following:

INFO  PlaywrightCrawler: Starting the crawl
INFO  PlaywrightCrawler: NYU {"url":"https://www.nyu.edu/"}
INFO  PlaywrightCrawler: NYU {"url":"https://www.nyu.edu/"}

I also get two items written to the dataset.

Secondly, I'm noticing that a hash fragment affects the uniqueness of a URL. How can I completely strip a URL down to its barest form before it goes through the uniqueKey computation? I tried to use transformRequestFunction but ended up with 16k duplicates, so I'm looking for the official solution.
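One possible approach (a sketch, not necessarily the official solution) is to normalize each URL inside `enqueueLinks`'s `transformRequestFunction` before the request is enqueued, so the derived uniqueKey no longer varies by fragment or scheme. The `normalizeUrl` helper below is hypothetical; forcing `https:` assumes both schemes serve the same page.

```typescript
// Sketch: strip the "#fragment" (and force https) before the request's
// uniqueKey is derived, so fragment/scheme variants collapse to one key.
function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  url.hash = "";            // drop "#fragment"
  url.protocol = "https:";  // assumption: http and https are the same page
  return url.href;
}

// Hypothetical usage inside the default handler:
// await enqueueLinks({
//   transformRequestFunction: (req) => {
//     req.url = normalizeUrl(req.url);
//     req.uniqueKey = req.url;
//     return req;
//   },
// });
```

Returning `false` from `transformRequestFunction` instead would skip a request entirely, which may be useful for filtering rather than normalizing.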

Code sample

import { PlaywrightCrawler } from "crawlee"
import { router } from "./router" // hypothetical path to the router file below

const crawler = new PlaywrightCrawler({
  requestHandler: router,
})

await crawler.run(["https://www.nyu.edu"])

/// router file

import { createPlaywrightRouter, Dataset } from "crawlee"

export const router = createPlaywrightRouter()

router.addDefaultHandler(async ({ request, enqueueLinks, page, log }) => {
  const title = await page.title()

  const doc = await page.content() // full page HTML, stored in the dataset below

  log.info(`${title}`, { url: request.loadedUrl })

  await Dataset.pushData({
    url: request.loadedUrl,
    title,
    page: doc,
  })

  await enqueueLinks()
})

Package version

3.2.2

Node.js version

v18.12

Operating system

M1 Mac

Apify platform

  • [ ] Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

jackHedaya avatar Feb 28 '23 07:02 jackHedaya

~~Not reproducible. I doubt we would have such a bug be reproducible without anything special involved; our devs who use crawlee on a daily basis would have told us already...~~

Actually it is reproducible

B4nan avatar Feb 28 '23 08:02 B4nan

The problem here is that the page contains an `http://` link to itself. You are logging `loadedUrl`, which is what you get after redirects, but if you log the plain `url`, you can see what is happening:

INFO  PlaywrightCrawler: Starting the crawl
INFO  PlaywrightCrawler: NYU {"url":"https://www.nyu.edu/","loadedUrl":"https://www.nyu.edu/"}
INFO  PlaywrightCrawler: NYU {"url":"http://www.nyu.edu/","loadedUrl":"https://www.nyu.edu/"}

cc @vladfrangu, this needs to be handled automatically somehow, any idea where?

B4nan avatar Feb 28 '23 08:02 B4nan

Oof, this is a tough one.

any idea where

It's either at the request queue level or at the `enqueueLinks` level, but it's tough, since the URLs technically do not match between the initial URL and the found one 😅

I guess we should be ignoring identical URLs regardless of scheme (which sounds more like a request queue change)? Thoughts?
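The idea above could be sketched as a scheme-agnostic uniqueKey. Note this is a hypothetical helper illustrating the proposal, not current crawlee behavior; `schemeAgnosticKey` is an invented name.

```typescript
// Sketch: collapse http/https so both scheme variants of the same URL
// map to the same deduplication key in the request queue.
function schemeAgnosticKey(raw: string): string {
  const url = new URL(raw);
  return url.href.replace(/^https?:/, "https:");
}
```

With such a key, `http://www.nyu.edu/` and `https://www.nyu.edu/` would deduplicate to a single request.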

vladfrangu avatar Feb 28 '23 09:02 vladfrangu

@B4nan That makes sense, thank you for figuring it out!

@vladfrangu I think you're on point; that would be a great first step. Better normalization of the URL would definitely help in most cases.

However, in the event of a real redirect, this wouldn't help. Maybe at some point there could be an opt-in feature that performs post-redirect duplicate checks and exits early.
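As a stopgap, the post-redirect check could be approximated in user land by tracking every `loadedUrl` already handled and bailing out on repeats. This is a sketch, not a crawlee API; `seenBefore` is an invented helper, and it only deduplicates within one process.

```typescript
// Sketch: remember every loadedUrl already processed so a request that
// redirects to an already-handled page can be skipped early.
const seenLoadedUrls = new Set<string>();

function seenBefore(loadedUrl: string): boolean {
  if (seenLoadedUrls.has(loadedUrl)) return true;
  seenLoadedUrls.add(loadedUrl);
  return false;
}

// Hypothetical use at the top of a requestHandler:
// if (request.loadedUrl && seenBefore(request.loadedUrl)) return;
```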

jackHedaya avatar Mar 10 '23 15:03 jackHedaya

@vladfrangu @jackHedaya do you have any update on this issue? We are running into the same problem. Thanks a lot!

boehlerlukas avatar Jan 08 '24 10:01 boehlerlukas