
Get current URL in customCrawl()

popstas opened this issue 4 years ago · 3 comments

What is the current behavior? There is no information about the current URL in customCrawl()

What is the motivation / use case for changing the behavior? I want to skip the request but still add the URL to a CSV for some file types like zip, doc, and pdf. My code that does this: https://github.com/viasite/sites-scraper/blob/59449b1b03/src/scrap-site.js#L240-L255

Proposal: pass the crawler instance to customCrawl, i.e. customCrawl: async (page, crawl, crawler)

I tried to store the current URL using the requeststarted event, but it fails when concurrency > 1.

What do you think about it? I can make PR.

popstas avatar Apr 27 '20 13:04 popstas

Hey @popstas This is a valid proposal. I had the same issue. Yeah, please do the PR, and please don't forget to add the related info to the docs. It's been a while since you posted this, so please let me know if you are still willing to do this.

kulikalov avatar Oct 17 '20 06:10 kulikalov

We can use the preRequest option to skip URLs; we can persist the URL or do anything else with it there.
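For reference, preRequest receives the request options before each request and can return false to skip it. A minimal sketch of the zip/doc/pdf case from the original post; the isBinaryDownload helper and the CSV-writing step are illustrative names, not library API:

```javascript
// Extensions we want to record but not crawl (from the original use case).
const SKIP_EXTENSIONS = ['.zip', '.doc', '.docx', '.pdf'];

// Returns true if the URL's path ends with one of the skipped extensions.
function isBinaryDownload(url) {
  const pathname = new URL(url).pathname.toLowerCase();
  return SKIP_EXTENSIONS.some(ext => pathname.endsWith(ext));
}

// Assumed wiring with HCCrawler (sketch, not tested against the library):
// const crawler = await HCCrawler.launch({
//   preRequest: options => {
//     if (isBinaryDownload(options.url)) {
//       appendUrlToCsv(options.url); // hypothetical CSV writer
//       return false;                // returning false skips the request
//     }
//     return true;
//   },
// });
```

This keeps the skip/record decision in one place instead of inside customCrawl.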

iamprageeth avatar Jun 19 '22 06:06 iamprageeth

It's been 2 years since the issue was opened, but for anyone looking to get the current URL in the future, it's available in the result object of a customCrawl, specifically result.options.url. Something like this should do the trick:

customCrawl: async (page, crawl) => {
    // Enable request interception so requests can be inspected before continuing
    await page.setRequestInterception(true);
    page.on('request', request => request.continue());
    page.on('error', err => console.error(err));

    const result = await crawl();
    const currentUrl = result.options.url; // the URL being crawled
    // ... whatever logic you want
    return result;
}

JacksonSabol avatar Jul 09 '22 18:07 JacksonSabol