
Add autoScroll for crawling multi-page sites?

Open SOSONAGI opened this issue 1 year ago • 3 comments

I just ran the original code against our platform's pages, and it couldn't capture the full content of each page.

So I added an autoScroll function to main.ts, and it worked perfectly. (I think this is better than increasing waitForSelectorTimeout.)

async function autoScroll(page: Page) {
  await page.evaluate(async () => {
    await new Promise<void>((resolve) => {
      let totalHeight = 0;
      const distance = 100;
      // Scroll down `distance` pixels every 100 ms until the page bottom
      // is reached, so lazy-loaded content has a chance to render.
      const timer = setInterval(() => {
        const scrollHeight = document.body.scrollHeight;
        window.scrollBy(0, distance);
        totalHeight += distance;

        if (totalHeight >= scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}
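A side note (my own observation, not from the thread): on a page of fixed height the loop above terminates once totalHeight reaches document.body.scrollHeight, so with distance = 100 and a 100 ms interval the scroll takes roughly scrollHeight / 100 ticks. This browser-free sketch (the scrollTicks name is mine, purely for illustration) simulates just that counting logic:

```typescript
// Simulates the termination logic of the autoScroll loop on a page of
// fixed height. Pure sketch for reasoning about runtime; not crawler code.
function scrollTicks(scrollHeight: number, distance: number): number {
  let totalHeight = 0;
  let ticks = 0;
  while (totalHeight < scrollHeight) {
    totalHeight += distance;
    ticks += 1;
  }
  return ticks;
}

// A 5000px-tall page with 100px steps finishes in 50 ticks, i.e. about
// 5 seconds at the 100ms interval used above.
console.log(scrollTicks(5000, 100));
```

On infinite-scroll feeds, scrollHeight keeps growing as content loads, so the real loop can run much longer; adding a cap on the number of iterations may be worth considering for such pages.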

if (process.env.NO_CRAWL !== "true") {
  const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log, pushData }) {
      try {
        if (config.cookie) {
          const cookie = {
            name: config.cookie.name,
            value: config.cookie.value,
            url: request.loadedUrl, 
          };
          await page.context().addCookies([cookie]);
        }

        const title = await page.title();
        log.info(`Crawling ${request.loadedUrl}...`);

        await page.waitForSelector(config.selector, {
          timeout: config.waitForSelectorTimeout,
        });

        await autoScroll(page);  

        const html = await getPageHtml(page);
        await pushData({ title, url: request.loadedUrl, html });

        if (config.onVisitPage) {
          await config.onVisitPage({ page, pushData });
        }

        await enqueueLinks({
          globs: [config.match],
        });
      } catch (error) {
        log.error(`Error crawling ${request.loadedUrl}: ${error}`);
      }
    },
    maxRequestsPerCrawl: config.maxPagesToCrawl,
    // headless: false,
  });

  await crawler.run([config.url]);
}

If you think this is good enough for crawling, I hope it will be helpful for other users.

Thanks for your work, by the way! I really appreciate it.

Thank you.

SOSONAGI avatar Nov 19 '23 22:11 SOSONAGI

Thanks for your code. How can I add it to the repo? Can you share the complete code?

franklili3 avatar Dec 14 '23 09:12 franklili3

Just go to main.ts, copy the code above into it, and run it!

SOSONAGI avatar Dec 16 '23 10:12 SOSONAGI

Thanks.


franklili3 avatar Dec 16 '23 11:12 franklili3