gpt-crawler
Add autoScroll for crawling multiple pages?
I just ran the original code on our platform's pages, and it couldn't capture the full content of those pages.
Therefore, I added autoScroll code to main.ts, and it worked perfectly. (I think this is better than increasing waitForSelectorTimeout.)
// Assumes Page is imported from "playwright" at the top of main.ts.
async function autoScroll(page: Page) {
  // Scroll down 100px every 100ms until the total scrolled distance
  // reaches the current height of the page.
  await page.evaluate(async () => {
    await new Promise<void>((resolve) => {
      let totalHeight = 0;
      const distance = 100;
      const timer = setInterval(() => {
        const scrollHeight = document.body.scrollHeight;
        window.scrollBy(0, distance);
        totalHeight += distance;
        if (totalHeight >= scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}
if (process.env.NO_CRAWL !== "true") {
  const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log, pushData }) {
      try {
        if (config.cookie) {
          const cookie = {
            name: config.cookie.name,
            value: config.cookie.value,
            url: request.loadedUrl,
          };
          await page.context().addCookies([cookie]);
        }
        const title = await page.title();
        log.info(`Crawling ${request.loadedUrl}...`);
        await page.waitForSelector(config.selector, {
          timeout: config.waitForSelectorTimeout,
        });
        await autoScroll(page);
        const html = await getPageHtml(page);
        await pushData({ title, url: request.loadedUrl, html });
        if (config.onVisitPage) {
          await config.onVisitPage({ page, pushData });
        }
        await enqueueLinks({
          globs: [config.match],
        });
      } catch (error) {
        log.error(`Error crawling ${request.loadedUrl}: ${error}`);
      }
    },
    maxRequestsPerCrawl: config.maxPagesToCrawl,
    // headless: false,
  });
  await crawler.run([config.url]);
}
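One caveat: the autoScroll loop only stops once the scrolled distance catches up with document.body.scrollHeight, so on a page that keeps appending content it could run for a very long time. Below is a minimal sketch of a capped variant of the same loop. The names Scrollable, ScrollResult, scrollToBottom and maxSteps are hypothetical and not part of the repo, and the scroll target is passed in as a small interface only so the logic can be tried outside a browser.

```typescript
// Hypothetical abstraction over the page, so the loop can run without a DOM.
interface Scrollable {
  scrollHeight(): number; // current total height of the content
  scrollBy(distance: number): void; // scroll down by `distance` pixels
}

interface ScrollResult {
  steps: number; // number of scroll steps performed
  reachedBottom: boolean; // false if the step cap was hit first
}

// Same idea as autoScroll, but gives up after `maxSteps` steps.
async function scrollToBottom(
  target: Scrollable,
  distance = 100,
  delayMs = 100,
  maxSteps = 200,
): Promise<ScrollResult> {
  let totalHeight = 0;
  for (let steps = 1; steps <= maxSteps; steps++) {
    target.scrollBy(distance);
    totalHeight += distance;
    // Re-read the height on every step, since lazy-loaded content may grow it.
    if (totalHeight >= target.scrollHeight()) {
      return { steps, reachedBottom: true };
    }
    // Pause between steps so the page has time to load more content.
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  return { steps: maxSteps, reachedBottom: false };
}
```

In the crawler, the target would wrap document.body.scrollHeight and window.scrollBy(0, distance). Since page.evaluate serializes the function it is given, this logic would have to be defined inside the evaluated callback rather than referenced from main.ts.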
If you think this is good enough for crawling, I hope it will be helpful for other users.
Thank you for your work, by the way!
I really appreciate it!
Thank you.
Thanks for your code. How can I add this code to the repo? Can you share the completed code?
Just go to main.ts, paste in the code above, and run it!
Thanks.
---Original--- From: "SUN YOUNG" Date: Sat, Dec 16, 2023 18:10 PM Subject: Re: [BuilderIO/gpt-crawler] Adds for autoScroll for crawling the multi pages? (Issue #30)
Just go to main.ts and copy and add above code! And run it!