crawlee
Improve validation: "Reclaiming failed request back to the list or queue. Received one or more errors"
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/playwright (PlaywrightCrawler)
Issue description
Error message:
INFO PlaywrightCrawler: Starting the crawl
crates.io: Rust Package Registry
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. Received one or more errors
{"id":"S0zjxs8GeCqFpPk","url":"https://crates.io/crates/syn","retryCount":1}
crates.io: Rust Package Registry
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. Received one or more errors
{"id":"S0zjxs8GeCqFpPk","url":"https://crates.io/crates/syn","retryCount":2}
crates.io: Rust Package Registry
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. Received one or more errors
{"id":"S0zjxs8GeCqFpPk","url":"https://crates.io/crates/syn","retryCount":3}
crates.io: Rust Package Registry
ERROR PlaywrightCrawler: Request failed and reached maximum retries. Error: Received one or more errors
at ArrayValidator.handle (C:\Users\essel_r\desktop\dev\backend\node_modules\@sapphire\shapeshift\src\validators\ArrayValidator.ts:102:17)
at ArrayValidator.parse (C:\Users\essel_r\desktop\dev\backend\node_modules\@sapphire\shapeshift\src\validators\BaseValidator.ts:103:2)
at RequestQueueClient.batchAddRequests (C:\Users\essel_r\desktop\dev\backend\node_modules\@crawlee\src\resource-clients\request-queue.ts:338:36)
at RequestQueue.addRequests (C:\Users\essel_r\desktop\dev\backend\node_modules\@crawlee\src\storages\request_queue.ts:375:46)
at attemptToAddToQueueAndAddAnyUnprocessed (C:\Users\essel_r\desktop\dev\backend\node_modules\@crawlee\src\internals\basic-crawler.ts:767:50)
at PlaywrightCrawler.addRequests (C:\Users\essel_r\desktop\dev\backend\node_modules\@crawlee\src\internals\basic-crawler.ts:784:37) {"id":"S0zjxs8GeCqFpPk","url":"https://crates.io/crates/syn","method":"GET","uniqueKey":"https://crates.io/crates/syn"}
INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO PlaywrightCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":1,"retryHistogram":[null,null,null,1],"requestAvgFailedDurationMillis":5422,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":1,"requestTotalDurationMillis":5422,"requestsTotal":1,"crawlerRuntimeMillis":33637}
INFO PlaywrightCrawler: Error analysis: {"totalErrors":1,"uniqueErrors":1,"mostCommonErrors":["1x: Received one or more errors
(C:\\Users\\essel_r\\desktop\\dev\\backend\\node_modules\\@sapphire\\shapeshift\\src\\validators\\ArrayValidator.ts:102:17)"]}
POST /v1/api/packages 200 - - 36601.243 ms
I truly don't know what happened; it was working perfectly when I tested it, but once I created a middleware to scrape the data before sending the response, I started getting this error. I'm using TypeScript, so I thought maybe TypeScript was the problem (even though it is TypeScript); I rewrote it in JS and got another error about the file extension.
Code sample
import { PlaywrightCrawler, Dataset, Request } from 'crawlee';

export const crates_rpm = async (package_name: string) => {
    const crawler = new PlaywrightCrawler({
        requestHandler: async ({ request, page }) => {
            const title = await page.title();
            console.log(title);
            if (request.label === 'repository') {
                const commitText = await page.getByRole('listitem').filter({ hasText: 'commits' }).textContent();
                const stars = await page.$$eval('main', (github) => github.map((star) => star?.querySelector('div.Layout-sidebar > div.BorderGrid--spacious > div.hide-sm > div.BorderGrid-cell')?.querySelector('div > a > strong')?.textContent));
                const numberStrings = commitText?.match(/\d+/g);
                const commitCount = Number(numberStrings?.join(''));
                console.log(commitCount);
                await Dataset.pushData({
                    ...request.userData,
                    commitCount,
                    stars: stars[0],
                });
            } else {
                await page.waitForFunction(() => document.querySelector('ul._list_181lzn > li'));
                const main_package = await page.$$eval('main', (package_lib) => {
                    return package_lib.map((package_el) => {
                        const package_name = package_el.querySelector('h1._heading_8qtlic > span')?.textContent;
                        const keywords = Array.from(package_el.querySelectorAll('ul._keywords_8qtlic > li')).map((keyList) => keyList?.textContent?.trim());
                        const toText = (element: HTMLElement) => element && element.textContent?.trim();
                        const toLink = (element: HTMLElement) => element && element.getAttribute('src');
                        const time_uploaded = package_el?.querySelector('time._date_rj9k29')?.getAttribute('datetime');
                        const license_name = package_el?.querySelector('div._license_rj9k29 > span > a')?.textContent;
                        const license_url = package_el?.querySelector('div._license_rj9k29 > span > a')?.getAttribute('href');
                        const version = package_el?.querySelector('small')?.innerText;
                        const description = package_el?.querySelector('div._description_8qtlic')?.textContent?.trim();
                        const total_downloads = package_el.querySelector('span._num__align_87huyj')?.textContent?.trim();
                        const license = { type: license_name?.trim(), url: license_url };
                        const links = Array.from(package_el.querySelectorAll('div._links_rj9k29 > div')).map((link) => {
                            return {
                                name: link.querySelector('div > h2._heading_rj9k29')?.innerHTML,
                                URL: link.querySelector('div._content_t2rnmm > a._link_t2rnmm')?.getAttribute('href'),
                            };
                        });
                        const owners = Array.from(package_el?.querySelectorAll('ul._list_181lzn > li')).map((owner) => {
                            const name = owner?.querySelector('a > span')?.textContent?.trim();
                            const profile = owner?.querySelector('img')?.getAttribute('src');
                            const user_url = owner?.querySelector('a')?.getAttribute('href');
                            return { name, profile, user_url };
                        });
                        const package_obj = { keywords, package_name, time_uploaded, license, version, description, links, total_downloads, owners };
                        return package_obj;
                    });
                });
                const requests = main_package.map(($package) =>
                    new Request({
                        url: `'${$package.links?.find((github) => github?.name === "Repository")?.URL}'`,
                        label: 'repository',
                        userData: $package,
                    })
                );
                await crawler.addRequests(requests);
                await Dataset.pushData(main_package[0]);
            }
        },
    });
    await crawler.run([`https://crates.io/crates/${package_name}`]);
    const raw_dataset = await Dataset.getData();
    const dataset = raw_dataset.items.pop();
    return dataset;
};
Package version
Node.js version
v18.12.1
Operating system
Windows
Apify platform
- [X] Tick me if you encountered this issue on the Apify platform
I have tested this on the next release
No response
Other context
No response
The error comes from the validation of the requests, so it has to be something about what you are passing to `crawler.addRequests()`. Not sure why you are trying to create the `Request` objects yourself; try to change that bit to the following code:
const requests = main_package.map(($package) => ({
    // and double-check this is actually finding some links, as this can easily end up
    // as `url: 'undefined'`, which is not valid and would probably fail some validations too
    url: $package.links?.find((github) => github?.name === "Repository")?.URL,
    label: 'repository',
    userData: $package,
}));

await crawler.addRequests(requests);
In general, I would suspect this part; try to log what is in the `$package` object and what you are passing down to `crawler.addRequests()`.
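For example (a quick debugging sketch you could drop in right before the `addRequests()` call, reusing the names from your snippet):

for (const $package of main_package) {
    const url = $package.links?.find((github) => github?.name === 'Repository')?.URL;
    // If this prints `undefined` (or a string wrapped in extra quotes),
    // the request validation will reject it.
    console.log('computed url:', url);
    console.log('package:', JSON.stringify($package, null, 2));
}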
cc @vladfrangu, that validation is obviously something to improve, we need to say what is wrong, not just that something is wrong :]
> I did it in JS and got another error about the file extension.

Sounds like some ESM-related issue; maybe you just forgot to put the `.js` extension in your imports, as that is required with ESM.
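For example, with `"type": "module"` in package.json, relative imports need an explicit extension even in TypeScript source (`./crates` below is just a placeholder module name):

// In TypeScript ESM projects you still write `.js`, because the
// import path refers to the compiled output, not the `.ts` source.
import { crates_rpm } from './crates.js'; // works under ESM
// import { crates_rpm } from './crates'; // ERR_MODULE_NOT_FOUND under ESM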
> cc @vladfrangu, that validation is obviously something to improve, we need to say what is wrong, not just that something is wrong :]

We do have in-depth messages, but for now they are only logged when calling `util.inspect` on the error (I'm trying to get them to become the default message when logging `error.message` soon).
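If you want to see them now, something like this should work (a sketch using Node's built-in `util.inspect` with crawlee's `failedRequestHandler`):

import { inspect } from 'node:util';
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page }) => { /* ... */ },
    // Called once a request has exhausted its retries; `util.inspect`
    // expands the nested validation errors that `error.message` currently hides.
    failedRequestHandler({ request }, error) {
        console.error(`Request ${request.url} failed:`, inspect(error, { depth: null }));
    },
});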
Also, that URL is probably never going to be valid, as it's wrapped in single quotes. Dropping those should solve it too.
Right, good catch, modified my suggested snippet to remove the quotes.
> The error comes from the validation of the requests

Thanks, you're right, but I don't understand why that was causing the error, because I was trying to crawl another link, the GitHub repository of the package, so I can get the stars, commits, and other information later on. But if there is another way to do that, please help me. And thanks again. @vladfrangu

This is the part of my request handler:
requestHandler: async ({ request, page }) => {
    if (request.label === 'repository') {
        const commitText = await page.getByRole('listitem').filter({ hasText: 'commits' }).textContent();
        const stars = await page.$$eval('main', (github) => github.map((star) => star?.querySelector('div.Layout-sidebar > div.BorderGrid--spacious > div.hide-sm > div.BorderGrid-cell')?.querySelector('div > a > strong')?.textContent));
        const numberStrings = commitText?.match(/\d+/g);
        const commitCount = Number(numberStrings?.join(''));
        console.log(commitCount);
        await Dataset.pushData({
            ...request.userData,
            commitCount,
            stars: stars[0],
        });
    }
    const requests = main_package.map(($package) =>
        new Request({
            url: `'${$package.links?.find((github) => github?.name === "Repository")?.URL}'`,
            label: 'repository',
            userData: $package,
        })
    );
    await crawler.addRequests(requests);
}
Remove the single quotes from your `url` field! That should solve it.
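Combining both suggestions, the mapping could look like this (a sketch; it also skips packages where no repository link was found, since `url: undefined` would fail validation as well):

const requests = main_package.flatMap(($package) => {
    const url = $package.links?.find((github) => github?.name === 'Repository')?.URL;
    // No quoting around the URL, and skip entries where the link was not found,
    // so a single missing repository does not fail the whole batch.
    if (!url) return [];
    return [{ url, label: 'repository', userData: $package }];
});

await crawler.addRequests(requests);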