
Improve validation: "Reclaiming failed request back to the list or queue. Received one or more errors"

Open rockyessel opened this issue 1 year ago • 5 comments

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/playwright (PlaywrightCrawler)

Issue description

Error message:

INFO  PlaywrightCrawler: Starting the crawl
crates.io: Rust Package Registry
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. Received one or more errors
 {"id":"S0zjxs8GeCqFpPk","url":"https://crates.io/crates/syn","retryCount":1}
crates.io: Rust Package Registry
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. Received one or more errors
 {"id":"S0zjxs8GeCqFpPk","url":"https://crates.io/crates/syn","retryCount":2}
crates.io: Rust Package Registry
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. Received one or more errors
 {"id":"S0zjxs8GeCqFpPk","url":"https://crates.io/crates/syn","retryCount":3}
crates.io: Rust Package Registry
ERROR PlaywrightCrawler: Request failed and reached maximum retries. Error: Received one or more errors
    at ArrayValidator.handle (C:\Users\essel_r\desktop\dev\backend\node_modules\@sapphire\shapeshift\src\validators\ArrayValidator.ts:102:17)
    at ArrayValidator.parse (C:\Users\essel_r\desktop\dev\backend\node_modules\@sapphire\shapeshift\src\validators\BaseValidator.ts:103:2)
    at RequestQueueClient.batchAddRequests (C:\Users\essel_r\desktop\dev\backend\node_modules\@crawlee\src\resource-clients\request-queue.ts:338:36)
    at RequestQueue.addRequests (C:\Users\essel_r\desktop\dev\backend\node_modules\@crawlee\src\storages\request_queue.ts:375:46)
    at attemptToAddToQueueAndAddAnyUnprocessed (C:\Users\essel_r\desktop\dev\backend\node_modules\@crawlee\src\internals\basic-crawler.ts:767:50)
    at PlaywrightCrawler.addRequests (C:\Users\essel_r\desktop\dev\backend\node_modules\@crawlee\src\internals\basic-crawler.ts:784:37) {"id":"S0zjxs8GeCqFpPk","url":"https://crates.io/crates/syn","method":"GET","uniqueKey":"https://crates.io/crates/syn"} 
INFO  PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  PlaywrightCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":1,"retryHistogram":[null,null,null,1],"requestAvgFailedDurationMillis":5422,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":1,"requestTotalDurationMillis":5422,"requestsTotal":1,"crawlerRuntimeMillis":33637}
INFO  PlaywrightCrawler: Error analysis: {"totalErrors":1,"uniqueErrors":1,"mostCommonErrors":["1x: Received one or more errors 
(C:\\Users\\essel_r\\desktop\\dev\\backend\\node_modules\\@sapphire\\shapeshift\\src\\validators\\ArrayValidator.ts:102:17)"]}  
POST /v1/api/packages 200 - - 36601.243 ms

I honestly don't know what happened. It was working perfectly when I tested it, but after I created a middleware to scrape the data before sending the response, I started getting this error. I'm using TypeScript, so I wondered whether TypeScript was the problem; I rewrote it in JS and got a different error about the file extension.

Code sample

import { PlaywrightCrawler, Dataset, Request } from 'crawlee';

export const crates_rpm = async (package_name: string) => {
    const crawler = new PlaywrightCrawler({
      requestHandler: async ({ request, page }) => {
        const title = await page.title();
        console.log(title);
        if (request.label === 'repository') {
          const commitText = await page.getByRole('listitem').filter({ hasText: 'commits' }).textContent();
          const stars = await page.$$eval('main', (github) => github.map((star) => star?.querySelector('div.Layout-sidebar > div.BorderGrid--spacious > div.hide-sm > div.BorderGrid-cell')?.querySelector('div > a > strong')?.textContent));
          const numberStrings = commitText?.match(/\d+/g);
          const commitCount = Number(numberStrings?.join(''));
          console.log(commitCount);
          await Dataset.pushData({
            ...request.userData,
            commitCount,
            stars: stars[0],
          });
        } else {
          await page.waitForFunction(() => document.querySelector('ul._list_181lzn > li'));
          const main_package = await page.$$eval('main', (package_lib) => {
            return package_lib.map((package_el) => {
              const package_name = package_el.querySelector('h1._heading_8qtlic > span')?.textContent;
              const keywords = Array.from(package_el.querySelectorAll('ul._keywords_8qtlic > li')).map((keyList) => keyList?.textContent?.trim());
              const toText = (element:HTMLElement) => element && element.textContent?.trim();
              const toLink = (element:HTMLElement) => element && element.getAttribute('src');
              const time_uploaded = package_el?.querySelector('time._date_rj9k29')?.getAttribute('datetime');
              const license_name = package_el?.querySelector('div._license_rj9k29 > span > a')?.textContent;
              const license_url = package_el?.querySelector('div._license_rj9k29 > span > a')?.getAttribute('href');
              const version = package_el?.querySelector('small')?.innerText;
              const description = package_el?.querySelector('div._description_8qtlic')?.textContent?.trim();
              const total_downloads = package_el.querySelector('span._num__align_87huyj')?.textContent?.trim();
              const license = { type: license_name?.trim(), url: license_url };
              const links = Array.from(package_el.querySelectorAll('div._links_rj9k29 > div')).map((link) => {
                return {
                  name: link.querySelector('div > h2._heading_rj9k29')?.innerHTML,
                  URL: link.querySelector('div._content_t2rnmm > a._link_t2rnmm')?.getAttribute('href'),
                };
              });
              const owners = Array.from(package_el?.querySelectorAll('ul._list_181lzn > li')).map((owners) => {
                const name = owners?.querySelector('a > span')?.textContent?.trim();
                const profile = owners?.querySelector('img')?.getAttribute('src');
                const user_url = owners?.querySelector('a')?.getAttribute('href');
                return { name, profile, user_url };
              });
              const package_obj = { keywords, package_name, time_uploaded, license, version, description, links, total_downloads, owners}
              return package_obj
            });
          });
          const requests = main_package.map(($package) =>
              new Request({
                url:  `'${$package.links?.find((github) => github?.name === "Repository")?.URL}'`,
                label: 'repository',
                userData: $package,
              })
          );
          await crawler.addRequests(requests);
          await Dataset.pushData(main_package[0]);
        }
      },
    });
    await crawler.run([`https://crates.io/crates/${package_name}`]);
    const raw_dataset = await Dataset.getData();
    const dataset = raw_dataset.items.pop();
    return dataset;
};

Package version

[email protected]

Node.js version

v18.12.1

Operating system

Windows

Apify platform

  • [X] Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

rockyessel avatar Jul 12 '23 06:07 rockyessel

The error comes from validation of the requests, so it has to be something about what you are passing to `crawler.addRequests`. I'm not sure why you are trying to create the `Request` objects yourself; try changing that bit to the following code:

const requests = main_package.map(($package) => ({
  // and double-check this is actually finding some links, as this can easily end up as `url: 'undefined'` which is not valid and would probably fail some validations too
  url: $package.links?.find((github) => github?.name === "Repository")?.URL,
  label: 'repository',
  userData: $package,
}));
await crawler.addRequests(requests);

In general, I would suspect this part; try logging what is in the `$package` object and what you are passing down to `crawler.addRequests()`.
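To make that concrete, here is a minimal sketch of that idea (the `PackageInfo` shape and the `toRepositoryRequests` helper are hypothetical, inferred from the code sample): build plain request objects and drop any entry whose Repository link was not found, so `url: undefined` never reaches `crawler.addRequests()`.

```typescript
// Hypothetical types matching the shape scraped in the issue's code sample.
interface PackageLink { name?: string | null; URL?: string | null }
interface PackageInfo { links?: PackageLink[]; [key: string]: unknown }

// Build plain request objects and filter out entries without a usable URL,
// instead of passing `url: undefined` (or a quoted string) to addRequests().
function toRepositoryRequests(packages: PackageInfo[]) {
  return packages
    .map((pkg) => ({
      url: pkg.links?.find((link) => link?.name === 'Repository')?.URL,
      label: 'repository',
      userData: pkg,
    }))
    .filter(
      (req): req is { url: string; label: string; userData: PackageInfo } =>
        typeof req.url === 'string' && req.url.startsWith('http'),
    );
}
```

The filter step is the key part: a missing Repository link silently becomes `undefined`, and stringifying it (as the template literal in the original code does) turns it into the literal text `'undefined'`, which fails request validation.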

cc @vladfrangu, that validation is obviously something to improve; we need to say what is wrong, not just that something is wrong :]

I did it in JS and got another error about the file extension.

Sounds like some ESM-related issues, maybe you just forgot to put the .js extension in your imports, as that is required with ESM.
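For reference, this is what ESM-style relative imports look like in a TypeScript project (the file name `scraper.js` here is only a placeholder):

```typescript
// With "type": "module" in package.json, relative imports must spell out
// the .js extension of the compiled output, even in TypeScript source:
import { crates_rpm } from './scraper.js'; // not './scraper' or './scraper.ts'
```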

B4nan avatar Jul 12 '23 07:07 B4nan

cc @vladfrangu, that validation is obviously something to improve, we need to say what is wrong, not just something is wrong :]

We do have in-depth messages, but for now they are only logged when calling util.inspect on the error (I'm trying to make them the default error.message when logging soon).


Also, that URL is probably never going to be valid, as it's wrapped in single quotes. Dropping those should solve it too.
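A quick way to see why: the template literal in the issue wraps the URL in literal quote characters, and the resulting string has no valid scheme, so URL parsing rejects it (a sketch, using the `syn` URL from the log):

```typescript
const repoUrl = 'https://github.com/dtolnay/syn';
// This reproduces what `url: `'${...}'`` in the issue's code produces:
const wrapped = `'${repoUrl}'`; // "'https://github.com/dtolnay/syn'"

// A leading quote means the string does not start with a valid scheme,
// so the WHATWG URL parser throws:
let valid = true;
try {
  new URL(wrapped);
} catch {
  valid = false;
}
// valid is now false, while new URL(repoUrl) parses without error.
```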

vladfrangu avatar Jul 12 '23 08:07 vladfrangu

Right, good catch, modified my suggested snippet to remove the quotes.

B4nan avatar Jul 12 '23 09:07 B4nan

The error comes from a validation of the

Thanks, you're right, but I don't understand why that was causing the error. I was trying to crawl another link, the GitHub repository of the package, so that I could get the stars, commits, and other information later on.

But if there is another way to do that, please help me. And thanks again. @vladfrangu

requestHandler: async ({ request, page }) => {
  if (request.label === 'repository') {
    const commitText = await page.getByRole('listitem').filter({ hasText: 'commits' }).textContent();
    const stars = await page.$$eval('main', (github) => github.map((star) => star?.querySelector('div.Layout-sidebar > div.BorderGrid--spacious > div.hide-sm > div.BorderGrid-cell')?.querySelector('div > a > strong')?.textContent));
    const numberStrings = commitText?.match(/\d+/g);
    const commitCount = Number(numberStrings?.join(''));
    console.log(commitCount);
    await Dataset.pushData({
      ...request.userData,
      commitCount,
      stars: stars[0],
    });
  }

  const requests = main_package.map(($package) =>
    new Request({
      url: `'${$package.links?.find((github) => github?.name === "Repository")?.URL}'`,
      label: 'repository',
      userData: $package,
    })
  );
  await crawler.addRequests(requests);
}

rockyessel avatar Jul 12 '23 20:07 rockyessel

Remove the single quotes from your url field! That should solve it.

vladfrangu avatar Jul 12 '23 21:07 vladfrangu