
Playwright requires installation via `npx playwright install`

Open mnmkng opened this issue 1 year ago • 19 comments

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/playwright (PlaywrightCrawler)

Issue description

When you install a project with the following package.json, it fails on first start and asks you to run npx playwright install.

It's not a great first experience to get a huge error on first run, so we should either:

  1. ensure that Playwright browsers are installed together with @crawlee/playwright or
  2. document everywhere, most importantly on the Crawlee homepage, that this command needs to be run before Playwright can be started.

It's likely that to reproduce this, you first need to run npx playwright uninstall to get into a "new user state".

This probably also impacts all our CLI templates.

Code sample

{
    "name": "my-module",
    "version": "0.0.1",
    "dependencies": {
        "crawlee": "^3.0.0",
        "playwright": "*"
    },
    "type": "module",
    "scripts": {
        "start": "node main.js"
    },
    "author": "Me!"
}

Package version

3.7.1

Node.js version

v18.12.1

Operating system

MacOS

Apify platform

  • [ ] Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

no

Other context

No response

mnmkng avatar Jan 08 '24 16:01 mnmkng

So you can reproduce this from some template, or by installing crawlee into an empty project? Because the templates are working fine on my end.

I believe the browsers are installed via postinstall hook nowadays, cc @vladfrangu

B4nan avatar Jan 08 '24 16:01 B4nan

Yep, both apify and crawlee templates have a postinstall hook (that also ensures it won't run in our docker images, but will run everywhere else)

We should probably document the CLI command for users who are upgrading to a newer playwright or are making new projects without our CLI. We could even just add a command to the CLI to auto-fix old projects (npx crawlee migrate-new-playwright?)

vladfrangu avatar Jan 08 '24 16:01 vladfrangu

Hmm, probably not worth introducing a new command just to wrap an existing playwright command that's documented in the error.

So all the "default" and "new user" paths of installing crawlee are covered with this then? And I was just unlucky because I reinstalled an old project?

mnmkng avatar Jan 08 '24 16:01 mnmkng

This is fixed for any users who create their project via apify create or crawlee create. Otherwise, the postinstall hook needs to be added into the project (which is why I suggested making a cmd for it, to automate it for users)

vladfrangu avatar Jan 08 '24 16:01 vladfrangu

Can't we have it on the @crawlee/playwright package?

B4nan avatar Jan 08 '24 16:01 B4nan

Well...we install the package all the time, so running the command when people don't use playwright isn't ideal either... Not sure what the best solution is

vladfrangu avatar Jan 08 '24 16:01 vladfrangu

Hmm but in the end, we want this to work with the crawlee package too, same for puppeteer. The browsers used to be installed before too, right?

B4nan avatar Jan 08 '24 18:01 B4nan

Can we have some env var to skip the downloads in the postinstall script? I'd probably just install them all the time and allow opting out, that was the previous behavior before all this mess happened.
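
A minimal sketch of the "install by default, allow opting out" idea for a postinstall script. The variable name CRAWLEE_SKIP_BROWSER_INSTALL and both helper functions are hypothetical, chosen only for illustration; they are not an existing Crawlee or Playwright option. The only real command here is npx playwright install, the one Playwright prints in its error box.

```javascript
// Hypothetical opt-out check for a postinstall script.
// CRAWLEE_SKIP_BROWSER_INSTALL is an assumed name, not a real Crawlee setting.
function shouldInstallBrowsers(env) {
  const flag = String(env.CRAWLEE_SKIP_BROWSER_INSTALL ?? '').trim().toLowerCase();
  // Install by default; skip only when the flag is explicitly truthy.
  return !['1', 'true', 'yes'].includes(flag);
}

// Returns the command to run, or null when the download should be skipped.
function postinstallCommand(env) {
  return shouldInstallBrowsers(env) ? 'npx playwright install' : null;
}

// Usage inside the actual script (not executed in this sketch):
//   import { execSync } from 'node:child_process';
//   const cmd = postinstallCommand(process.env);
//   if (cmd) execSync(cmd, { stdio: 'inherit' });
```

Keeping the decision in a pure function makes the opt-out easy to check without triggering a browser download.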

B4nan avatar Jan 08 '24 18:01 B4nan

I am getting the dreaded:

╔═════════════════════════════════════════════════════════════════════════╗
║ Looks like Playwright Test or Playwright was just installed or updated. ║
║ Please run the following command to download new browsers:              ║
║                                                                         ║
║     npx playwright install                                              ║
║                                                                         ║
║ <3 Playwright Team                                                      ║
╚═════════════════════════════════════════════════════════════════════════╝

With the only code change being adding a new express route. I also defined 2 request queues following some internet skim reading. Running locally everything works as expected, but this issue is occurring via GCP Cloud Run.

I am using apify/actor-node-playwright-chrome:18 in my Dockerfile.

My logs show this error:

browserType.launchPersistentContext: Executable doesn't exist at /home/myuser/pw-browsers/chromium-1091/chrome-linux/chrome

and having pulled down the image and running locally via docker I can confirm that the only browsers present in pw-browsers are the following:

# cd pw-browsers
# ls
chrome  chromium-1097  ffmpeg-1009
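
The mismatch can be made explicit with a small diagnostic sketch: it pulls the browser build directory out of the error message and checks it against the directory listing. The function names expectedBrowserDir and diagnose are made up for this illustration, and the regular expression assumes the path contains a pw-browsers segment, as it does in the log above.

```javascript
// Extract the browser build directory (e.g. "chromium-1091") from a Playwright
// "Executable doesn't exist at ..." message. Returns null if none is found.
function expectedBrowserDir(errorMessage) {
  const match = /pw-browsers\/([^/\s]+)\//.exec(errorMessage);
  return match ? match[1] : null;
}

// Compare the build the installed playwright package expects with what the image ships.
function diagnose(errorMessage, installedDirs) {
  const expected = expectedBrowserDir(errorMessage);
  return { expected, installed: expected !== null && installedDirs.includes(expected) };
}

const message =
  "browserType.launchPersistentContext: Executable doesn't exist at " +
  '/home/myuser/pw-browsers/chromium-1091/chrome-linux/chrome';
const result = diagnose(message, ['chrome', 'chromium-1097', 'ffmpeg-1009']);
console.log(result); // expected build is chromium-1091, which is not in the listing
```

A differing build number (1091 expected, 1097 present) suggests that the playwright package installed in the project and the one the image was built with are different versions.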

New route:

app.get("/lemon", async (req, res, message) => {
  
  const targetLink = req.query.link;
  if (!targetLink) {
    throw new Error('The link query parameter is required in order to know which lemon to crawl.');
  }

  const startUrl = `${targetLink}`;
  console.log(`We've received the lemon to crawl as: ${startUrl}`);


  const crawler = new PlaywrightCrawler(
    {
      requestHandler: router.getHandler('TANGY_LEMON'),
      minConcurrency: 5,
      requestQueue: lemonRequestQueue
    },
    new Configuration({
      persistStorage: false,
    })
  );

  await crawler.run([startUrl]);

  const crawlerOutput = await crawler.getData();  

  return res.send(crawlerOutput);
});

Any advice on how to resolve or if this is unrelated would be amazing. Before I had simply followed the documentation instructions with a top-level express.js route. I am using a specific handler needed only for lemon as the top-level route is scraping a more broad tree of pages where the final outcome is lemon but I need to be able to request a specific crawl of a lemon using my route. Also please don't bully the choice of a query param here, quick & lazy was the thought.

RowanAldean avatar Jan 30 '24 20:01 RowanAldean

Sounds like your playwright version doesn't match the one we use when building images. You should specify it in the image version tag (so you'd have apify/actor-node-playwright-chrome:18-1.40.0 for playwright 1.40.0, as an example). That should solve the issue, but please follow up if it doesn't.
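
As a rough sketch, the tag can be derived from the Node.js major version and an exact playwright version, following the single example tag quoted above. The helper name playwrightImageTag is hypothetical, and the tag format is assumed from that one example rather than taken from a documented scheme.

```javascript
// Build a tag of the form <nodeMajor>-<playwrightVersion>, as in the example above.
// Ranges such as "^1.40.0" or "*" are rejected, because a tag needs one exact version.
function playwrightImageTag(nodeMajor, playwrightVersion) {
  if (!/^\d+\.\d+\.\d+$/.test(playwrightVersion)) {
    throw new Error(`Expected an exact version like 1.40.0, got "${playwrightVersion}"`);
  }
  return `apify/actor-node-playwright-chrome:${nodeMajor}-${playwrightVersion}`;
}

console.log(playwrightImageTag(18, '1.40.0'));
```

The version passed in should be the one actually installed in node_modules, not the range written in package.json.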

vladfrangu avatar Jan 30 '24 20:01 vladfrangu

In fairness, I was using a wildcard for the playwright version in my package.json - I fixed it to ^1.40.0 as is the case for @playwright/test and still not resolved :(

Error message is the same as above regarding missing browser chromium-1091

RowanAldean avatar Jan 30 '24 20:01 RowanAldean

If you use a range like that it'll still install the latest version that matches, you'd need to either use ~ for the range or a fixed version 😅
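
To illustrate the difference: under npm's semver rules, ^1.40.0 accepts any 1.x.y at or above 1.40.0, while ~1.40.0 accepts only 1.40.x. The sketch below is a simplified matcher written for this explanation, not the real semver package; it assumes plain x.y.z versions with a major version of 1 or higher and ignores pre-release tags.

```javascript
// Simplified range check, for illustration only (not the npm "semver" package).
function parseVersion(version) {
  const [major, minor, patch] = version.split('.').map(Number);
  return { major, minor, patch };
}

function satisfies(version, range) {
  const op = /^[\^~]/.test(range) ? range[0] : '';
  const v = parseVersion(version);
  const r = parseVersion(op ? range.slice(1) : range);
  if (op === '^') {
    // Caret: same major, and minor.patch at or above the base.
    return v.major === r.major
      && (v.minor > r.minor || (v.minor === r.minor && v.patch >= r.patch));
  }
  if (op === '~') {
    // Tilde: same major and minor, patch at or above the base.
    return v.major === r.major && v.minor === r.minor && v.patch >= r.patch;
  }
  // No operator: exact match only.
  return v.major === r.major && v.minor === r.minor && v.patch === r.patch;
}

console.log(satisfies('1.45.0', '^1.40.0')); // true: a later 1.x release still matches
console.log(satisfies('1.45.0', '~1.40.0')); // false: tilde stays within 1.40.x
```

This is why a caret range can still pull in a playwright release newer than the one the Docker image was built with.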

If you're able to make a reproducible sample in a repository that'd help a bunch too!

vladfrangu avatar Jan 30 '24 20:01 vladfrangu

I will move this to the relevant thread, having narrowed down the problem to the Dockerfile, or at least to this element of my pipeline.

Confirmed by simply rebuilding and redeploying an unchanged project (i.e., expected to be equivalent to a rollback) and still getting the same error about the missing browser chromium-1091. I will now try pinning the playwright version, using the latest apify docker image, or both (please don't make me create and serve my own base image... such overkill, as suggested in the above thread by another user).

RowanAldean avatar Jan 30 '24 21:01 RowanAldean

I am getting the same error

2024-06-30T20:00:31.110Z Error occurred browserType.launch: Executable doesn't exist at /home/myuser/pw-browsers/chromium-1117/chrome-linux/chrome
2024-06-30T20:00:31.112Z ╔═════════════════════════════════════════════════════════════════════════╗
2024-06-30T20:00:31.114Z ║ Looks like Playwright Test or Playwright was just installed or updated. ║
2024-06-30T20:00:31.115Z ║ Please run the following command to download new browsers:              ║
2024-06-30T20:00:31.117Z ║                                                                         ║
2024-06-30T20:00:31.120Z ║     npx playwright install                                              ║
2024-06-30T20:00:31.121Z ║                                                                         ║
2024-06-30T20:00:31.123Z ║ <3 Playwright Team                                                      ║
2024-06-30T20:00:31.125Z ╚═════════════════════════════════════════════════════════════════════════╝
2024-06-30T20:00:31.126Z     at scheduleLadder (/home/myuser/dist/main.js:295:34)
2024-06-30T20:00:31.128Z     at main (/home/myuser/dist/main.js:375:26)
2024-06-30T20:00:31.130Z     at /home/myuser/async file:/home/myuser/dist/main.js:378:1 {
2024-06-30T20:00:31.133Z   name: 'Error'
2024-06-30T20:00:31.135Z }

hengliu0919 avatar Jun 30 '24 20:06 hengliu0919