Multi-browser support: Is switching from puppeteer to playwright feasible?

Open pirate opened this issue 4 years ago • 2 comments

https://github.com/microsoft/playwright-python is almost call-for-call compatible with puppeteer, and gives you access to Firefox and Webkit as well.

Certainly not a priority by any means, but would you hypothetically be open to a PR for this?

pirate avatar Feb 03 '21 01:02 pirate

Hm, that might not be a bad idea at this point! The initial intention was to allow support for specific versions of browsers (e.g. specific versions of Chrome), but that is probably less important. Playwright comes with patched versions of Firefox and Safari (WebKit) to ensure the functionality is aligned, so it might make sense. I suppose the Docker image could ship with all 3 browsers, or just one at a time? It looks like headful mode is also well supported, which will be key to capturing the many sites that don't quite work in headless mode, and to avoiding other quirks of headless browsing. So far, the puppeteer/playwright API needed is pretty minimal, and it looks like there is a way to access the raw CDP session as well. I think the answer is 'yes', but let's discuss more what you had in mind : ) Also, I would probably just use the JS version of playwright, as this is already using the JS puppeteer.
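For illustration, a minimal sketch of what picking one of the three engines and grabbing a raw CDP session could look like with the JS version of Playwright. The helper names `resolveEngineName` and `launchEngine` are hypothetical and not part of browsertrix-crawler; the sketch assumes the `playwright` package and its browsers are installed, and that a CDP session is only requested on Chromium.

```javascript
// Sketch only: `resolveEngineName` and `launchEngine` are hypothetical helpers,
// not browsertrix-crawler code.
const ENGINES = ["chromium", "firefox", "webkit"];

// Normalize a user-supplied engine name and reject anything Playwright does not ship.
function resolveEngineName(name = "chromium") {
  const key = String(name).toLowerCase();
  if (!ENGINES.includes(key)) {
    throw new Error(`Unknown browser engine: ${name}`);
  }
  return key;
}

// Launch the chosen engine in headful mode by default and open one page.
async function launchEngine(name, { headless = false } = {}) {
  const engine = resolveEngineName(name);
  // Imported lazily so resolveEngineName stays usable without Playwright installed.
  const playwright = await import("playwright");
  const browser = await playwright[engine].launch({ headless });
  const context = await browser.newContext();
  const page = await context.newPage();
  // Raw CDP sessions are a Chromium-only feature, so skip them elsewhere.
  const cdp = engine === "chromium" ? await context.newCDPSession(page) : null;
  return { browser, context, page, cdp };
}
```

A caller might use `const { page, browser } = await launchEngine("firefox")` and later `await browser.close()`. Whether a Docker image ships all three engines or only one would decide which names are actually launchable at runtime.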

ikreymer avatar Feb 03 '21 04:02 ikreymer

Ah, almost forgot! A blocker is that this is using puppeteer-cluster, which has not been ported to support playwright, though in theory that would also be possible: thomasdondorf/puppeteer-cluster#326

This is needed for parallelization of the crawl.
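To make the parallelization requirement concrete, here is a minimal sketch of a generic promise pool that caps how many pages are visited at once. It is not puppeteer-cluster's API; `crawlInParallel` and the `visit` callback are hypothetical names, and `visit` stands in for whatever loads and captures a single URL.

```javascript
// Hypothetical helper: runs `visit(url)` for every URL with at most `concurrency`
// calls in flight. This is a generic promise pool, not puppeteer-cluster's API.
async function crawlInParallel(urls, visit, concurrency = 4) {
  const results = new Array(urls.length);
  let next = 0; // index of the next URL that has not been claimed yet

  // Each worker keeps claiming the next unvisited URL until none are left.
  async function worker() {
    while (next < urls.length) {
      const index = next++;
      try {
        results[index] = { url: urls[index], value: await visit(urls[index]) };
      } catch (error) {
        // Record the failure instead of aborting the whole crawl.
        results[index] = { url: urls[index], error };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, urls.length));
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results; // same order as `urls`
}
```

With Playwright, `visit` could open a new page in a shared browser context, navigate to the URL, and close the page afterwards. A real crawl would also need queueing of newly discovered links and retries, which this sketch leaves out.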

ikreymer avatar Feb 03 '21 05:02 ikreymer

Moved to Playwright in https://github.com/webrecorder/browsertrix-crawler/commit/82808d813321c6c5860a529414e20e2638887b31

tw4l avatar Mar 21 '23 15:03 tw4l

Amazing work! Excited for this update

pirate avatar Mar 21 '23 16:03 pirate

Does this affect custom drivers? Will they need to be converted to Playwright?

karenhanson avatar Mar 21 '23 16:03 karenhanson

@karenhanson Thanks for flagging this. Yes: when the version of the crawler with Playwright is released, it will be a breaking change and custom drivers will need to be converted to Playwright. Current drivers will continue to work with prior versions of the crawler. When we get closer to a release with Playwright, we will communicate this and can give guidance on migrating existing Puppeteer scripts - the Puppeteer and Playwright APIs are largely similar, but there are a few notable differences, and we've been refactoring the crawler, which may introduce some additional changes.

tw4l avatar Mar 21 '23 16:03 tw4l

@karenhanson hopefully the conversion will be mostly straightforward, as there's an official guide for it: https://playwright.dev/docs/puppeteer and also an automated conversion tool: https://github.com/checkly/puppeteer-to-playwright

We should probably add this to #224 as we need general docs for drivers.

ikreymer avatar Mar 21 '23 16:03 ikreymer

Thank you both so much - I'll take a look at this.

karenhanson avatar Mar 21 '23 16:03 karenhanson