Consider refactoring on top of browserless/chrome

Open ikreymer opened this issue 3 years ago • 4 comments

The https://github.com/browserless/chrome image is fairly impressive and provides a great core dockerized browser, with many of the features needed for Browsertrix Crawler, including screencasting and the interactive debugger.

Need to evaluate further, but this looks like a really promising path: refactor Browsertrix Crawler to extend the browserless image and use its existing queuing capabilities instead of puppeteer-cluster. Browserless also comes with a full-featured REST API, e.g.: https://docs.browserless.io/docs/screencast.html

Would also provide a chance to rewrite Browsertrix Crawler in TypeScript!

ikreymer avatar Feb 18 '22 07:02 ikreymer

Been thinking about this project and I think it's a great idea. I think the first question to answer is how backwards compatible do we want it to be? Is the test coverage sufficient and do we need some new type of test to validate the replacement?

caj-larsson avatar Aug 10 '22 07:08 caj-larsson

@ikreymer Would such a move have any impact on Firefox support (from WARC/ZIM end-user perspective)?

kelson42 avatar Aug 10 '22 07:08 kelson42

As an FF user, it's weird that I didn't think about that. What we can do with relative ease is build a small shim on top of browserless/chrome that only exposes as much of the API as we wish to use. We name and specify this interface and provide an implementation based on browserless, but we keep the interface as small and clean as possible.

This way we can rely on browserless/chrome without setting the bar for Firefox usage impossibly high.
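
To make the shim idea concrete, here is a minimal sketch in TypeScript. Every name in it (`BrowserBackend`, `PageResult`, `FakeBackend`, `crawl`) is hypothetical and not part of Browsertrix Crawler or browserless; the fake backend only stands in for a real implementation that would wrap browserless/chrome (or another browser). The point is only that the crawl logic depends on the small interface, not on any specific browser image.

```typescript
// Hypothetical result of loading one page through the shim.
interface PageResult {
  url: string;
  status: number;
  links: string[];
}

// Hypothetical minimal interface: the only browser operations the crawler may use.
interface BrowserBackend {
  name: string;
  loadPage(url: string): Promise<PageResult>;
  close(): Promise<void>;
}

// Stand-in backend for illustration; a real one would talk to an actual browser.
class FakeBackend implements BrowserBackend {
  readonly name = "fake";
  closed = false;
  private site: Record<string, string[]>;

  constructor(site: Record<string, string[]>) {
    this.site = site;
  }

  async loadPage(url: string): Promise<PageResult> {
    const links = this.site[url];
    // Unknown URLs are reported as 404 with no outgoing links.
    return links === undefined
      ? { url, status: 404, links: [] }
      : { url, status: 200, links };
  }

  async close(): Promise<void> {
    this.closed = true;
  }
}

// Breadth-first crawl written only against BrowserBackend.
async function crawl(
  backend: BrowserBackend,
  seed: string,
  limit: number
): Promise<PageResult[]> {
  const seen = new Set<string>([seed]);
  const queue: string[] = [seed];
  const results: PageResult[] = [];
  try {
    while (queue.length > 0 && results.length < limit) {
      const url = queue.shift() as string;
      const page = await backend.loadPage(url);
      results.push(page);
      for (const link of page.links) {
        if (!seen.has(link)) {
          seen.add(link);
          queue.push(link);
        }
      }
    }
  } finally {
    // Always release the backend, even if a page load throws.
    await backend.close();
  }
  return results;
}
```

Swapping in a different browser would then mean writing another class that implements `BrowserBackend`, leaving `crawl` untouched.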

caj-larsson avatar Aug 10 '22 07:08 caj-larsson

> @ikreymer Would such a move have any impact on Firefox support (from WARC/ZIM end-user perspective)?

Just to be clear, we've always been using Chrome for crawling, so this shouldn't affect FF end-users any more than it currently does.

rgaudin avatar Aug 10 '22 10:08 rgaudin

Decided against using browserless/chrome as the base image. It doesn't actually help with the core crawling functionality we'll need, which is fairly specific and has a lot of requirements, and it would add a bunch of unnecessary things and result in a larger image. Instead, I think we do want to switch to Playwright and eliminate puppeteer-cluster. Will open a new issue for the things we do want to focus on.

ikreymer avatar Oct 21 '22 04:10 ikreymer