browsertrix-crawler

Switch to using JS WACZ

Open ikreymer opened this issue 1 year ago • 3 comments

  • Replaces dependencies on py-wacz with importing js-wacz natively
  • Writes pages to either pages.jsonl (if seed) or extraPages.jsonl (if non-seed)
  • Uses streams for writing pages
  • Replaces --generateCDX with just moving tmp-cdx -> indexes
  • Removes any dependencies on Python
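
For anyone following along, here is a rough sketch of the seed / non-seed page routing with append streams. It is not the code in this PR: `PageEntry`, `PageWriter`, `pagesFileFor`, and `toPageLine` are hypothetical names, the record fields are made up for illustration, and it leaves out any header record the pages file format may require.

```typescript
import { createWriteStream, mkdirSync, WriteStream } from "node:fs";
import { join } from "node:path";

// Hypothetical page record; the real crawler's fields may differ.
interface PageEntry {
  id: string;
  url: string;
  title?: string;
}

// Seed pages go to pages.jsonl, everything else to extraPages.jsonl.
function pagesFileFor(isSeed: boolean): string {
  return isSeed ? "pages.jsonl" : "extraPages.jsonl";
}

// One JSON object per line (JSONL).
function toPageLine(page: PageEntry): string {
  return JSON.stringify(page) + "\n";
}

// Keeps one append stream per file so pages are written as they arrive
// instead of being collected in memory first.
class PageWriter {
  private readonly pagesDir: string;
  private readonly streams = new Map<string, WriteStream>();

  constructor(pagesDir: string) {
    this.pagesDir = pagesDir;
    mkdirSync(pagesDir, { recursive: true });
  }

  write(page: PageEntry, isSeed: boolean): void {
    const name = pagesFileFor(isSeed);
    let stream = this.streams.get(name);
    if (!stream) {
      // Open lazily, in append mode, the first time a file is needed.
      stream = createWriteStream(join(this.pagesDir, name), { flags: "a" });
      this.streams.set(name, stream);
    }
    stream.write(toPageLine(page));
  }

  close(): void {
    this.streams.forEach((stream) => stream.end());
  }
}
```

A caller would create one `PageWriter` per crawl, call `write(page, isSeed)` as each page finishes, and call `close()` before packaging.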

Fixes #484

Pending more testing and a js-wacz release, using @tw4l's branch for now!

ikreymer avatar Mar 22 '24 02:03 ikreymer

Also noticing that js-wacz is logging strings to stdout, which breaks our logging format. Might want to see what we can do about that. I suppose if we call it as a subprocess via the cli we could capture the stdout and write it into the details of a crawler log line...
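
A minimal sketch of the subprocess idea, in case it helps the discussion: run a command, capture its stdout and stderr, and fold them into the `details` of a single structured log line. The `LogLine` shape and the `runAndWrap` helper are assumptions for illustration, not the crawler's actual log schema, and the command is left generic because no js-wacz CLI arguments are confirmed here.

```typescript
import { spawnSync } from "node:child_process";

// Illustrative log shape; field names are assumptions, not the real schema.
interface LogLine {
  timestamp: string;
  logLevel: "info" | "error";
  context: string;
  message: string;
  details: { exitCode: number | null; stdout: string[]; stderr: string[] };
}

// Split captured output into non-empty lines.
function toLines(output: string | null | undefined): string[] {
  return (output ?? "").split("\n").filter((line) => line.length > 0);
}

// Runs a command synchronously and wraps whatever it printed into one
// structured log line, so stray strings never reach our own stdout.
function runAndWrap(cmd: string, args: string[]): LogLine {
  const res = spawnSync(cmd, args, { encoding: "utf8" });
  const ok = res.status === 0;
  return {
    timestamp: new Date().toISOString(),
    logLevel: ok ? "info" : "error",
    context: "wacz",
    message: ok ? "WACZ subprocess finished" : "WACZ subprocess failed",
    details: {
      exitCode: res.status,
      stdout: toLines(res.stdout),
      stderr: toLines(res.stderr),
    },
  };
}
```

The result can then be emitted with `console.log(JSON.stringify(line))`, which keeps one JSON object per output line. A long-running job would probably want the asynchronous `spawn` instead, so the crawler is not blocked while the WACZ is written.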

tw4l avatar Mar 22 '24 13:03 tw4l

TODO:

  • Add WACZ validation (not yet supported in js-wacz)
  • Make CDXJ handling more memory-efficient in js-wacz (currently keeps all pages in memory, may OOM with large crawls)
  • Possibly move CDXJ line handling in js-wacz from bin/cli.js into WACZ class
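
On the memory point above, one possible direction is to handle CDXJ input a line at a time rather than collecting everything first. The generator below is only a sketch of that idea, not js-wacz code: `linesFromChunks` is a hypothetical helper that turns arbitrary text chunks into complete lines while holding just the current partial line.

```typescript
// Yields complete lines from a sequence of text chunks. Only the trailing
// partial line is buffered between chunks, so memory use stays bounded by
// the longest line rather than by the size of the whole index.
function* linesFromChunks(chunks: Iterable<string>): Generator<string> {
  let remainder = "";
  for (const chunk of chunks) {
    const parts = (remainder + chunk).split("\n");
    // The last piece may be an incomplete line; carry it to the next chunk.
    remainder = parts.pop() ?? "";
    for (const line of parts) {
      if (line.length > 0) yield line;
    }
  }
  // Flush a final line that had no trailing newline.
  if (remainder.length > 0) yield remainder;
}
```

Each yielded line can be written to its destination immediately and then dropped. In Node, `readline.createInterface` over `fs.createReadStream` gives the same line-at-a-time behaviour asynchronously for files on disk.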

tw4l avatar Mar 22 '24 20:03 tw4l

Other existing difference: the warcio.js cdx contains status code as number instead of as string, caught by current test failures.

ikreymer avatar Jul 03 '24 23:07 ikreymer

> Other existing difference: the warcio.js cdx contains status code as number instead of as string, caught by current test failures.

I'm wondering if the solution here isn't just to change the tests to expect a number. Looking at the CDXJ specification, it looks like examples also use an int for status code, e.g.: https://specs.webrecorder.net/cdxj/0.1.0/#example

I would assume ReplayWeb.page can handle the status as either a string or a number, since our spec has said one thing while the crawler has been doing another? Of course, that's important to verify.
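
To make the string-versus-number question concrete, here is a small sketch of a reader that accepts either form. It assumes a CDXJ line is a key, a timestamp, and a JSON object separated by single spaces; `parseCdxjLine` and `statusAsString` are hypothetical helpers and not how ReplayWeb.page actually reads the index.

```typescript
interface CdxjEntry {
  key: string;
  timestamp: string;
  fields: Record<string, unknown>;
}

// Splits "<key> <timestamp> <json>" on the first two spaces only, since the
// JSON part may itself contain spaces.
function parseCdxjLine(line: string): CdxjEntry {
  const first = line.indexOf(" ");
  const second = line.indexOf(" ", first + 1);
  if (first < 0 || second < 0) {
    throw new Error("not a CDXJ line: " + line);
  }
  return {
    key: line.slice(0, first),
    timestamp: line.slice(first + 1, second),
    fields: JSON.parse(line.slice(second + 1)),
  };
}

// Returns the status as a string whether it was stored as 200 or "200".
function statusAsString(entry: CdxjEntry): string | undefined {
  const status = entry.fields.status;
  if (typeof status === "number") return String(status);
  if (typeof status === "string") return status;
  return undefined;
}
```

A test written against `statusAsString` would pass for both encodings, which is one way to avoid pinning the tests to either representation.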

tw4l avatar Aug 14 '24 15:08 tw4l

Closing in favor of #673 (WACZ generation approach has been changed, as documented in #674)

Also worth noting that as of https://github.com/webrecorder/warcio.js/pull/75, CDXJ created by warcio.js now uses strings consistently for status, offset, and length

tw4l avatar Aug 26 '24 21:08 tw4l