browsertrix-crawler
Switch to using JS WACZ
- Replaces dependencies on py-wacz by importing js-wacz natively
- Writes pages to either pages.jsonl (if seed) or extraPages.jsonl (if non-seed)
- Uses streams for writing pages
- Replaces --generateCDX with just moving tmp-cdx -> indexes
- Removes any dependencies on Python
Fixes #484
Pending more testing and a js-wacz release; using @tw4l's branch for now!
Also noticing that js-wacz logs plain strings to stdout, which breaks our JSON logging format. We might want to see what we can do about that. If we call it as a subprocess via the CLI, we could capture the stdout and write it into the details of a crawler log line...
TODO:
- Add WACZ validation (not yet supported in js-wacz)
- Make CDXJ handling more memory-efficient in js-wacz (currently keeps all pages in memory, may OOM with large crawls)
- Possibly move CDXJ line handling in js-wacz from bin/cli.js into WACZ class
Other existing difference: the warcio.js CDX contains the status code as a number instead of a string, which is caught by the current test failures.
I'm wondering if the solution here isn't simply to change the tests to expect a number. Looking at the CDXJ specification, its examples also use an int for the status code, e.g.: https://specs.webrecorder.net/cdxj/0.1.0/#example
I would assume ReplayWeb.page can handle the status as either a string or a number, since our spec has said one thing while the crawler has been doing another. Of course, it's important to verify.
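A tolerant consumer could normalize the field either way; a minimal sketch (the function name is hypothetical, not from ReplayWeb.page or warcio.js):

```javascript
// Accept status as either a string (what the crawler historically wrote)
// or a number (as in the CDXJ spec examples), normalizing to a number
// so comparisons behave consistently.
function normalizeStatus(entry) {
  const status =
    typeof entry.status === "string"
      ? parseInt(entry.status, 10)
      : entry.status;
  return { ...entry, status };
}
```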
Closing in favor of #673 (WACZ generation approach has been changed, as documented in #674)
Also worth noting that as of https://github.com/webrecorder/warcio.js/pull/75, CDXJ created by warcio.js now uses strings consistently for status, offset, and length