Document new WARC fields in 1.x crawler-produced WACZ files

Open tuehlarsen opened this issue 11 months ago • 3 comments

Browsertrix Cloud Version

v1.9.3-79a217b

What did you expect to happen? What happened instead?

I have found some new WARC fields and files in the newest WACZ from the beta.browsertrix release 1.9.3, Browsertrix-Crawler 1.0.0-beta.6 (with warcio.js 2.2.1).

Here is a snippet from the WARC file:

    WARC/1.1^M
    WARC-Page-ID: 61046c48-286b-485a-a8ed-9974f79a179d^M
    WARC-Resource-Type: document^M
    WARC-JSON-Metadata: {"cert":{"issuer":"GlobalSign Atlas R3 DV TLS CA 2024 Q1","ctc":"0"}}^M
    ...

I can't find any proposal concerning WARC-Page-ID here: https://iipc.github.io/warc-specifications/. I also found text and screenshot warc.gz files, but no documentation for them. All files validate with the newest version of jwat-tools: https://github.com/netarchivesuite/jwat-tools/releases/tag/v0.7.2-beta1

Any comments?

Step-by-step reproduction instructions

see above

Additional details

No response

Mar 11 '24 20:03 tuehlarsen

Are the three new WARC fields above mandatory for modern browser-based replay, and are they de facto used in other tools today?

Mar 13 '24 10:03 tuehlarsen

Hi @tuehlarsen, longer explanation coming but in short:

  • WARC-Resource-Type: Proposal created at https://github.com/iipc/warc-specifications/issues/96; this is used to differentiate resources fetched via JavaScript from those loaded directly in the page, and has other possibilities for future analysis of crawls
  • WARC-Page-ID: We added this in Browsertrix to be able to easily associate pages between original crawls and QA replay crawls
  • WARC-JSON-Metadata: Proposal created at https://github.com/iipc/warc-specifications/issues/27
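As a rough illustration of how a tool could pick these three fields out of a record, here is a minimal Python sketch using only the standard library. It is not Browsertrix or warcio code: the `parse_warc_header` helper is hypothetical, the header block is copied from the snippet earlier in this thread (with the elided remainder left out), and the sketch assumes one field per line with no continuation lines.

```python
import json

# Header lines copied from the snippet earlier in this thread.
# The fields elided there ("...") are simply left out here.
RAW_HEADER = (
    "WARC/1.1\r\n"
    "WARC-Page-ID: 61046c48-286b-485a-a8ed-9974f79a179d\r\n"
    "WARC-Resource-Type: document\r\n"
    'WARC-JSON-Metadata: {"cert":{"issuer":"GlobalSign Atlas R3 DV TLS CA 2024 Q1","ctc":"0"}}\r\n'
    "\r\n"
)


def parse_warc_header(raw):
    """Hypothetical helper: return (version, fields) for one header block.

    Assumes CRLF line endings and no folded (continuation) lines.
    Field names are lowercased because WARC field names are case-insensitive.
    """
    lines = raw.split("\r\n")
    version = lines[0]
    fields = {}
    for line in lines[1:]:
        if not line:
            break  # an empty line ends the header block
        # Split on the first colon only, so JSON values stay intact.
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    return version, fields


version, fields = parse_warc_header(RAW_HEADER)
page_id = fields.get("warc-page-id")
resource_type = fields.get("warc-resource-type")
# The value of WARC-JSON-Metadata in the snippet is plain JSON text.
metadata = json.loads(fields.get("warc-json-metadata", "{}"))

print(version, page_id, resource_type, metadata["cert"]["issuer"])
```

A real reader would normally rely on an existing WARC library rather than hand-parsing; the point here is only that the new fields are ordinary named fields.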

None of these additions should cause WARC validation to fail or cause any replay issues. Most software other than Browsertrix will simply ignore the fields, as is suggested in section 5.1 of the WARC 1.1 specification:

"Because new fields may be defined in extensions to the core WARC format, WARC processing software shall ignore fields with unrecognized names."
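To make that rule concrete, here is a small Python sketch of a reader that keeps the fields it understands and skips the rest without failing. It is an assumption-laden illustration, not the behaviour of any specific tool: `KNOWN_FIELDS` and `split_fields` are hypothetical, and the `WARC-Type: response` line is added only so the example has one recognized field.

```python
# Fields this hypothetical older tool understands (a small subset of WARC 1.1).
KNOWN_FIELDS = {"warc-type", "warc-record-id", "warc-date", "content-length"}


def split_fields(header_lines):
    """Return (known, ignored): parsed known fields and names of skipped ones."""
    known = {}
    ignored = []
    for line in header_lines:
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no colon, so not a named field (e.g. the version line)
        key = name.strip().lower()
        if key in KNOWN_FIELDS:
            known[key] = value.strip()
        else:
            # Unrecognized field: skip it rather than raising an error.
            ignored.append(name.strip())
    return known, ignored


header_lines = [
    "WARC/1.1",
    "WARC-Type: response",  # assumed for illustration, not from the snippet
    "WARC-Page-ID: 61046c48-286b-485a-a8ed-9974f79a179d",
    "WARC-Resource-Type: document",
]

known, ignored = split_fields(header_lines)
print(known)
print(ignored)
```

With this approach the record is still processed normally; the extension fields only show up in the `ignored` list.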

Re: your comment about missing documentation for screenshot and text WARC files, that is noted and should be coming shortly! As of the latest 1.0.0 crawler beta release, these WARCs will also be prefixed if a WARC prefix is specified.

Mar 14 '24 18:03 tw4l

I hope the documentation will be more explicit, as it is of great importance for large, older web archives to know what the new WARC fields are for and what they will be used for in the future. That requires a syntax definition and description that can serve as input to a later ISO standardization process.

Apr 09 '24 17:04 tuehlarsen