browsertrix-crawler
Always add warcinfo records to all WARCs
Fixes #553
Includes warcinfo records at the beginning of new WARCs, as well as the combined WARC.
Also writes the warcinfo record as WARC/1.1, to match the rest of the WARC records.
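For reference, a minimal sketch of what a WARC/1.1 warcinfo record looks like when serialized, following the WARC 1.1 record layout (version line, named headers, blank line, `application/warc-fields` body, two trailing CRLFs). `buildWarcInfoRecord`, `WarcInfoOptions`, and the sample field values are hypothetical names used only for illustration; this is not the crawler's or warcio.js's actual API.

```typescript
// Hypothetical helper for illustration only, not the code in this PR.
interface WarcInfoOptions {
  filename: string; // name of the WARC file this record opens
  recordId: string; // a URN, e.g. "urn:uuid:..."
  date: string; // ISO 8601 UTC timestamp
  fields: Array<[string, string]>; // warc-fields body as name/value pairs
}

function buildWarcInfoRecord(opts: WarcInfoOptions): string {
  // The body is a list of "name: value" lines, each terminated by CRLF.
  const body = opts.fields
    .map(([name, value]) => `${name}: ${value}\r\n`)
    .join("");

  // Content-Length counts bytes of the UTF-8 encoded body, not characters.
  const contentLength = new TextEncoder().encode(body).length;

  const headers = [
    "WARC/1.1", // same version as the other records in the file
    "WARC-Type: warcinfo",
    `WARC-Record-ID: <${opts.recordId}>`,
    `WARC-Date: ${opts.date}`,
    `WARC-Filename: ${opts.filename}`,
    "Content-Type: application/warc-fields",
    `Content-Length: ${contentLength}`,
  ].join("\r\n");

  // Headers, blank line, body, then the two CRLFs that end every record.
  return `${headers}\r\n\r\n${body}\r\n\r\n`;
}

// Sample values are placeholders, not output from a real crawl.
const example = buildWarcInfoRecord({
  filename: "rec-0.warc.gz",
  recordId: "urn:uuid:00000000-0000-0000-0000-000000000000",
  date: "2024-01-01T00:00:00Z",
  fields: [
    ["software", "example-crawler"],
    ["format", "WARC File Format 1.1"],
  ],
});
```

Under this sketch, the record would be written first in each new WARC file and again at the start of the combined WARC.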
As discussed, we may also want to add a type field to the warcinfo records, perhaps:
screenshots: type: screenshot
text: type: text
info (for QA crawls): type: pageinfo
regular WARCs + combined WARC: type: combined or type: web?
+1 for web, as the best term we've come up with so far for general WARC records capturing web traffic.
Will create a separate issue / PR for the additional type field.