Always add warcinfo records to all WARCs

Open • ikreymer opened this issue 10 months ago • 2 comments

Fixes #553

Includes a warcinfo record at the beginning of each new WARC, as well as the combined WARC. Also writes the warcinfo record as WARC/1.1, matching the rest of the WARC records.
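For illustration, here is a minimal sketch of what starting a WARC file with a WARC/1.1 warcinfo record could look like. It is not the crawler's actual code: `build_warcinfo_record`, the filename, and the `software` field value are hypothetical, and the record is written uncompressed for readability (WARC files are commonly gzipped per record).

```python
import uuid
from datetime import datetime, timezone


def build_warcinfo_record(filename: str, fields: dict) -> bytes:
    """Serialize one uncompressed WARC/1.1 warcinfo record (hypothetical helper)."""
    # The warcinfo block is a list of "name: value" lines (application/warc-fields).
    payload = "".join(f"{k}: {v}\r\n" for k, v in fields.items()).encode("utf-8")
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.1",  # same version line as the other records in the file
        "WARC-Type: warcinfo",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Filename: {filename}",
        "Content-Type: application/warc-fields",
        f"Content-Length: {len(payload)}",
    ]
    head = ("\r\n".join(headers) + "\r\n\r\n").encode("utf-8")
    # Each WARC record is terminated by two CRLFs after the block.
    return head + payload + b"\r\n\r\n"


# The warcinfo record goes first; other records would be appended after it.
warc_bytes = build_warcinfo_record("example-0.warc", {"software": "example-crawler"})
```

Every new WARC, and the combined WARC, would begin with such a record before any other records are written.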

ikreymer • Apr 19 '24 02:04

As discussed, one thing we may want to add is additional data to the warcinfo records, perhaps:

- screenshots: type: screenshot
- text: type: text
- info (for QA crawls): type: pageinfo
- regular warcs + combined warc: type: combined or type: web?
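As a rough sketch of the proposal above, the extra field could be added to the warcinfo fields before they are serialized. Everything here is an assumption for illustration: the field name `type` and its values come from the list above and were not final, `web` is picked arbitrarily over `combined` for regular and combined WARCs, and the helper names and the `software` value are hypothetical.

```python
# Proposed values from the discussion; "web" vs "combined" was still undecided.
PROPOSED_TYPES = {
    "screenshots": "screenshot",
    "text": "text",
    "info": "pageinfo",  # for QA crawls
    "regular": "web",
    "combined": "web",
}


def warcinfo_fields(kind: str, base: dict) -> dict:
    """Return a copy of the base warcinfo fields with the proposed type added."""
    fields = dict(base)  # copy so the caller's dict is left untouched
    fields["type"] = PROPOSED_TYPES[kind]
    return fields


def serialize_fields(fields: dict) -> str:
    """Render fields as "name: value" lines, as used in a warcinfo block."""
    return "".join(f"{k}: {v}\r\n" for k, v in fields.items())


base = {"software": "example-crawler"}
print(serialize_fields(warcinfo_fields("screenshots", base)))
```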

ikreymer • Apr 19 '24 17:04

> regular warcs + combined warc: type: combined or type: web?

+1 for web, as it's the best term we've come up with so far for general WARC records capturing web traffic.

tw4l • Apr 21 '24 08:04

Will create a separate issue / PR for the additional type field.

ikreymer • May 22 '24 22:05