Tie CDXJ fields to WARC / HTTP headers
Section 6 of the CDXJ spec defines the fields to be included in a JSON block as url, digest, mime, payload, filename, offset, length, and status. It would be useful to document where each of these can be found when parsing a WARC file, since some come from the WARC header and others from the HTTP header.
From what I can see from looking through the CDXJ-indexer code, the fields map as follows:
- url
  - WARC-Target-URI
- digest
  - WARC-Payload-Digest
- mime
  - If the WARC Record Type is 'revisit' then the type should be "warc/revisit", otherwise use the HTTP Content-Type header.
- filename
  - WARC-Filename
- offset
  - This is something calculated by counting through the WARC file byte-by-byte? I can't find it in the WARC spec.
- length
  - WARC Content-Length
- status
  - Status Code from the HTTP header
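To make the mapping concrete, here is a minimal sketch that builds the JSON block from headers that have already been parsed into plain dicts. The function name `cdxj_block` and its parameters are hypothetical, not taken from the CDXJ-indexer code. It passes `filename` and `offset` in separately: WARC-Filename normally appears only on the warcinfo record, and the offset is not stored in the record at all. The payload field is left out because no source for it is identified above.

```python
import json


def cdxj_block(warc_headers, http_headers, http_status, filename, offset):
    """Build a CDXJ-style JSON block from parsed headers (hypothetical helper)."""
    if warc_headers.get("WARC-Type") == "revisit":
        mime = "warc/revisit"
    else:
        # Used as-is; stripping parameters such as charset is left to the caller.
        mime = http_headers.get("Content-Type")
    return {
        "url": warc_headers.get("WARC-Target-URI"),
        "digest": warc_headers.get("WARC-Payload-Digest"),
        "mime": mime,
        "filename": filename,  # supplied by the caller, see note above
        "offset": offset,      # supplied by the caller, not in the record
        "length": warc_headers.get("Content-Length"),
        "status": http_status,
    }


if __name__ == "__main__":
    warc = {
        "WARC-Type": "response",
        "WARC-Target-URI": "http://example.com/",
        "WARC-Payload-Digest": "sha1:EXAMPLE",
        "Content-Length": "1234",
    }
    http = {"Content-Type": "text/html"}
    print(json.dumps(cdxj_block(warc, http, "200", "example.warc", 0)))
```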
I believe offset is the byte offset of the WARC record, so you can seek directly to the record you want without reading through the entire file.
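As a rough illustration of that use, the sketch below seeks straight to a record and reads it. The helper name `read_record_bytes` is made up for this example, and it assumes that `offset` and `length` both count bytes in the file as stored on disk.

```python
def read_record_bytes(path, offset, length):
    """Return `length` raw bytes starting at byte `offset` of the file at `path`."""
    with open(path, "rb") as f:
        f.seek(offset)         # jump directly to the record
        return f.read(length)  # read only that record's bytes
```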
@TheTechRobo thanks, yes, that's the purpose of offset. My question is about where the value comes from: most of these values are easily found by reading either the WARC header or the HTTP header of the associated record.
However, if someone was trying to index a WARC file containing several records, the offset of every record is not contained in the WARC file itself, and has to be calculated separately. Is this correct?
My current working solution is to read the WARC header to bytes, read the WARC body to bytes, and add the sum of bytes to a counter when looping through every record in the file.
More generally, for the other values which do map directly to the WARC record, this could be mentioned in the spec.
> However, if someone was trying to index a WARC file containing several records, the offset of every record is not contained in the WARC file itself, and has to be calculated separately. Is this correct?
Correct; WARC files can be concatenated, so a calculated offset wouldn't even necessarily be accurate.
> My current working solution is to read the WARC header to bytes, read the WARC body to bytes, and add the sum of bytes to a counter when looping through every record in the file.
Most programming languages offer a tell method on files that you can use to get the current file position. You can call that before you read the WARC header to get the offset.
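In Python, for example, the `tell` approach could look roughly like the sketch below. It assumes an uncompressed WARC opened in binary mode, ignores header continuation lines, and counts the trailing CRLF CRLF as part of the record length; all three are simplifications. The name `record_offsets` is made up for this example. For gzipped WARCs the offset would need to be the position of each gzip member in the compressed file, which this sketch does not handle.

```python
import io


def record_offsets(stream):
    """Yield (offset, length, headers) for each record in an uncompressed WARC stream."""
    while True:
        offset = stream.tell()       # position before the record's first byte
        version = stream.readline()  # e.g. b"WARC/1.0\r\n"
        if not version:
            return                   # end of file
        headers = {}
        # Header lines run until the first empty line (a bare CRLF).
        for line in iter(stream.readline, b"\r\n"):
            if not line:
                raise ValueError("truncated WARC header")
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the record block, then the two CRLFs that end the record.
        stream.seek(int(headers["Content-Length"]), io.SEEK_CUR)
        stream.read(4)
        yield offset, stream.tell() - offset, headers
```

Usage would be along the lines of `with open("example.warc", "rb") as f: for offset, length, headers in record_offsets(f): ...`, where `example.warc` is a placeholder filename.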
@TheTechRobo thanks! I hadn't considered getting the file position when reading through.