Tie CDXJ fields to WARC / HTTP headers

Open extua opened this issue 9 months ago • 4 comments

Section 6 of the CDXJ spec defines the fields to be included in a JSON block as url, digest, mime, payload, filename, offset, length, and status. It might be useful to document where each of these can be found when parsing a WARC file, as some come from the WARC header and some from the HTTP header.

From what I can see from looking through the CDXJ-indexer code, the fields map as follows:

url: WARC-Target-URI
digest: WARC-Payload-Digest
mime: If the WARC record type is 'revisit' then the type should be "warc/revisit", otherwise use the HTTP Content-Type header.
filename: WARC-Filename
offset: This is something calculated by counting through the WARC file byte-by-byte? I can't find it in the WARC spec.
length: WARC Content-Length
status: Status code from the HTTP header
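
For illustration, the mapping above could be sketched in Python roughly as follows. This only mirrors the tentative mapping listed in this comment, not the CDXJ spec or the CDXJ-indexer code. The function name `cdxj_json_block` and its parameters are hypothetical, and the WARC and HTTP headers are assumed to have been parsed into plain dicts already.

```python
import json


def cdxj_json_block(warc_headers, http_status, http_headers, offset):
    """Build a JSON block from the tentative field mapping in this thread.

    All names here are hypothetical; the offset is passed in because it is
    not stored in the WARC record itself.
    """
    # Revisit records get a fixed type; otherwise use the HTTP Content-Type.
    if warc_headers.get("WARC-Type") == "revisit":
        mime = "warc/revisit"
    else:
        mime = http_headers.get("Content-Type")

    block = {
        "url": warc_headers.get("WARC-Target-URI"),
        "digest": warc_headers.get("WARC-Payload-Digest"),
        "mime": mime,
        "filename": warc_headers.get("WARC-Filename"),
        "offset": offset,
        # Assumption taken from the list above: length = WARC Content-Length.
        "length": warc_headers.get("Content-Length"),
        "status": http_status,
    }
    return json.dumps(block)
```

The `payload` field from Section 6 is left out because the list above does not say where it comes from.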

extua avatar Mar 20 '25 12:03 extua

I believe offset is the byte offset of the WARC record, so you can seek directly to the record you want without reading through the entire file.
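
As a rough sketch of what that seeking looks like in Python, assuming an index entry has already given you an offset and a length for the record (the helper name `read_record_bytes` is made up for this example):

```python
def read_record_bytes(path, offset, length):
    """Return `length` raw bytes starting at byte `offset` of the file."""
    with open(path, "rb") as f:
        f.seek(offset)          # jump straight to the start of the record
        return f.read(length)   # read only that record's bytes
```

Nothing before `offset` is read, which is why an index that stores the offset avoids scanning the whole file.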

TheTechRobo avatar Mar 20 '25 14:03 TheTechRobo

@TheTechRobo thanks, yes, that's the purpose of offset. My question is about where the value comes from: most of these values are easily found by reading either the WARC header or the HTTP header of the associated record.

However, if someone was trying to index a WARC file containing several records, the offset of every record is not contained in the WARC file itself, and has to be calculated separately. Is this correct?

My current working solution is to read the WARC header to bytes, read the WARC body to bytes, and add the sum of bytes to a counter when looping through every record in the file.

More generally, for the other values which do map directly to the WARC record, this could be mentioned in the spec.

extua avatar Mar 24 '25 10:03 extua

> However, if someone was trying to index a WARC file containing several records, the offset of every record is not contained in the WARC file itself, and has to be calculated separately. Is this correct?

Correct; WARC files can be concatenated, so a calculated offset wouldn't even necessarily be accurate.

> My current working solution is to read the WARC header to bytes, read the WARC body to bytes, and add the sum of bytes to a counter when looping through every record in the file.

Most programming languages offer a tell method on files that you can use to get the current file position. You can call that before you read the WARC header to get the offset.
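
A minimal Python sketch of that idea, assuming an uncompressed WARC file opened in binary mode where every record carries a Content-Length header. The function name `iter_record_offsets` is made up for this example, and the header parsing is deliberately simplistic (no continuation lines, no validation).

```python
import io


def iter_record_offsets(stream):
    """Yield (offset, headers) for each record in an uncompressed WARC stream."""
    while True:
        offset = stream.tell()      # position before the record's version line
        version = stream.readline()
        if not version:
            break                   # end of file
        if not version.strip():
            continue                # skip any stray blank line between records

        # Read "Name: value" header lines up to the blank line that ends them.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # Skip the record block, then the two CRLFs that terminate the record.
        stream.seek(int(headers["Content-Length"]), io.SEEK_CUR)
        stream.readline()
        stream.readline()
        yield offset, headers
```

Usage would be something like `with open("example.warc", "rb") as f: offsets = [o for o, _ in iter_record_offsets(f)]`, where the file name is a placeholder. For a gzip-compressed WARC read through a decompressing wrapper, `tell()` reports the position in the decompressed data rather than in the file on disk, so this sketch does not carry over unchanged.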

TheTechRobo avatar Mar 24 '25 11:03 TheTechRobo

@TheTechRobo thanks! I hadn't considered getting the file position when reading through.

extua avatar Apr 04 '25 06:04 extua