
Do not allow writing records whose content_stream() has been modified, as it results in partial or empty content

Open dlazesz opened this issue 6 years ago • 4 comments

Related to #64:

When writing a record into an archive, one can first read out its content via the streaming API and then write the record (now with partial or empty content) into the archive, resulting in the loss of content. We clearly do not, or at least should not, want that. An Exception or a Warning should be raised instead of losing content silently.

from warcio.warcwriter import WARCWriter

filename = 'test.warc.gz'

# Write WARC INFO record with custom headers
output_file = open(filename, 'wb')
writer = WARCWriter(output_file, gzip=True, warc_version='WARC/1.1')

# Custom information and stuff
info_headers = {'custom': 'stuff'}
info_record = writer.create_warcinfo_record(filename, info_headers)

custom_headers_raw = info_record.content_stream().read(6)  # TODO: After this writing should not be allowed

writer.write_record(info_record)

output_file.close()

Result (notice the partial payload):

WARC/1.1
WARC-Type: warcinfo
WARC-Record-ID: <urn:uuid:67a981ea-fece-49b9-834a-e3b660042cf5>
WARC-Filename: test.warc.gz
WARC-Date: 2019-08-15T10:44:53.778034Z
Content-Type: application/warc-fields
Content-Length: 15

: stuff



dlazesz avatar Aug 15 '19 10:08 dlazesz

Yes, this is an easy mistake to make when doing anything complicated.

wumpus avatar Aug 15 '19 16:08 wumpus

I was bitten by this recently as well. In my tool, I'm writing a log file (using logging). Before the process exits, I copy this log file into the WARC as a resource record by open()ing it in read mode and passing it to the WARCWriter. I also pass the record length (= file size) to avoid the temporary copy that ensure_digest would otherwise create.

The above works perfectly fine – unless the log file is appended to between creating the record and writing it. I recently started adding very detailed debug-level logging messages, and some of those are produced in that window. What happens is that the length and digest are taken at create_warc_record time, but then the file is copied in its entirety into the record body. Due to the size mismatch, the file becomes unreadable.

Here's a small example to illustrate this (logging also flushes the file automatically on every message as of Python 3.8.4):

import os
import warcio


with open('test.txt', 'w') as logfile:
	logfile.write('foo\n')
	logfile.flush()
	with open('test.warc.gz', 'wb') as fp:
		writer = warcio.warcwriter.WARCWriter(fp)
		with open('test.txt', 'rb') as recordinput:
			recordinput.seek(0, os.SEEK_END)
			length = recordinput.tell()
			recordinput.seek(0)
			record = writer.create_warc_record('file://test.txt', 'resource', payload = recordinput, length = length)
			logfile.write('bar\n')
			logfile.flush()
			writer.write_record(record)

The result is a WARC like this:

WARC/1.0
WARC-Type: resource
WARC-Record-ID: <urn:uuid:7d16bc43-5eeb-4b63-b3a0-64119ec8890f>
WARC-Target-URI: file://test.txt
WARC-Date: 2020-07-15T01:51:23Z
WARC-Payload-Digest: sha1:6HJNF6JE5GDKZBX566ZWZFF434ZL53AV
WARC-Block-Digest: sha1:JZEOFSND2LFIU4EMWDGFIVYAKRHPWUBB
Content-Type: application/warc-record
Content-Length: 4

foo
bar


The content length and payload digest are from b'foo\n' but the block contents and block digest are from b'foo\nbar\n'. The digest difference happens because the block digest is only calculated on the write, not on record creation (which also means the file needs to be read thrice in total instead of only twice; is there a good reason for that?).

This is of course a special case. First, the file object passed into warcio isn't actually manipulated directly; the modification happens indirectly through the file system, which makes it much less obvious to spot. And second, it's only a file append, not a movement of the file pointer (through read or seek) or an overwrite of existing data. This specific case can actually be fixed fairly easily. I'm now using a wrapper around the file object which prevents access beyond the initial file length. It's also fixable in warcio: _iter_stream just needs to only read length bytes instead of reading until EOF. In fact, by requiring _iter_stream to read exactly that many bytes, the issue in the original report can also be kind of solved by raising an error if _iter_stream is unable to read that much. This would at least clearly indicate that something's wrong.
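For reference, here's a minimal sketch of such a wrapper. The class name LengthLimitedReader and the exact set of methods are my own guesses; warcio mainly needs read(), plus seek()/tell() if it computes the payload digest up front on a seekable stream:

class LengthLimitedReader:
	"""Refuses to return data beyond the end offset recorded at
	construction time, so a concurrent append to the underlying
	file cannot leak into the record body."""

	def __init__(self, fp, length):
		self._fp = fp
		self._limit = fp.tell() + length  # absolute end of the record body

	def read(self, size=-1):
		remaining = self._limit - self._fp.tell()
		if remaining <= 0:
			return b''
		if size < 0 or size > remaining:
			size = remaining
		return self._fp.read(size)

	# Digest computation may seek/tell on the stream; delegate those.
	def seek(self, pos, whence=0):
		return self._fp.seek(pos, whence)

	def tell(self):
		return self._fp.tell()

	def seekable(self):
		return True

With this, the log-file example above writes exactly length bytes even if the logger appends in the meantime, because the limit is an absolute offset and survives the seek-back that digest computation performs.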

JustAnotherArchivist avatar Jul 15 '20 02:07 JustAnotherArchivist

I think that this should definitely be addressed; writing after reading should be disallowed:

record.content_stream()
writer.write_record(record)  # <- this should throw an error

Adding a non-streaming source as per #64 could provide an alternative.
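A minimal sketch of how such a guard could look (GuardedRecord and safe_write are hypothetical names, not warcio API):

class GuardedRecord:
	"""Flags the record as consumed once its content stream is handed out."""

	def __init__(self, record):
		self._record = record
		self.consumed = False

	def content_stream(self):
		self.consumed = True
		return self._record.content_stream()

	def __getattr__(self, name):
		return getattr(self._record, name)


def safe_write(writer, guarded):
	if guarded.consumed:
		raise ValueError('content_stream() was already read; '
		                 'writing would produce a partial body')
	writer.write_record(guarded._record)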

@JustAnotherArchivist
Yes, perhaps the ensure_digest() in create_warc_record() is unnecessary; it should compute both digests only when writing. If someone needs the digest sooner, they can call ensure_digest() explicitly. Adding a length check is probably a good idea, but in general there's no way to prevent the user from modifying the record between create_warc_record() and write_record(), except maybe by combining the two into a single call. Having a separate record object can be useful for adding additional data or keeping a reference to the record, but it's still possible to change it in a way that makes it invalid, since the stream can always be manipulated until it is actually written.
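For illustration, the single-call idea could look like this (hypothetical helper, not part of warcio):

def create_and_write_record(writer, uri, record_type, **kwargs):
	# Create and write in one step, leaving no window in which the
	# payload stream can be consumed or the underlying file can change.
	record = writer.create_warc_record(uri, record_type, **kwargs)
	writer.write_record(record)
	return record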

ikreymer avatar Jul 15 '20 05:07 ikreymer

@ikreymer Indeed, short of making a full copy (as ensure_digest does when the length is unknown), with its significant performance and disk space impact for large records, it's impossible to prevent that in general. The simple cases that modify only the stream position (original issue) or the stream length (mine) can be solved, though: keep the value of the stream position at create_warc_record time, then seek to that position before writing, and only write length bytes. This only works for seekable streams, of course.
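A sketch of that approach (illustrative only; the real fix would live in warcio's write path):

def copy_exact(src, dst, start_pos, length, bufsize=16384):
	# Restore the position remembered at create_warc_record time and
	# copy exactly `length` bytes; anything appended afterwards is
	# ignored, and a truncated stream raises instead of silently
	# producing a short record.
	src.seek(start_pos)
	remaining = length
	while remaining > 0:
		chunk = src.read(min(bufsize, remaining))
		if not chunk:
			raise EOFError('stream ended %d bytes short of the declared length' % remaining)
		dst.write(chunk)
		remaining -= len(chunk)

This also covers the original report: a reader that has already consumed part of the stream leaves fewer than length bytes available, so the EOFError fires instead of a short record being written silently.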

JustAnotherArchivist avatar Jul 15 '20 16:07 JustAnotherArchivist