
non-streaming interface would be useful

Open wumpus opened this issue 6 years ago • 6 comments

Right now the only interface for getting at the record content is record.content_stream().read(), which is streaming. I can't do that twice. So if I'm passing a record around in a program and want to access the record content in multiple places, I've ended up wrapping warcio's record with a class that has a .content() method.

That seems odd. Other packages like Requests offer both streaming and non-streaming interfaces.

Obviously we'd want to preserve streaming behavior -- pure streaming code should continue to not buffer all of the content in memory. One way to do that would be to save all of the content in memory only if .content() is called before .content_stream().read(), and make calling .content() after calling content_stream().read() raise an exception.
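
A minimal sketch of what those semantics could look like, assuming a plain warcio record underneath; the ContentCachingRecord wrapper and its attribute names are hypothetical, not part of warcio's API:

    import io

    class ContentCachingRecord:
        """Hypothetical wrapper adding a non-streaming .content() accessor."""

        def __init__(self, record):
            self._record = record    # the underlying warcio record
            self._content = None     # buffered payload, if any
            self._streamed = False   # has the stream already been consumed?

        def content(self):
            # Buffer the payload only on the first call, and only if the
            # streaming interface has not already consumed it.
            if self._streamed:
                raise RuntimeError('content() called after content_stream()')
            if self._content is None:
                self._content = self._record.content_stream().read()
            return self._content

        def content_stream(self):
            # Serve the buffered copy if content() was already called;
            # otherwise fall through to pure streaming (no buffering).
            if self._content is not None:
                return io.BytesIO(self._content)
            self._streamed = True
            return self._record.content_stream()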

wumpus avatar Jan 25 '19 19:01 wumpus

The tool evolved from the need to support the streaming use case specifically, but I definitely see how this can be useful for certain situations. Then again, it might make it too easy to just store a multi-GB record in memory, which right now you'd have to do manually. Since it would be explicit with a .content() method, I guess that can work as it only happens when requested. Sure, if you'd like to make a PR for this, we can add it!

ikreymer avatar Jan 27 '19 17:01 ikreymer

I actually made a mistake in "warcio check" with the streaming interface! Triggering a digest check requires reading all of the payload, and I did that with

    record.content_stream().read()

Two reviewers didn't spot it :-)

(Yes, I'll fix it.)

wumpus avatar Jan 30 '19 20:01 wumpus

I just got bitten by this as well.

Another related issue is that it isn't possible to read both the raw stream and the decoded/content stream for a record.


Here's how I worked around it for my use case, starting with a record returned from iterating over a WARCIterator:

    import io
    import warcio.recordloader

    rawPayload = io.BytesIO(record.raw_stream.read())
    if record.http_headers:
        recordCopy = warcio.recordloader.ArcWarcRecord(
            record.format, record.rec_type, record.rec_headers, rawPayload,
            record.http_headers, record.content_type, record.length)
        decodedPayload = io.BytesIO(recordCopy.content_stream().read())
        rawPayload.seek(0)
    else:
        decodedPayload = rawPayload

With the obvious caveats that the two streams might be the same object (i.e. share the stream position) and that you need enough RAM to hold everything in memory a couple of times if it gets decoded.
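
For example, a small usage sketch of the snippet above, checking the same-object caveat before rewinding:

    # Both payloads can now be re-read as often as needed; when no HTTP
    # headers are present they are the same BytesIO object, so seek only once.
    raw_bytes = rawPayload.read()
    rawPayload.seek(0)
    if decodedPayload is rawPayload:
        decoded_bytes = raw_bytes
    else:
        decoded_bytes = decodedPayload.read()
        decodedPayload.seek(0)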

JustAnotherArchivist avatar Jan 05 '21 02:01 JustAnotherArchivist

Can we expect any improvement on this topic in the future?

This is still a major headache. The worst case is when I want to pass around multiple selected records that are stored at distant points in the WARC file. I'm getting zlib.error: Error -3 while decompressing data: incorrect data check errors when the payload slips out of the cache.

My workaround is the following:

    from warcio.archiveiterator import ArchiveIterator

    stream = open('my_stuff.warc.gz', 'rb')
    archive_it = ArchiveIterator(stream)
    my_rec = next(iter(archive_it))
    # Encode: remember where this record starts
    my_rec_pos = archive_it.get_record_offset()
    pass_this_around = (stream, my_rec_pos)
    # ...
    # Decode: seek back and re-parse the record
    stream, my_rec_pos = pass_this_around
    stream.seek(my_rec_pos)
    my_rec = next(iter(ArchiveIterator(stream)))

IMO this could serve as a quick and dirty internal workaround as well, since it could solve most of the issues with the streaming API.
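
A sketch of what that internal workaround might look like as a helper pair, assuming a seekable stream; the function names are made up for illustration:

    from warcio.archiveiterator import ArchiveIterator

    def make_record_ref(archive_it, stream):
        # Capture enough state to locate the current record again later.
        return (stream, archive_it.get_record_offset())

    def load_record(ref):
        # Seek back to the recorded offset and re-parse the record there.
        stream, offset = ref
        stream.seek(offset)
        return next(iter(ArchiveIterator(stream)))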

dlazesz avatar Dec 09 '21 15:12 dlazesz

@dlazesz your solution does not work in a streaming environment. The solution at the top works for both streaming and non-streaming. I think your solution also breaks the checksum code. That said, if it works for you, do use it!

@ikreymer reasonably suggested that the interface I describe at the top, with a new record.content interface, be even more explicit, such as having to pass a kwarg to ArchiveIterator(), something like cache_content_in_memory with a default of False. And it could only cache in memory if the stream is unseekable.
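
Purely as a sketch of that suggestion -- cache_content_in_memory is a proposed keyword, not an existing warcio parameter, and record.content() does not exist yet either:

    from warcio.archiveiterator import ArchiveIterator

    with open('example.warc.gz', 'rb') as stream:
        # cache_content_in_memory=True is the proposed opt-in; the default
        # of False would keep today's pure-streaming behavior.
        for record in ArchiveIterator(stream, cache_content_in_memory=True):
            body = record.content()          # proposed non-streaming accessor
            assert body == record.content()  # repeatable reads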

I think there's a way to preserve checksums after a seek, if it's a seek back to the start of the content -- the checksums can be reset.
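
A rough sketch of that idea, assuming hashlib-based digests; the DigestingReader class is invented for illustration:

    import hashlib

    class DigestingReader:
        def __init__(self, stream, algo='sha1'):
            self._stream = stream
            self._start = stream.tell()  # start of the content
            self._algo = algo
            self._digest = hashlib.new(algo)

        def read(self, size=-1):
            data = self._stream.read(size)
            self._digest.update(data)
            return data

        def seek_to_start(self):
            # A rewind to the start of the content invalidates the partial
            # digest, so reset it instead of raising an error.
            self._stream.seek(self._start)
            self._digest = hashlib.new(self._algo)

        def hexdigest(self):
            return self._digest.hexdigest()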

Anyway, the main reason why this isn't already in the code is due to lack of time on my part! I am glad to hear that more people are using the code in interesting ways, which is how we find these problems.

wumpus avatar Dec 09 '21 19:12 wumpus

> @dlazesz your solution does not work in a streaming environment. The solution at the top works for both streaming and non-streaming. I think your solution also breaks the checksum code. That said, if it works for you, do use it!

The checksum is fine after the seeking and copying, at least for the request and response records which I've tested.

If the stream is seekable -- which IMO is the case most of the time -- my solution works. In the remaining cases you must fiddle with caching the content, e.g. with content(). Anyhow, these two cases should be handled separately.

> @ikreymer reasonably suggested that the interface I describe at the top, with a new record.content interface, be even more explicit, such as having to pass a kwarg to ArchiveIterator(), something like cache_content_in_memory with a default of False. And it could only cache in memory if the stream is unseekable.

I think many issues arose from the odd behaviour of the streaming interface, which IMO should be optional and turned off by default. As ikreymer said: "The tool evolved from the need to support the streaming use case specifically."

Even if you decide the other way around, I would be happy to be able to enable a seeking mechanism for the non-streaming use cases instead of creating workarounds because of strange zlib errors.

Hope you'll have some time for this in the future! :crossed_fingers:

dlazesz avatar Dec 10 '21 16:12 dlazesz