
Add an iterator of HTTP exchanges

Open ibnesayeed opened this issue 5 years ago • 2 comments

I can see many use cases where it would be useful to iterate over WARC records and yield related HTTP request and response records together as a tuple. I understand that WARC does not guarantee that both records of a pair are present in the same file or appear in any specific order, but in a typical archival collection they are usually close to each other. This iterator could work on a best-effort basis.

from warcio.archiveiterator import ArchiveIterator


def iter_exchanges(stream):
    # Unpaired records, keyed by the ID shared by a request/response pair
    half_exchanges = {}

    for record in ArchiveIterator(stream):
        # Filter any non-HTTP/HTTPS records out
        uri = record.rec_headers.get_header('WARC-Target-URI') or ''
        if not uri.startswith(('http:', 'https:')):
            continue

        if record.rec_type == 'request':
            key = record.rec_headers.get_header('WARC-Concurrent-To')
        elif record.rec_type == 'response':
            key = record.rec_headers.get_header('WARC-Record-ID')
        else:
            # Other record types (revisit, metadata, ...) are not paired here
            continue

        if not key:
            continue

        if key not in half_exchanges:
            half_exchanges[key] = record
        else:
            # Remove the paired bookkeeping entry and yield the pair
            other = half_exchanges.pop(key)
            if record.rec_type == 'request':
                yield (record, other)
            else:
                yield (other, record)

The above code is one possible way to implement it: we keep track of unpaired records in a dictionary keyed by the identifier that ties corresponding request and response records together. Once a pair is matched, we delete that bookkeeping entry and yield the pair. This rudimentary approach should be taken with a pinch of salt, however, because records that are never paired stay in the dictionary indefinitely, so memory use can grow without bound. For WARCs with chaotic ordering the memory requirement may get high as the bookkeeping grows, but related WARC records are generally placed in close proximity, in which case the bookkeeping should stay small.
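One way to keep the bookkeeping from growing without bound is to cap the number of unpaired records and evict the oldest one when the cap is exceeded. The following is only a minimal sketch of that idea, not warcio API: `pair_exchanges` and `max_pending` are hypothetical names, and the input is assumed to be an iterable of `(rec_type, key, record)` tuples where `key` is the shared identifier already extracted from the WARC headers.

```python
from collections import OrderedDict


def pair_exchanges(records, max_pending=1000):
    """Yield (request, response) tuples from (rec_type, key, record) tuples.

    Hypothetical helper: at most `max_pending` unpaired records are held;
    the oldest unpaired record is dropped when the cap is exceeded.
    """
    pending = OrderedDict()  # key -> (rec_type, record), oldest first
    for rec_type, key, record in records:
        if key is None:
            continue
        if key in pending:
            other_type, other = pending.pop(key)
            if other_type == rec_type:
                # Two records of the same kind share a key: keep the newer one waiting
                pending[key] = (rec_type, record)
                continue
            yield (record, other) if rec_type == 'request' else (other, record)
        else:
            pending[key] = (rec_type, record)
            if len(pending) > max_pending:
                # Give up on the oldest unpaired record
                pending.popitem(last=False)
```

The trade-off is that a pair whose halves are further apart than the cap allows is silently dropped, which fits the best-effort framing. Also, as far as I understand, warcio consumes a record's payload when the iterator advances, so a record held in the dictionary would probably need its payload read before it is stored.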

Also, this code does not handle revisit records for now, but something should be done for those too.
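One possible treatment, sketched here under assumptions, is to let a revisit record stand in for the response side of an exchange. This assumes the crawler wrote the request with a `WARC-Concurrent-To` header pointing at the revisit record's `WARC-Record-ID`, which may not hold for every WARC. `exchange_key` is a hypothetical helper, and `headers` is a plain dict used in place of warcio's header object to keep the example self-contained.

```python
def exchange_key(rec_type, headers):
    """Return (side, key) for pairing, or (None, None) if the record is not paired.

    Hypothetical helper: treats a revisit record as the response side.
    """
    if rec_type == 'request':
        return 'request', headers.get('WARC-Concurrent-To')
    if rec_type in ('response', 'revisit'):
        # Assumption: the request is concurrent to the revisit record itself
        return 'response', headers.get('WARC-Record-ID')
    return None, None
```

A revisit record usually carries no payload of its own, so a consumer that needs the body would still have to resolve the original capture separately, for example through the revisit record's `WARC-Refers-To` header when it is present.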

ibnesayeed avatar Jun 25 '19 20:06 ibnesayeed

I can see many use cases...

Can you describe some of these, @ibnesayeed?

machawk1 avatar Jul 02 '19 18:07 machawk1

@machawk1: Can you describe some of these, @ibnesayeed?

Here are a few off the top of my head, but I am sure there are more use cases:

  • Indexing WARCs for replay in tools like IPWB (or other replay systems) in a way that includes both request and response headers (currently, most replay systems completely ignore request records)
  • @edsu's recent Spark + warcio blog post demonstrates the need and how they worked around it on a large scale
  • A tool like WARC2HAR (if and when built) would need this functionality
  • Our recent "Supporting Web Archiving via Web Packaging" would benefit from it if we were to use WARCs for storage and then create HTTP bundles of exchanges for effective replay

ibnesayeed avatar Jul 02 '19 18:07 ibnesayeed