warcat
Add an easy way to iterate over WARC records
I was surprised that the example provided in the documentation:
>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('example/at.warc.gz')
>>> len(warc.records)
reads everything into memory, and there is no easy way to iterate over records without loading them all at once.
In my case, WARC files take gigabytes of space, so I want to process them record by record without loading everything into memory.
After reading the source, I came up with this helper function:
import warcat.model

def readwarc(filename, types=('response',)):
    # Lazily yield (record, open_content) pairs, reading one record at a time.
    f = warcat.model.WARC.open(filename)
    has_more = True
    while has_more:
        record, has_more = warcat.model.WARC.read_record(f)
        # An empty `types` means "yield every record type".
        if not types or record.warc_type in types:
            if isinstance(record.content_block, warcat.model.BlockWithPayload):
                yield record, record.content_block.payload.get_file
            elif hasattr(record.content_block, 'binary_block'):
                yield record, record.content_block.binary_block.get_file
            else:
                yield record, record.content_block.get_file

for record, content in readwarc('pages.warc.gz'):
    with content() as f:
        pass  # process f
I think it would be really useful if Warcat provided an interface for lazy iteration over a whole WARC file. I imagine it would look something like this:
import warcat

for record in warcat.readrecords('pages.warc.gz'):
    with record.content() as f:
        pass  # process f
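To make the idea concrete without relying on Warcat internals, here is a minimal standard-library sketch of lazy, record-by-record iteration. It is not Warcat's implementation: the names `iter_warc_records` and `read_records` are hypothetical, and it assumes well-formed records (a `WARC/` version line, `Name: value` headers without continuation lines, a blank line, then `Content-Length` bytes of block). Each block is still read into memory, but only one record at a time.

```python
import gzip
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, block) pairs, reading one record at a time."""
    while True:
        line = stream.readline()
        # Skip the blank lines that separate consecutive records.
        while line in (b"\r\n", b"\n"):
            line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line, got {line!r}")

        # Read named header fields up to the blank line that ends them.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The block is exactly Content-Length bytes long.
        block = stream.read(int(headers["Content-Length"]))
        yield headers, block


def read_records(filename: str, types=("response",)):
    """Hypothetical helper: lazily yield records of the given WARC types."""
    opener = gzip.open if filename.endswith(".gz") else open
    with opener(filename, "rb") as f:
        for headers, block in iter_warc_records(f):
            # An empty `types` means "yield every record type".
            if not types or headers.get("WARC-Type") in types:
                yield headers, block
```

Because both functions are generators, `for headers, block in read_records('pages.warc.gz'): ...` only ever holds the current record, which is the behavior the proposed `warcat.readrecords` would need.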
Also, if I could get lxml, BeautifulSoup and JSON views from records, something like this:
for record in warcat.readrecords('pages.warc.gz'):
    record.lxml.xpath('//a')
    record.soup.select('a')
    record.json['a']
Then it would be really amazing.
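One way such accessors could work is to parse the payload only on first access and cache the result. The sketch below is an assumption about how this might look, not part of Warcat: `RecordView` is a hypothetical wrapper, and only the JSON view is shown so the example runs with the standard library alone (Python 3.8+ for `cached_property`).

```python
import json
from functools import cached_property


class RecordView:
    """Hypothetical wrapper exposing lazily decoded views of a record payload."""

    def __init__(self, payload: bytes, encoding: str = "utf-8"):
        self.payload = payload
        self.encoding = encoding

    @cached_property
    def text(self) -> str:
        # Decoded once on first access, then stored on the instance.
        return self.payload.decode(self.encoding, errors="replace")

    @cached_property
    def json(self):
        # Parsed only if the caller actually asks for JSON.
        return json.loads(self.text)


view = RecordView(b'{"a": [1, 2]}')
print(view.json["a"])
```

`lxml` and `soup` properties could follow the same pattern, for example by returning `lxml.html.fromstring(self.payload)` and `bs4.BeautifulSoup(self.payload, 'html.parser')`; both need third-party packages, so they would probably have to be optional dependencies.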
If you agree with the suggested API, I can create a pull request with the implementation.
Sure, I think that sounds great!
Is there any update on this suggestion? I encountered the same problem when loading large files:
warcat/util.py", line 66, in find_file_pattern
    raise ValueError('Search for pattern exhausted')
ValueError: Search for pattern exhausted