cdxj-indexer
Extracting page titles / URLs from cdxj
Hello!
Apologies if this is a silly question, but I'm wondering whether cdxj-indexer can generate a list of pages (and ideally their titles) from a WARC file? I'm thinking of something analogous to the 'Pages' tab in replayweb.page, which lists the captured pages and their titles rather than all of the many digital objects that make them up. I wondered whether there is an HTTP header I could ask cdxj-indexer to include that would help with this, but I don't see anything obvious.
The use case is that it would be great to provide a human-friendly list of pages on our archive catalogue entries. At the moment I've been generating CDXJ files with the default settings, but I can see researchers finding these confusing.
Many thanks,
Jake
I don't think this is something cdxj-indexer should support, since it does one thing (generate CDXJ files) and does it pretty well. You'd probably be better off writing a small custom utility that uses warcio directly. Getting a page title requires parsing the responses that are HTML with something like BeautifulSoup. This seems to work ok:
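import sys

import bs4
from warcio.archiveiterator import ArchiveIterator

warc_file = sys.argv[1]

with open(warc_file, 'rb') as stream:
    for record in ArchiveIterator(stream):
        # Only HTTP responses can be captured pages.
        if record.rec_type != 'response' or not record.http_headers:
            continue
        content_type = record.http_headers.get_header('content-type', '')
        if 'html' not in content_type.lower():
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        # content_stream() decodes gzip/chunked bodies; raw_stream does not.
        doc = bs4.BeautifulSoup(record.content_stream().read(), 'html.parser')
        # Some pages have no <title>, so guard against None.
        title = doc.title.string.strip() if doc.title and doc.title.string else ''
        print(f"{url}\t{title}")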
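For a catalogue entry you may want to tidy that output further, since a WARC often holds redirects, error pages and repeat captures of the same URL. Here is a rough sketch of one way to do that, not anything cdxj-indexer or warcio provides. The names `build_page_list` and `write_page_list` are made up for illustration, and it assumes you have also collected each response's HTTP status alongside the URL and title (I believe warcio exposes this as `record.http_headers.get_statuscode()`, but check against your version).

```python
import csv
import io


def build_page_list(captures):
    """Reduce (url, status, title) captures to one (url, title) row per page.

    Hypothetical helper: `captures` is any iterable of 3-tuples gathered
    while iterating over the WARC.
    """
    pages = {}
    for url, status, title in captures:
        # Skip redirects and error pages; only keep successful captures.
        if str(status) != "200":
            continue
        title = (title or "").strip()
        # Keep the first capture of each URL, unless it had no title
        # and a later capture of the same URL does.
        if url not in pages or (not pages[url] and title):
            pages[url] = title
    return sorted(pages.items())


def write_page_list(pages, out):
    """Write the page list as CSV with a header row."""
    writer = csv.writer(out)
    writer.writerow(["url", "title"])
    writer.writerows(pages)


if __name__ == "__main__":
    # Made-up sample data standing in for what the WARC loop would collect.
    sample = [
        ("https://example.org/", "200", "Home"),
        ("https://example.org/old", "301", ""),
        ("https://example.org/", "200", ""),
    ]
    buffer = io.StringIO()
    write_page_list(build_page_list(sample), buffer)
    print(buffer.getvalue())
```

Filtering on status 200 is a judgment call: it drops redirect stubs and 404 pages, but you may want to keep them if researchers need to see what was attempted rather than only what was captured successfully.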