Extracting page titles / URLs from cdxj

Open • jakebickford opened this issue 4 years ago • 1 comment

Hello!

Apologies if this is a silly question, but I'm wondering whether cdxj-indexer can generate a list of pages (and potentially their titles) from a WARC file. I am thinking of something analogous to the 'Pages' tab in replayweb.page, which lists the captured pages and their titles rather than all of the many digital objects that make them up. I wondered whether there is an HTTP header that cdxj-indexer could use for this, but I don't see anything obvious.

The use case is that it would be great to provide a human-friendly list of pages on our archive catalogue entries. At the moment I've been generating CDXJ files with the default settings, but I can see researchers finding these confusing.

Many thanks,

Jake

jakebickford, Apr 06 '21 11:04

I don't think this is something cdxj-indexer should support, since it does one thing (generate CDXJ files) and does it pretty well. You might be better off writing a small custom utility that uses warcio directly. Getting a page title requires parsing the HTML responses with something like BeautifulSoup. This seems to work OK:

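  import sys

  import bs4
  from warcio.archiveiterator import ArchiveIterator

  warc_file = sys.argv[1]

  with open(warc_file, 'rb') as stream:
      for record in ArchiveIterator(stream):
          # Only HTTP response records can be captured pages
          if record.rec_type != 'response' or not record.http_headers:
              continue

          content_type = record.http_headers.get_header('Content-Type', '')
          if 'html' not in content_type:
              continue

          url = record.rec_headers.get_header('WARC-Target-URI')
          # content_stream() decodes chunked / compressed payloads,
          # which raw_stream would hand to the parser untouched
          html = record.content_stream().read()
          doc = bs4.BeautifulSoup(html, 'html.parser')
          # Not every HTML response has a <title>
          title = doc.title.get_text(strip=True) if doc.title else ''
          print(f"{url}\t{title}")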

edsu, Jun 08 '22 16:06
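
For the catalogue use case, the script's output will still include redirects, error pages and repeat captures of the same URL. A possible follow-up step, sketched below with the standard library only, filters the extracted rows down to likely pages and writes them as a tab-separated list. The `Capture` tuple, the `likely_pages` and `write_page_list` helpers, and the filtering heuristic (status 200, HTML content type, first capture per URL) are all assumptions made for illustration; this is not how replayweb.page builds its Pages tab.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class Capture(NamedTuple):
    # Hypothetical container for fields pulled out of each WARC response record
    url: str
    status: str          # HTTP status code as a string, e.g. "200"
    content_type: str
    title: str


def likely_pages(captures: Iterable[Capture]) -> Iterator[Capture]:
    """Yield the first successful HTML capture of each URL (a heuristic)."""
    seen = set()
    for cap in captures:
        if cap.status != "200":
            continue  # skip redirects, errors, and so on
        if "html" not in cap.content_type.lower():
            continue  # skip images, stylesheets, scripts, ...
        if cap.url in seen:
            continue  # keep only the first capture of a URL
        seen.add(cap.url)
        yield cap


def write_page_list(captures: Iterable[Capture], out) -> None:
    """Write a tab-separated url/title list with a header row."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["url", "title"])
    for cap in likely_pages(captures):
        writer.writerow([cap.url, cap.title or "(untitled)"])


if __name__ == "__main__":
    sample = [
        Capture("http://example.com/", "200", "text/html; charset=utf-8", "Home"),
        Capture("http://example.com/style.css", "200", "text/css", ""),
        Capture("http://example.com/old", "301", "text/html", ""),
    ]
    buf = io.StringIO()
    write_page_list(sample, buf)
    print(buf.getvalue(), end="")
```

In the warcio loop, the status string could be taken from `record.http_headers.get_statuscode()`, and each `Capture` built where the script currently calls `print`.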