openwayback

Support WARC conversion records

Open PsypherPunk opened this issue 11 years ago • 4 comments

Spreadsheet line: 5. For whom: BNF, BL, DN. Notes: CDX/indexing consequences. Est. milestone: Ilya to check.

PsypherPunk avatar Dec 18 '13 10:12 PsypherPunk

Here is some background, I believe.

https://groups.google.com/d/msg/openwayback-dev/tIqz5zB2GYE/LE8wi-dwzhsJ
http://sourceforge.net/mailarchive/message.php?msg_id=31011222
http://sourceforge.net/mailarchive/message.php?msg_id=27199087

egh avatar Jan 09 '14 20:01 egh

I'm a graduate student at the Oxford Internet Institute. We're using Heritrix for various web crawling activities (not necessarily for long-term archiving) and are producing conversion records for various items (mainly to extract text from .pdf and .doc files for automated analysis).

I have read the background papers highlighted on this thread, and it seems to me that the extensions to openwayback and the CDX format proposed would be tricky for the following reasons:

  • The WARC format has no standardised way of recording what processing has been done to a conversion record, as far as I can tell
  • Ditto for the original fetch date (as my colleagues and I are converting essentially on fetch, we're keeping the fetch date from the original record, which seems reasonable)
  • There may not, in fact, be a WARC-Refers-To header, as the specification only asks for one "wherever possible": this would break the suggested linking of records anyway
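To illustrate the last point, here is a minimal Python sketch (not OpenWayback code) of an indexer trying to link a conversion record back to its original. The function names and the sample records are hypothetical; only the WARC field names (WARC-Type, WARC-Refers-To and so on) come from the WARC format. The sketch ignores folded continuation lines.

```python
from typing import Optional

# Hypothetical sample header blocks; the URIs and record IDs are made up.
WITH_LINK = """WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://example.org/report.pdf
WARC-Date: 2014-04-10T01:04:00Z
WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000002>
WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000001>
Content-Type: text/plain
"""

WITHOUT_LINK = """WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://example.org/report.pdf
WARC-Date: 2014-04-10T01:04:00Z
WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000003>
Content-Type: text/plain
"""


def parse_warc_headers(block: str) -> dict:
    """Parse one WARC header block into a dict keyed by lower-cased field name."""
    headers = {}
    for line in block.strip().splitlines()[1:]:  # skip the "WARC/1.0" version line
        if not line.strip():
            break  # a blank line ends the header block
        name, _, value = line.partition(":")  # split on the first colon only
        headers[name.strip().lower()] = value.strip()
    return headers


def original_record_id(headers: dict) -> Optional[str]:
    """Return the ID of the converted-from record, or None if it cannot be linked."""
    if headers.get("warc-type") != "conversion":
        return None
    # The field is optional, so an indexer has to cope with its absence.
    return headers.get("warc-refers-to")
```

With the second sample, `original_record_id` returns `None`, and any scheme that relies on following the link has nothing to follow.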

Obviously these challenges can be overcome, but it'd be a lot of work. In the meantime, it strikes me that conversion records could at least be handled like resource records: they have all the usual headers and getting openwayback to index them would therefore be trivial. For use cases like ours, that is all that would be required (by indexing the conversion records in place of the original responses).

Would such an interim change be welcomed? Not that I have any objection to keeping it as a separate fork if it isn't...

pmyteh avatar Apr 10 '14 01:04 pmyteh

I think the issues mirror to some extent the ones we discovered with regard to revisit records. There, the assumption was made (unreasonably, in practice) that WARC record IDs could be used to refer to other WARC records.

Probably the first thing we need to do here is to chart what issues exist and then propose (minimal) amendments (ideally additions) to the WARC spec to deal with them.

kris-sigur avatar Apr 10 '14 10:04 kris-sigur

OK, I've started with the trivial interim changes. Using the BDB indexer, at least, the conversion records were already being added to the index, but clicking on one in the search results window generated a bad record error. A one-line change to the WARC code to mirror the existing handling of resource records seems to have done the job, and things are working well. The change is on the conversionrecords branch of the pmyteh/openwayback fork if anyone wants to play with it.
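For anyone who wants the idea without reading the Java, here is a rough Python sketch of what "mirror the handling of resource records" amounts to. It is not the actual OpenWayback change: `PAYLOAD_ONLY_TYPES`, `to_capture` and the capture dict are hypothetical names. The assumption is that both record types carry their payload directly in the record block, so the URL, date and MIME type can all be read from the WARC headers.

```python
from datetime import datetime

# Record types whose block is the payload itself (no wrapped HTTP response).
# Treating "conversion" like "resource" is the interim change described above.
PAYLOAD_ONLY_TYPES = {"resource", "conversion"}


def to_capture(headers: dict) -> dict:
    """Build a minimal capture entry from the WARC headers of a payload-only record."""
    warc_type = headers["WARC-Type"]
    if warc_type not in PAYLOAD_ONLY_TYPES:
        raise ValueError(f"not a payload-only record type: {warc_type}")
    # WARC-Date is assumed to be in the common second-precision UTC form.
    when = datetime.strptime(headers["WARC-Date"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "url": headers["WARC-Target-URI"],
        "timestamp": when.strftime("%Y%m%d%H%M%S"),  # 14-digit replay timestamp
        # The MIME type comes from the WARC Content-Type, minus any parameters.
        "mime": headers.get("Content-Type", "unk").split(";")[0].strip(),
    }
```

Called on a conversion record dated 2014-04-10T01:04:00Z with Content-Type `text/plain; charset=utf-8`, this yields the timestamp `20140410010400` and the MIME type `text/plain`. Response records are rejected here because their real MIME type sits in the wrapped HTTP headers.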

I'll do some more testing and then make a pull request; it seems strictly better than the current behaviour even if it doesn't implement all the conversion record features that people want.

pmyteh avatar Apr 14 '14 15:04 pmyteh