Parsing CC-NEWS
When parsing WARC, we assume that every record has a TREC ID field. This is not true for CC-NEWS, so the parser needs to take that into account. A CC-NEWS record header looks like this:
WARC/1.0
WARC-Record-ID: <urn:uuid:4e7b712c-f7fd-4e31-81ad-b6b8b9a190c8>
Content-Length: 38840
WARC-Date: 2016-12-01T13:28:43Z
WARC-Type: response
WARC-Target-URI: http://www.banker.bg/finansov-dnevnik/read/parlamentut-prie-biudjetite-na-vsichki-ministerstva?utm_source=rss&utm_medium=click&utm_campaign=rss
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha1:NKWIH3RIPHRCTJ7F7VM45KZFUR5DCXMW
WARC-Block-Digest: sha1:EOKDPEU3LQ7XESEPNRQO34YDAEDADIL6
We should modify our Document_Record to have a member function like title or id, something more generic than trecid: this would be a (unique?) text identifier of a document. GOV2 or ClueWeb would use the TREC ID, while CC-NEWS would maybe use the URL?
This complicates things because ClueWeb and CC-NEWS are both WARC but would return that field in different ways. That means we'd have to decouple parsing records (WARC, TREC, plaintext) from the actual object/function used by the builder.
That actually sounds like a good idea: why should we require every specific parsing method (which could be an external library, potentially not ours) to conform exactly to our interface?
Maybe we should rather just use an independent container object here: https://github.com/pisa-engine/pisa/blob/master/src/parse_collection.cpp#L47
It would have only the fields/accessors we actually need. We'd then also get rid of the type erasure and simply move the members to another object.
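To make the container idea concrete, here is a minimal sketch. All names in it (Document, Warc_Like_Record, from_trec_warc, from_cc_news) are hypothetical and not taken from the PISA code base; Warc_Like_Record merely stands in for whatever record type a parser returns. The point is only that the choice of identifier lives in a small conversion function, not in the parser.

```cpp
#include <string>
#include <utility>

// Hypothetical plain container with only the members the builder needs.
struct Document {
    std::string title;    // text identifier: TREC ID, URL, ...
    std::string content;  // body passed on to tokenization
    std::string url;      // may be empty for collections without URLs
};

// Hypothetical stand-in for a record produced by some WARC parser,
// possibly one from an external library.
struct Warc_Like_Record {
    std::string trecid;      // empty for CC-NEWS
    std::string target_uri;  // value of WARC-Target-URI
    std::string body;
};

// ClueWeb-style WARC: the TREC ID becomes the title.
inline Document from_trec_warc(Warc_Like_Record rec) {
    return Document{std::move(rec.trecid), std::move(rec.body), std::move(rec.target_uri)};
}

// CC-NEWS: there is no TREC ID, so the target URI doubles as the title.
inline Document from_cc_news(Warc_Like_Record rec) {
    std::string title = rec.target_uri;  // copy, since url keeps the original
    return Document{std::move(title), std::move(rec.body), std::move(rec.target_uri)};
}
```

With something like this, the builder would only ever see Document, and no type erasure is needed because each parser is paired with a plain function that moves the members over.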
That would mean we need to either differentiate between the trecwarc and newscc formats, or allow specifying which WARC field serves as the title identifier, e.g., --title-field url.
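A rough sketch of the second option, assuming the header fields of a record are available as a string map. The names Warc_Fields, Title_Field, parse_title_field, and title_of are made up for illustration, and the accepted values "trecid" and "url" are an assumption about what such a flag could take. WARC-Target-URI appears in the header quoted above; WARC-TREC-ID is assumed here to be the field ClueWeb uses.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical representation of the named fields of one WARC record.
using Warc_Fields = std::map<std::string, std::string>;

// Hypothetical set of values accepted by a --title-field option.
enum class Title_Field { trecid, url };

// Maps the command-line value to the enum; nullopt for unknown values.
inline std::optional<Title_Field> parse_title_field(std::string const& value) {
    if (value == "trecid") { return Title_Field::trecid; }
    if (value == "url") { return Title_Field::url; }
    return std::nullopt;
}

// Returns the chosen identifier, or nullopt if the record lacks that field.
// The lookup is case-sensitive, which is a simplification in this sketch.
inline std::optional<std::string> title_of(Warc_Fields const& fields, Title_Field which) {
    std::string const key =
        which == Title_Field::trecid ? "WARC-TREC-ID" : "WARC-Target-URI";
    auto pos = fields.find(key);
    if (pos == fields.end()) { return std::nullopt; }
    return pos->second;
}
```

Returning an optional leaves it to the caller to decide whether a record without the requested field should be skipped or treated as an error.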
Can possibly be closed now, see #468 and #397