warc-specifications
WARC records for transcluded resources
Currently, if you want to know if one record was discovered embedded in another, e.g. an image required during rendering, that information is only held in the crawl log. When you are later trying to enable access to parts of a collection, it is very useful to know which items are required by another in order for rendering to succeed. Furthermore, as we progress towards browser-based archiving, this information becomes more reliable and more useful.
But, AFAICT, there is no way to record this in the WARCs. Perhaps we could use a new field that works a bit like WARC-Concurrent-To:
WARC-Transcluded-By: <urn:uuid:7e2b792d-3048-4c72-9b3e-c32db0e30091>
This would indicate that the WARC record was found to be transcluded by the response record with the given WARC-Record-ID.
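To make the proposal concrete, here is a rough Python sketch of how a writer might emit the field. `WARC-Transcluded-By` is the proposed header and is not in the WARC standard; `build_warc_headers` and `serialize` are hypothetical helper names, and the sketch omits the mandatory `Content-Length` and the record block for brevity.

```python
import uuid
from datetime import datetime, timezone


def build_warc_headers(target_uri, transcluded_by=None, record_type="response"):
    """Return WARC named fields as a list of (name, value) pairs."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
    ]
    if transcluded_by is not None:
        # Proposed field from this issue, NOT part of the WARC standard.
        headers.append(("WARC-Transcluded-By", transcluded_by))
    return headers


def serialize(headers):
    """Serialize the version line and named fields, CRLF-terminated."""
    lines = ["WARC/1.1"] + [f"{name}: {value}" for name, value in headers]
    return "\r\n".join(lines) + "\r\n\r\n"


if __name__ == "__main__":
    page_id = "<urn:uuid:7e2b792d-3048-4c72-9b3e-c32db0e30091>"
    print(serialize(build_warc_headers("http://example.org/logo.png", page_id)))
```

The value is the WARC-Record-ID of the embedding page's response record, mirroring how WARC-Concurrent-To points at another record.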
What if you found a resource via URL a, but it is also required by URL b?
Yes, that would be a WARC revisit record with a WARC-Transcluded-By field, I think.
I realised getting H3 to support that might not be plausible, but I'd like to be able to support it in other tools.
No, I mean: you crawl a, which leads to resource x, which is then marked as 'already included'. The crawler then crawls b, extracts the link to x, and tosses it out since it's 'already included'.
Nothing gets written to the WARC about the relationship between b and x.
That's what I mean. H3's frontier logic can't support this reliably. But warcprox could, for example.
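A toy Python model of the difference being discussed, for illustration only. It is not H3's actual frontier code or warcprox's actual behaviour; both function names are hypothetical, and emitting a revisit record for every repeat fetch is an assumption made to keep the example short.

```python
def crawl_with_dedup(pages):
    """Frontier-style crawl: a URL already seen is dropped before fetching.

    pages maps each page URL to the list of URLs it embeds.
    Returns the (embedded_url, embedding_page) pairs that get written.
    """
    seen = set()
    written = []
    for page, embeds in pages.items():
        for url in embeds:
            if url in seen:
                # Tossed as 'already included': nothing is written,
                # so the relationship to this page is lost.
                continue
            seen.add(url)
            written.append((url, page))
    return written


def record_every_fetch(pages):
    """Proxy-style recording: every request the page makes yields a record."""
    seen = set()
    written = []
    for page, embeds in pages.items():
        for url in embeds:
            record_type = "revisit" if url in seen else "response"
            seen.add(url)
            # Each record could carry WARC-Transcluded-By for its page.
            written.append((record_type, url, page))
    return written


pages = {"a": ["x"], "b": ["x"]}
print(crawl_with_dedup(pages))    # only the a -> x relationship survives
print(record_every_fetch(pages))  # both a -> x and b -> x are recorded
```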
@anjackson Fair enough.
How would this, practically speaking, be useful for replay or later use? It seems similar to the Via metadata field that's already in use, but it uses a WARC record ID instead of the URL, making it less useful. In fact, wasn't that the reason (lack of WARC record ID indexes) that the WARC-Refers-To header was abandoned in favor of WARC-Refers-To-Target-URI / WARC-Refers-To-Date?
Just noting this proposal has some similarities to WARC-Push-Promised-From (#43).
In relation to Via: ~~You have to have an index on the record id to be able to find the metadata record containing Via anyway~~ [edit: not true. I guess we're going to need -URI and -Date versions of every new linking field]. One thing that's better about this proposal than Via is it identifies which specific version of the referring resource it came from.
It's kind of a pity record ids are opaque. If we used something like target-uri+date+seqno as the record id it would neatly solve the index problem. Perhaps that'd introduce a new set of problems though.
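As a rough sketch of what a non-opaque record ID could look like in Python: the `urn:x-record:` scheme and both function names below are made up for illustration and are not part of any specification. The point is only that the target URI and date can be recovered from the ID itself, so a normal URL+date index is enough to resolve it.

```python
from urllib.parse import quote, unquote

PREFIX = "<urn:x-record:"  # hypothetical scheme, for illustration only


def make_record_id(target_uri, warc_date, seqno):
    """Pack target URI, date and sequence number into a record ID."""
    # Percent-encode the URI fully so '/' can be used as the separator.
    return f"{PREFIX}{warc_date}/{seqno}/{quote(target_uri, safe='')}>"


def parse_record_id(record_id):
    """Recover (target_uri, warc_date, seqno) from an ID made above."""
    body = record_id[len(PREFIX):-1]
    warc_date, seqno, quoted_uri = body.split("/", 2)
    return unquote(quoted_uri), warc_date, int(seqno)


rid = make_record_id("http://example.org/logo.png", "2017-03-01T12:00:00Z", 0)
print(rid)
print(parse_record_id(rid))
```

One of the new problems hinted at above: such IDs are only unique if the writer can guarantee the sequence number is unique for a given URI and timestamp.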