warc-specifications

WARC records for transcluded resources

Open anjackson opened this issue 11 years ago • 7 comments

Currently, if you want to know if one record was discovered embedded in another, e.g. an image required during rendering, that information is only held in the crawl log. When you are later trying to enable access to parts of a collection, it is very useful to know which items are required by another in order for rendering to succeed. Furthermore, as we progress towards browser-based archiving, this information becomes more reliable and more useful.

But, AFAICT, there is no way to record this in the WARCs. Perhaps we could use a new field that works a bit like WARC-Concurrent-To:

WARC-Transcluded-By: <urn:uuid:7e2b792d-3048-4c72-9b3e-c32db0e30091>

This would indicate that the resource in this WARC record was found to be transcluded by the response record with the given WARC-Record-ID.
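As a rough illustration of the proposal, here is a minimal sketch of how a tool might emit the header block for such a record. It uses only the Python standard library; `build_warc_headers` and `new_record_id` are hypothetical helper names, `WARC-Transcluded-By` is the field proposed in this issue (it is not part of the WARC standard), and only the header block is built, not the payload.

```python
import uuid
from datetime import datetime, timezone


def new_record_id() -> str:
    """Return a fresh record ID in the usual <urn:uuid:...> form."""
    return f"<urn:uuid:{uuid.uuid4()}>"


def build_warc_headers(target_uri, record_type, content_length,
                       transcluded_by=None, date=None):
    """Serialize a WARC header block, optionally with the proposed field."""
    date = date or datetime.now(timezone.utc)
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", new_record_id()),
        ("WARC-Date", date.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Length", str(content_length)),
    ]
    if transcluded_by is not None:
        # Hypothetical field from this proposal; not defined by the standard.
        headers.append(("WARC-Transcluded-By", transcluded_by))
    lines = ["WARC/1.0"] + [f"{name}: {value}" for name, value in headers]
    # Header lines end with CRLF; a blank line closes the header block.
    return "\r\n".join(lines) + "\r\n\r\n"


# Record ID of the page whose rendering required the image.
page_id = "<urn:uuid:7e2b792d-3048-4c72-9b3e-c32db0e30091>"
block = build_warc_headers(
    "http://example.com/logo.png", "response", 1234,
    transcluded_by=page_id,
    date=datetime(2014, 9, 25, 15, 9, 0, tzinfo=timezone.utc),
)
print(block)
```

The field is optional in the sketch, so records for top-level pages would simply omit it.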

anjackson avatar Sep 25 '14 15:09 anjackson

What if you found a resource via URL a, but it is also required by URL b?

kris-sigur avatar Sep 25 '14 15:09 kris-sigur

Yes, that would be a WARC revisit record with a WARC-Transcluded-By field, I think.

I realised getting H3 to support that might not be plausible, but I'd like to be able to support it in other tools.

anjackson avatar Sep 25 '14 15:09 anjackson

No, I mean: you crawl a, and it leads to resource x, which is then marked as 'already included'. The crawler then crawls b, extracts a link to x, and tosses it out since it's 'already included'.

Nothing gets written to the WARC about the relationship between b and x.

kris-sigur avatar Sep 25 '14 15:09 kris-sigur

That's what I mean. H3's frontier logic can't support this reliably. But warcprox could, for example.
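To make the a/b/x scenario concrete, here is a toy sketch of a recorder that sees every fetch (as a recording proxy would) and so can write a revisit record carrying the relationship between b and x. The `Recorder` class, its record dictionaries, and the `urn:example:` IDs are all hypothetical and do not reflect how warcprox or H3 actually work; the sketch also skips the payload digest comparison a real tool would need before writing a revisit.

```python
from dataclasses import dataclass, field


@dataclass
class Recorder:
    """Toy recorder that is told about every fetch, including repeats."""
    seen: dict = field(default_factory=dict)     # URI -> ID of first capture
    records: list = field(default_factory=list)  # header dicts, in write order
    _next: int = 0

    def _new_id(self) -> str:
        self._next += 1
        return f"<urn:example:{self._next}>"

    def capture(self, uri, transcluded_by=None) -> str:
        record_id = self._new_id()
        if uri in self.seen:
            # Already captured: write a revisit pointing at the first capture.
            rec = {"WARC-Type": "revisit", "WARC-Refers-To": self.seen[uri]}
        else:
            rec = {"WARC-Type": "response"}
            self.seen[uri] = record_id
        rec["WARC-Record-ID"] = record_id
        rec["WARC-Target-URI"] = uri
        if transcluded_by is not None:
            # Hypothetical field from this proposal.
            rec["WARC-Transcluded-By"] = transcluded_by
        self.records.append(rec)
        return record_id


recorder = Recorder()
a_id = recorder.capture("http://example.com/a")
x_id = recorder.capture("http://example.com/x.png", transcluded_by=a_id)
b_id = recorder.capture("http://example.com/b")
recorder.capture("http://example.com/x.png", transcluded_by=b_id)

for rec in recorder.records:
    print(rec)
```

A frontier that discards x on the second sighting never reaches the second `capture` call for x, which is why nothing about b and x would end up in the WARC.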

anjackson avatar Sep 25 '14 15:09 anjackson

@anjackson Fair enough.

kris-sigur avatar Sep 25 '14 15:09 kris-sigur

How would this, practically speaking, be useful for replay or later use? It seems similar to the Via metadata field that's already in use, but it uses a WARC record ID instead of the URL, making it less useful. In fact, wasn't that the reason (lack of WARC-Record-ID indexes) that the WARC-Refers-To header was abandoned in favor of WARC-Refers-To-Target-URI / WARC-Refers-To-Date?

ikreymer avatar Sep 29 '14 22:09 ikreymer

Just noting this proposal has some similarities to WARC-Push-Promised-From (#43).

In relation to Via: ~~You have to have an index on the record id to be able to find the metadata record containing Via anyway~~ [edit: not true. I guess we're going to need -URI and -Date versions of every new linking field]. One thing that's better about this proposal than Via is that it identifies which specific version of the referring resource it came from.

It's kind of a pity record ids are opaque. If we used something like target-uri+date+seqno as the record id it would neatly solve the index problem. Perhaps that'd introduce a new set of problems though.
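A rough sketch of what such a non-opaque ID might look like, purely for illustration. The `urn:example:warc:` layout and the `make_record_id` / `parse_record_id` helpers are made up here, not anything from the WARC standard; the point is only that the target URI and date could be recovered from the ID itself and looked up in an ordinary URL+date index.

```python
from datetime import datetime, timezone
from urllib.parse import quote, unquote

PREFIX = "urn:example:warc:"  # hypothetical namespace for this sketch


def make_record_id(target_uri: str, date: datetime, seqno: int) -> str:
    """Compose a record ID from target URI, capture date (UTC) and seqno."""
    timestamp = date.strftime("%Y%m%d%H%M%S")
    # Percent-encode the whole URI so its ':' and '/' cannot clash with
    # the separators used in the ID.
    return f"<{PREFIX}{timestamp}:{seqno}:{quote(target_uri, safe='')}>"


def parse_record_id(record_id: str):
    """Recover (target_uri, date, seqno) from an ID built above."""
    body = record_id.strip("<>")
    if not body.startswith(PREFIX):
        raise ValueError(f"not a structured record id: {record_id!r}")
    timestamp, seqno, encoded_uri = body[len(PREFIX):].split(":", 2)
    date = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return unquote(encoded_uri), date, int(seqno)


captured = datetime(2014, 9, 25, 15, 9, 0, tzinfo=timezone.utc)
record_id = make_record_id("http://example.com/x.png", captured, 1)
print(record_id)
print(parse_record_id(record_id))
```

One cost visible even in this sketch is that IDs grow with the URI length, and the writer has to guarantee that the sequence number is unique for a given URI and second.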

ato avatar Jul 17 '18 06:07 ato