warc-specifications Archival record standard for client-side archiving

The WARC standard is great for encapsulating web archiving that occurs over a network connection (e.g. request/response). As the web becomes more complex and more processing happens on the client side, it is becoming increasingly useful to perform client- or browser-based archiving. In such scenarios, the HTTP request and response objects are not necessarily available, and standardization is needed for how best to represent such data.

One good example is simply capturing the final HTML DOM as rendered by a browser by storing the value of document.documentElement.outerHTML.

Indeed there are already tools which store the final HTML DOM, such as archive.today, and various browser proxy extensions, which have access only to the DOM.

A clear WARC record type/use case is needed to properly indicate such (and other) client-side captures. Either an existing WARC record type can be standardized for this purpose, or a new WARC record type for client-side capture can be created. Most importantly, it should be possible to distinguish an over-the-wire record from a client/browser rendering of a particular resource.

The standard response record is inappropriate, as a client-side capture does not represent a raw HTTP response but a rendered version of the response in a particular client. The name of the client should also be stored in the archival record.
Other standard record types that are possible include resource, metadata and even conversion. None of them quite fits. resource is perhaps the closest, but may require a custom URI format to differentiate it from a regular network capture. metadata does not really fit, as the record is not metadata about a record but a type of record in its own right. A case could be made for conversion, as the record is a conversion from a raw HTTP response to a client-side rendering, but that still doesn't quite make sense.
A new, custom record type could be created to indicate this type of capture, maybe rendered-response. Such a record could contain the following:
- The original url the record was retrieved from
- The time that the record was retrieved
- The client (user agent) that rendered this response
- The time the response was rendered (in addition to retrieval time)
- Method used for obtaining the record (e.g. outerText, plugin, etc.)
- Possibly other transformations that were performed (this requires more thought). For example, if the page contains scrolling, and the user scrolled two pages, this could be recorded in such a record.
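
To make the list above more concrete, here is a rough sketch of how such a record might be serialized. It is only an illustration: `rendered-response` is the type proposed in this issue and not a standard WARC type, and the header names `WARC-Rendered-Date`, `WARC-Rendering-Client` and `WARC-Capture-Method` are made up here to stand in for the listed fields. The function name and arguments are likewise hypothetical.

```python
import uuid
from datetime import datetime, timezone


def build_rendered_record(url, dom_html, retrieved, rendered, client, method):
    """Serialize one WARC-style record holding a rendered DOM snapshot."""
    block = dom_html.encode("utf-8")
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    headers = [
        # Proposed in this issue; NOT one of the standard WARC record types.
        ("WARC-Type", "rendered-response"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Target-URI", url),
        ("WARC-Date", retrieved.strftime(fmt)),
        # Hypothetical header names for the fields listed above.
        ("WARC-Rendered-Date", rendered.strftime(fmt)),
        ("WARC-Rendering-Client", client),
        ("WARC-Capture-Method", method),
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(block))),
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % h for h in headers) + "\r\n"
    # A record is the header section, the block, then two CRLFs.
    return head.encode("utf-8") + block + b"\r\n\r\n"


record = build_rendered_record(
    "http://example.com/",
    "<html><body>Hello</body></html>",
    datetime(2014, 9, 29, 22, 9, 0, tzinfo=timezone.utc),
    datetime(2014, 9, 29, 22, 9, 5, tzinfo=timezone.utc),
    "ExampleBrowser/1.0",
    "outerHTML",
)
```

Note that there is deliberately no HTTP header section inside the block, only the DOM text itself.
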
It probably makes sense to have a record type that is the plain DOM, since that makes it easily accessible and searchable, although the HAR format is another alternative and provides a way to store embedded content.
It is important to establish the provenance of a client-side resource as accurately as possible, since client-side resources may vary from one client to another, e.g. the same URL rendered in one browser, or for one user, may be quite different from the same URL in another browser. Client-side resources are also susceptible to (manual or accidental) changes by the user. It could be argued that client-side resources are usually not as authentic as regular network capture resources, but this need not be the case. (More thought is needed here.)

Hopefully this is enough to start a discussion of this issue.

Sep 29 '14 22:09 ikreymer

Oops, I missed the included recording-screenshots.md page when I first posted this.

Thinking about that spec, I think both the HAR and a plain DOM record could be useful in different circumstances.

A HAR record may be useful for tools wishing to analyze the entire captured page at once, without having to load it.

One of the drawbacks of the HAR format is that the DOM text is not easily available but must be decoded. A final DOM record may be more useful for analyzing text and more compatible with existing tools, since it would look more similar to existing records.
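
As a rough illustration of the decoding step, here is a minimal sketch that pulls the body text out of a single HAR entry. It relies on the `response.content.text` and `response.content.encoding` fields from the HAR format; the helper name, the sample entry, and the assumption that the decoded body is UTF-8 are all mine.

```python
import base64


def har_entry_text(entry):
    """Return the response body of one HAR entry as text."""
    content = entry["response"]["content"]
    text = content.get("text", "")
    if content.get("encoding") == "base64":
        # Assumption: the decoded bytes are UTF-8; undecodable bytes are replaced.
        return base64.b64decode(text).decode("utf-8", errors="replace")
    return text


# Hypothetical entry, made up for illustration.
entry = {
    "response": {
        "content": {
            "mimeType": "text/html",
            "encoding": "base64",
            "text": base64.b64encode(b"<html>hi</html>").decode("ascii"),
        }
    }
}
body = har_entry_text(entry)
```

With a plain DOM record, the block could be read as-is and none of this would be needed.
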

Also, it would be more useful for client-side proxy archiving tools, where embedded resources such as images have already been recorded in the WARC independently, are accessible by URL, and need not be included a second time as part of the HAR.

Sep 29 '14 23:09 ikreymer

Would the rendered-response record type include HTTP headers?

Oct 26 '15 21:10 ato

No, because this does not represent an HTTP response, but the current rendering. The HTTP headers may not be known (they are usually not accessible from JS in a browser) and may no longer be relevant (for example, the DOM may have been modified so that the original Content-Length is no longer correct), and there may not even be an HTTP response at all (the HTML may have been constructed on a blank page with document.write).

An update on the status of this: there is currently experimental rendered-response support in webrecorder. It uses the resource record, which seemed the closest of the existing record types, and additionally adds a WARC-Json-Metadata: {"snapshot", "html"} header.
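
For the reading side, a tool would need some way to tell such a snapshot apart from an ordinary resource record. A minimal sketch of that check follows, assuming the WARC headers have already been parsed into a dict. The value as written above is not valid JSON, so this sketch guesses that the intended value is the object `{"snapshot": "html"}`; that guess and the function name are assumptions, not a description of what webrecorder does.

```python
import json


def is_dom_snapshot(warc_headers):
    """Guess whether parsed WARC headers describe a rendered DOM snapshot."""
    if warc_headers.get("WARC-Type") != "resource":
        return False
    raw = warc_headers.get("WARC-Json-Metadata")
    if raw is None:
        return False
    try:
        meta = json.loads(raw)
    except ValueError:
        # Not valid JSON, so treat the record as an ordinary resource.
        return False
    # Assumption: the metadata is an object of the form {"snapshot": "html"}.
    return isinstance(meta, dict) and meta.get("snapshot") == "html"


snapshot_headers = {
    "WARC-Type": "resource",
    "WARC-Json-Metadata": '{"snapshot": "html"}',
}
plain_headers = {"WARC-Type": "resource"}
```

Here `is_dom_snapshot(snapshot_headers)` is true and `is_dom_snapshot(plain_headers)` is false, which is the over-the-wire versus rendered distinction asked for at the top of this issue.
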

This generally works now, including recursive iframe saving, though one remaining issue is what to do with dynamic iframes (those that only have a URL of about:blank). They could be saved under a custom generated URL, or embedded directly into the parent using iframe srcdoc=.
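
The two options could look roughly like this. This is only a sketch: both function names are hypothetical, and using a `urn:uuid:` URI as the generated URL is just one possible choice, not something settled here.

```python
import html
import uuid


def inline_as_srcdoc(iframe_html):
    """Option 1: embed the iframe's DOM in the parent via the srcdoc attribute."""
    # The markup must be escaped so it survives inside a quoted attribute value.
    return '<iframe srcdoc="%s"></iframe>' % html.escape(iframe_html, quote=True)


def generated_iframe_uri():
    """Option 2: mint a unique URI to store the iframe as its own record."""
    # Assumption: a urn:uuid: URI is acceptable as the record's target URI.
    return "urn:uuid:%s" % uuid.uuid4()


embedded = inline_as_srcdoc('<p class="a">x & y</p>')
```

The first option keeps everything in one record, while the second keeps the iframe addressable on its own but requires rewriting the parent to point at the generated URI.
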

I suppose using the resource record with added metadata may be adequate for now, so that a separate record type is not warranted, as long as there is consistent metadata to indicate the nature of the record.

Oct 26 '15 22:10 ikreymer