warc-specifications
                        WIP: Document practices that may benefit from standardisation
We have a few cases where different tools are implementing shared use cases in slightly different WARC record structures. The purpose of this issue is to collect information on these variations so we can at least document their usage and prevent any further unnecessary variation. Understanding current usage should also set the stage for standardisation.
Crawl-time rendering artefacts
A number of organisations are now running web browsers during the crawl, and this provides an opportunity to preserve more information about how a site looked at the time it was captured.
| WARC-Type | Content-Type | WARC-Target-URI | Tool | 
|---|---|---|---|
| resource | application/pdf, text/html, image/png | urn:X-wpull:snapshot?url=<ENCODED_URL> | wpull (also stores a WARC-Concurrent-To pointer to a snapshot action metadata record) | 
| resource | image/jpeg | screenshot:<CANONICAL_URL> | Brozzler code | 
| resource | image/jpeg | thumbnail:<CANONICAL_URL> | Brozzler code | 
| resource | image/jpeg | screenshot:<URL> | UKWA code | 
| resource | application/pdf | pdf:<URL> | UKWA code | 
| resource | image/jpeg | thumbnail:<URL> | UKWA code | 
| resource | text/html; charset="utf-8" | imagemap:<URL> | UKWA code | 
| resource | application/json | har:<URL> | UKWA code | 
| resource | text/html | onreadydom:<URL> | UKWA code | 
| resource | image/png | urn:view:<URL> | browsertrix-crawler | 
| resource | image/png | urn:fullPage:<URL> | browsertrix-crawler | 
| resource | image/jpeg | urn:thumbnail:<URL> | browsertrix-crawler | 
| conversion | text/html; charset="utf-8" | <URL> | crocoite* | 
| conversion | image/png | <URL> | crocoite* | 
| ? | ? | ? | UMBRA? | 
*Note that crocoite uses additional record headers to indicate the type of the conversion record, e.g. X-Crocoite-Type: dom-snapshot
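To illustrate why the variation matters for anyone reading these records back, here is a rough sketch of what a consumer currently has to do to recognise a rendering artefact from its WARC-Target-URI. The prefixes are copied from the table above; the names `PREFIXES`, `Artefact` and `classify_target_uri` are made up for this example and do not come from any of the tools listed. It cannot recognise crocoite's conversion records, since those use the plain URL and are only distinguishable by WARC-Type and the extra headers.

```python
from typing import NamedTuple, Optional

# Prefix -> (tool(s) using it, kind of artefact), taken from the table above.
# No prefix here is a prefix of another, so lookup order does not matter.
PREFIXES = {
    "urn:X-wpull:snapshot?url=": ("wpull", "snapshot"),
    "urn:view:": ("browsertrix-crawler", "view"),
    "urn:fullPage:": ("browsertrix-crawler", "fullPage"),
    "urn:thumbnail:": ("browsertrix-crawler", "thumbnail"),
    "screenshot:": ("Brozzler, UKWA", "screenshot"),
    "thumbnail:": ("Brozzler, UKWA", "thumbnail"),
    "pdf:": ("UKWA", "pdf"),
    "imagemap:": ("UKWA", "imagemap"),
    "har:": ("UKWA", "har"),
    "onreadydom:": ("UKWA", "onreadydom"),
}


class Artefact(NamedTuple):
    tools: str   # tool(s) known to use this prefix
    kind: str    # what the record holds
    target: str  # remainder of the URI, left exactly as found (not decoded)


def classify_target_uri(uri: str) -> Optional[Artefact]:
    """Return the artefact described by a WARC-Target-URI, or None if unknown."""
    for prefix, (tools, kind) in PREFIXES.items():
        if uri.startswith(prefix):
            return Artefact(tools, kind, uri[len(prefix):])
    return None


print(classify_target_uri("urn:fullPage:https://example.com/"))
print(classify_target_uri("https://example.com/"))
```

Note that the same kind of artefact (a thumbnail, say) shows up under two different prefixes depending on the tool, which is the sort of unnecessary variation this issue is trying to document.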
Web A/V Capture
| WARC-Type | Content-Type | WARC-Target-URI | Tool | 
|---|---|---|---|
| metadata | application/vnd.youtube-dl_formats+json | metadata://<AUTHORITY_AND_RESOURCE> | wpull, Heritrix3 ExtractorYoutubeDL module, Old Webrecorder | 
| metadata | application/vnd.youtube-dl_formats+json;charset=utf-8 | youtube-dl:<CANONICAL_URL> | Brozzler code | 
| resource | as found | youtube-dl:<PLAYLIST_INDEX>:<WEBPAGE_URL> | Brozzler code | 
| ? | ? | ? | Webrecorder | 
Crawl Logs
At UKWA we consider our crawl logs to be important artefacts, but we don't put them in WARC. Maybe we should?
| WARC-Type | Content-Type | WARC-Target-URI | Tool | 
|---|---|---|---|
| resource | text/plain | urn:X-wpull:log | wpull | 
| metadata | application/json | ? | crocoite | 
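As a starting point for discussion, here is a minimal sketch of what storing a crawl log as a resource record could look like, following the wpull row above (text/plain under urn:X-wpull:log). It uses only the Python standard library and writes the WARC/1.1 record framing by hand; `build_resource_record` is a hypothetical helper, not part of wpull or any other tool, and the log line is a made-up placeholder. A real writer would also want digests and a warcinfo reference.

```python
import uuid
from datetime import datetime, timezone


def build_resource_record(target_uri: str, content_type: str, payload: bytes) -> bytes:
    """Serialise a single uncompressed WARC/1.1 'resource' record."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    # Version line, named fields, then a blank line before the record block.
    head = "WARC/1.1\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Every record is terminated by two CRLFs after the block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


# Placeholder log content, for illustration only.
log = b"2023-10-18T00:00:00Z 200 https://example.com/\n"
record = build_resource_record("urn:X-wpull:log", "text/plain", log)
print(record.decode("utf-8"))
```

Whether a resource record (as wpull does) or a metadata record (as crocoite does) is the better fit is exactly the kind of question this issue is meant to surface.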
EDIT 2023-10-18: Updated with notes from comments.
For reference, crocoite uses conversion records to store screenshots and DOM snapshots, and metadata records to store log entries; see https://github.com/PromyLOPh/crocoite/blob/master/crocoite/warc.py#L216 and https://github.com/PromyLOPh/crocoite/blob/master/crocoite/warc.py#L201 and https://github.com/PromyLOPh/crocoite/blob/master/crocoite/warc.py#L236
Updating with Webrecorder's current practices for screenshots:
Crawl-time rendering artefacts
| WARC-Type | Content-Type | WARC-Target-URI | Tool | 
|---|---|---|---|
| resource | image/png | urn:view:<URL> | browsertrix-crawler | 
| resource | image/png | urn:fullPage:<URL> | browsertrix-crawler | 
| resource | image/jpeg | urn:thumbnail:<URL> | browsertrix-crawler | 
I've attempted to update this with the information you provided, @PromyLOPh @tw4l.