
Discussion of zstd and warcs

Open wumpus opened this issue 7 months ago • 21 comments

A discussion broke out in the End of Term Archive 2024 slack about zstd warcs. zstd is a relatively new compression format and should be discussed early and often before being adopted by the archive community. In the EOT 2024 case, Archive Team submitted many zstd warcs.

Proposal: https://iipc.github.io/warc-specifications/specifications/warc-zstd/

One comment brought up already is that the possible dictionary frame at the start of every warc might make playback slower.

Another comment is that the zstd dictionary frame is not a WARC record. That might not be a good choice when some (most?) warc tools don't support zstd, and will fail to correctly index or extract or replay any records from a zstd warc.

But the most important point that I'd like to make is that we should discuss these issues early and often, and not after petabytes of warc files are generated with a new format that was not discussed by the community.

wumpus avatar Sep 05 '25 05:09 wumpus

@ibnesayeed you had a positive comment about compression efficiency, I would love to have some examples.

wumpus avatar Sep 05 '25 05:09 wumpus

See also: https://github.com/iipc/warc-specifications/issues/53 (thanks @ikreymer)

wumpus avatar Sep 05 '25 05:09 wumpus

@ibnesayeed you had a positive comment about compression efficiency, I would love to have some examples.

We have done some real-world statistical analysis of a use case in which WARCs created by archiving pages from a single website at large scale produced substantial storage savings. The reduction in size when using Zstandard compression with a dictionary, relative to gzip compression, ranged from 30% to 50% (ignoring outliers). The dictionaries were trained on only 1,000 WARC records and were a couple of hundred kilobytes in size. Zstd compression without a dictionary yielded only about 1-3% savings relative to gzip.

ibnesayeed avatar Sep 05 '25 13:09 ibnesayeed

Another comment is that the zstd dictionary frame is not a WARC record. That might not be a good choice when some (most?) warc tools don't support zstd, and will fail to correctly index or extract or replay any records from a zstd warc.

While I understand that Zstd might not be a good fit for every occasion, and that it has limitations the community can discuss and seek creative solutions for, this specific comment does not feel like a fair criticism in my opinion. It is true that most WARC tools have not added support for Zstd yet, so they will fail to consume such WARCs, but that would be true for any other new compression algorithm, irrespective of the presence or absence of a dictionary. The dictionary frame in this case is part of the compression encoding layer; it is not attached to the uncompressed WARC file ahead of time. Hence, if the warc.zst file were uncompressed using a supporting tool, the result would be a pristine set of WARC records with no additional bits attached anywhere.

ibnesayeed avatar Sep 05 '25 13:09 ibnesayeed

One comment brought up already is that the possible dictionary frame at the start of every warc might make playback slower.

This is a fair point, but there are practical ways to address the overhead. One possibility is to use an LRU cache for dictionaries, indexed by the file name/path, which can eliminate the overhead almost completely if the access pattern draws WARC records from a finite set of WARC files. Another approach would be to GET multiple byte ranges in the same HTTP call, one for the dictionary at the beginning of the file and one for the specific WARC record, then locally tease those pieces apart and decompress.
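The LRU idea can be sketched with Python's stdlib cache. Here `fetch_dictionary_frame` is a hypothetical stand-in for the range request that reads the dictionary frame from the start of a remote `.warc.zst` file:

```python
# Sketch: cache the dictionary per WARC file path so repeated reads
# from the same file skip the extra dictionary fetch.
from functools import lru_cache

def fetch_dictionary_frame(warc_path: str) -> bytes:
    # Hypothetical helper: in practice this would be an HTTP range
    # request for the dictionary frame at the start of the file.
    return b"dictionary-bytes-for:" + warc_path.encode()

@lru_cache(maxsize=128)  # bounded: least-recently-used files are evicted
def dictionary_for(warc_path: str) -> bytes:
    return fetch_dictionary_frame(warc_path)

# The first access pays the fetch; later accesses to the same file don't.
d1 = dictionary_for("crawl-00001.warc.zst")
d2 = dictionary_for("crawl-00001.warc.zst")
assert d1 is d2  # second call served from the cache
```

As long as replay keeps hitting the same finite set of WARC files, nearly every record read skips the dictionary round trip.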

ibnesayeed avatar Sep 05 '25 13:09 ibnesayeed

Perhaps there are advantages to wrapping the zstd dictionary in a WARC record, because it means metadata about the dictionary can be added? For example, could a unique ID (or the hash) of each dictionary be used to speed up dictionary caching for playback?

anjackson avatar Sep 05 '25 13:09 anjackson

Perhaps there are advantages to wrapping the zstd dictionary in a WARC record, because it means metadata about the dictionary can be added? For example, could a unique ID (or the hash) of each dictionary be used to speed up dictionary caching for playback?

This is actually a very good idea, which opens the door for inclusion of multiple dictionaries in a WARC file or a separate WARC file just for dictionaries. The dictionary metadata records can be compressed without a dictionary.

ibnesayeed avatar Sep 05 '25 14:09 ibnesayeed

I think the choice to put a single dictionary at the beginning of the WARC file is not a great decision for several reasons:

  • It breaks the recommendation and introduces a change incompatible with the existing Annex D record-at-time compression, particularly: "External indexes of WARC file content may then be used to record each record's starting position in the GZIP file, allowing for random access of individual records without requiring decompression of all preceding records."
  • Practically speaking, it means the current well-established paradigm, in which a CDX index offset and length are enough to decompress a particular record, no longer holds. The reader must make an out-of-band request to the beginning of the WARC file to read the dictionary, for an arbitrary amount of data, on possibly every read (in the worst case).
  • It makes certain very useful operations on WARCs, such as concatenating multiple WARCs together to form a new WARC, or splicing records from different WARCs to form a smaller WARC, impossible, because there may be multiple dictionaries across these WARCs and there is no clear way to combine them. This is a serious limitation of this approach.
  • It doesn't really make sense for compression-based reasons, as WARCs are not split by content type, but usually by the thread/process writing them in the course of a crawl. A typical WARC may have heterogeneous content that does not compress well with a single dictionary. Using one dictionary for all data in one WARC and a different dictionary for all data in another WARC is fairly arbitrary, given that they may just have been written by different processes. Ideal compression probably comes from using one dictionary for HTML, another for CSS and JS, and maybe no dictionary for already-compressed video and image files. But WARCs are not grouped by resource type in this way, so one dictionary applied to all WARC records is always going to be suboptimal.

I think a better approach, which is a compromise, would be to place the dictionary, if one is used at all, at the beginning of each WARC record. Yes, this will add some duplication. But:

  • A reasonable dictionary should be about 100K according to the official implementation. If adding 100K to a record is not worth the tradeoff, then perhaps a dictionary should not be used at all. If the file is so small that duplicating the dictionary has a negative impact, it's probably not a good idea to use one.
  • This would allow for using type-specific dictionaries, e.g. one for HTML, one for CSS, one for JS, etc., and maybe none for video/images (or a type-specific one that does work well), which could lead to better compression results across many different WARCs.
  • It would avoid breaking existing paradigms around CDX indexing, WARC concatenation and WARC record splicing.

WARC is an archival format that is written once and read many, many times (usually via random access for replay/search and sometimes linearly). The format should continue to be optimized for making the most common access pattern as efficient and simple as possible without breaking existing paradigms, while reducing size whenever possible.

ikreymer avatar Sep 05 '25 14:09 ikreymer

Perhaps there are advantages to wrapping the zstd dictionary in a WARC record, because it means metadata about the dictionary can be added? For example, could a unique ID (or the hash) of each dictionary be used to speed up dictionary caching for playback?

This could work as well, as it would at least solve WARC concatenation and record splicing by having these dictionary records indexed in the CDX and also accessible. There is at least precedent, with revisit records, for requiring a secondary WARC record to be available in order to fully read another WARC record, so this could work in a similar way (as opposed to requiring arbitrary reads at the beginning of a file). I still think just putting the dictionary at the beginning of each WARC record would be the simplest and most efficient solution, but I would not be opposed to this approach.

ikreymer avatar Sep 05 '25 14:09 ikreymer

I think a better approach, which is a compromise, would be to place the dictionary, if one is used at all, at the beginning of each WARC record. Yes, this will add some duplication. But: -- @ikreymer

I don't think this would be a good idea, because it would not only add duplicate data but also increase the size of the file significantly. Dictionaries yield significant advantages only when they are reused across multiple blocks with similar data in them.

ibnesayeed avatar Sep 05 '25 15:09 ibnesayeed

I think a better approach, which is a compromise, would be to place the dictionary, if one is used at all, at the beginning of each WARC record. Yes, this will add some duplication. But: -- @ikreymer

I don't think this would be a good idea, because it would not only add duplicate data but also increase the size of the file significantly. Dictionaries yield significant advantages only when they are reused across multiple blocks with similar data in them.

To be clear, I'm not suggesting each record should have a unique dictionary, but that the same dictionary should be included at the beginning whenever it is used (so the record can be parsed by standard zstd, which expects a dictionary at the beginning). E.g. if a dictionary is used 100 times, the same 100K bytes are added to each record, adding 10MB; presumably the savings would outweigh that. So yes, the tradeoff is duplicating the same dictionary bytes for each record, while otherwise keeping the benefit of a shared dictionary wherever it is used. I'm imagining a bunch of HTML records might share one dictionary, JS files another, HTML records from another site a third, etc. This would provide flexibility for experimenting with multiple dictionaries across records without breaking existing conventions.

It would be good to get some actual numbers to look at: what the sizes of the dictionaries are, and how much savings they provide.

ikreymer avatar Sep 05 '25 15:09 ikreymer

I think wrapping the dictionary in a WARC record addresses that part of what I pointed out.

wumpus avatar Sep 05 '25 17:09 wumpus

@wumpus

zstd is a relatively new compression format and should be discussed early and often before being adopted by the archive community.

At what point should we consider an algorithm mature? While relatively new compared to gzip, the original zstandard RFC is 7 years old and the open-source zstd implementation is 9 years old.

But the most important point that I'd like to make is that we should discuss these issues early and often, and not after petabytes of warc files are generated with a new format that was not discussed by the community.

Respectfully, the zstandard compression draft was discussed and merged 5 years ago - https://github.com/iipc/warc-specifications/pull/69 - after which petabytes of WARCs have been written in compliance with a community-drafted specification. I disagree with the claim that zstandard went undiscussed before being utilized.

EDIT: I want to clarify that my comments are unrelated to any ongoing discussions related to EOT 2024, and I'm trying to understand your concerns from the point-of-view of the WARC spec 😄

willmhowes avatar Sep 05 '25 19:09 willmhowes

Respectfully, the zstandard compression draft was discussed and merged 5 years ago - https://github.com/iipc/warc-specifications/pull/69 - after which petabytes of WARCs have been written in compliance with a community-drafted specification. I disagree with the claim that zstandard went undiscussed before being utilized.

The proposal was a good start, but it was far from accepted or widely discussed. Perhaps the issue was closed to indicate that a draft had been proposed and added, not that it was in any way accepted.

Note that the draft even states (emphasis of the author):

Words of caution

This specification is experimental and subject to change. Furthermore, the Zstandard format itself is much less mature than GZIP. The GZIP RFC was published in 1996; the Zstandard RFC was published in 2018. Before using Zstandard compression for archival, organizations should carefully consider the risks of relying on such a young format.


The author also pointed out that the draft was made to match what ArchiveTeam was already doing: https://github.com/iipc/warc-specifications/issues/53#issuecomment-704109978 and was designed to be a starting point.

I don't think it's fair to say the draft was ever discussed or accepted by the community outside of ArchiveTeam.

Given the experimental nature of the proposal, generating petabytes of WARCs against an experimental spec that is subject to change is an odd decision...


The open-source tooling around how to even index a WARC ZSTD file leaves much to be desired.

A gzip WARC can be extracted with gunzip file.warc.gz which is universally available on most platforms.

Unfortunately, unzstd file.warc.zst does not just work (I expected the dictionary frame to be parsed automatically, but this is not the case).

This Stack Overflow issue suggests that a custom Python script, based on an old version of CDX-Writer, is needed to even decompress the WARC.ZST. To use it, you need a version of the python zstandard library from around 2020-2021.

The current version of CDX-Writer includes an unreleased IA patch to the python zstandard library, presumably because the library has changed and no longer exposes the internals that make indexing possible (opened an issue about this here: https://github.com/internetarchive/CDX-Writer/issues/41).

This does not suggest the tooling is in a mature state to consider standardization.


After installing the old version of python zstandard, I checked three .warc.zst 'megawarcs' at random from EOT2024 and then recompressed them with warcio recompress.

The resulting .warc.gz was actually smaller than the .warc.zst in all three cases (by between 50-200MB out of 10-19GB)!
All the WARCs tested had the same 1K dictionary - perhaps that's too small?

Now I know that's not a representative sample and I don't doubt that zstd can offer significant benefits if used properly, but it's all the more reason to consider the tradeoffs more carefully before adopting zstd for wide use.

ikreymer avatar Sep 05 '25 20:09 ikreymer

@willmhowes are you saying we shouldn't discuss it now? I was working in radio astronomy when this proposal was introduced, and it has not come up until now since I returned to web archiving. I'm in favor of having a great process for discussing standards additions, including all stakeholders. IA is a stakeholder. CCF is a stakeholder. There are many more archiving stakeholders.

Our first attempt to discuss changes (http2 headers) didn't go well. Hopefully we can do better for this one.

wumpus avatar Sep 05 '25 21:09 wumpus

the same dictionary should be included at the beginning whenever it is used (so it can be parsed by standard zstd which expect a dictionary at the beginning)

I think the problem with including the dictionary in every record is it defeats the purpose of using a dictionary (reducing file size by sharing common data). I would expect storing a copy of the dictionary with each record to pretty much always be larger than just not using a dictionary.

As far as I can tell from the manpage and RFC, standard zstd doesn't expect a dictionary at the beginning of a stream, it only accepts one as an out of band option. The dictionary at the beginning of stream mechanism was defined by the WARC proposal, and has not been merged into zstd (yet). That's why you can't just unzstd the existing warc.zst files.

Perhaps the issue was closed to indicate a draft was proposed and added, not that it was in any way accepted.

I merged PR #69 to make it available on the website for people to read as formatted HTML. I think GitHub then auto-closed #53 because of the magic wording. The merging of the PR was not intended to convey anything about the status of the proposal. Apologies if it unintentionally did.

Perhaps there are advantages to wrapping the zstd dictionary in a WARC record, because it means metadata about the dictionary can be added? For example, could a unique ID (or the hash) of each dictionary be used to speed up dictionary caching for playback?

I'm a bit confused what this wrapping achieves. I can see it would give the dictionary a WARC-Record-ID. But suppose at replay time you're looking up a response record that's been dictionary compressed, how would you determine the location of the dictionary to load? You can't just refer to the dictionary WARC-Record-ID in the response record's WARC header because you need the dictionary to decompress the WARC header.

[Edit: Oh, is the idea with wrapping that you'd have multiple WARC files with identical dictionaries? Then you'd still have to read the start of each file to determine which dictionary it was but could use a digest header to avoid reading the payload if you happened to have it in memory from another file?]

If we want to allow multiple dictionaries and easy concatenation, it seems to me you could keep the current proposal basically as is but add another skippable frame at the start of each compressed WARC record that says "Hey, the dictionary I was compressed with is located N bytes before me in this file." Since the pointer to the dictionary is stored as a relative back-reference concatenating files doesn't break it. Maybe include the length of the dictionary too so that if you're accessing a remote WARC with HTTP range requests you can just send one extra request and get the entire dictionary. [Edit: But this does make properly combining files with identical dictionaries a little harder as you have to update all the back-references to remove the redundant dictionary. I'm also not yet convinced multiple dictionaries are worthwhile.]
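A sketch of what such a back-reference could look like, packed into the skippable-frame container from the zstd format (RFC 8878: a magic number in 0x184D2A50-0x184D2A5F, a 4-byte little-endian payload length, then user data). The two-field payload layout here (offset back to the dictionary, dictionary length) is made up for illustration:

```python
import struct

SKIPPABLE_MAGIC = 0x184D2A50  # any of the 16 skippable magics would do

def encode_dict_backref(offset_back, dict_len):
    # Payload: 8-byte LE offset back to the dictionary frame, then
    # 8-byte LE dictionary length (hypothetical field layout).
    payload = struct.pack("<QQ", offset_back, dict_len)
    return struct.pack("<II", SKIPPABLE_MAGIC, len(payload)) + payload

def decode_dict_backref(frame):
    magic, length = struct.unpack_from("<II", frame, 0)
    assert SKIPPABLE_MAGIC <= magic <= SKIPPABLE_MAGIC + 0xF and length == 16
    return struct.unpack_from("<QQ", frame, 8)

# Because the offset is relative to the record's own position,
# concatenating WARC files does not invalidate it.
frame = encode_dict_backref(offset_back=1_048_576, dict_len=102_400)
assert decode_dict_backref(frame) == (1_048_576, 102_400)
```

Conforming zstd decoders skip such frames, so readers that ignore the back-reference could still decompress the record once they have the dictionary by other means.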

ato avatar Sep 06 '25 06:09 ato

Warcat supports both compressing and decompressing warc.zst files as of version 0.3.0, but reading the encoder I can't work out whether it's compressing each WARC record with a different dictionary. @chfoo how do you approach this?

extua avatar Sep 10 '25 09:09 extua

Warcat supports both compressing and decompressing warc.zst files as of version 0.3.0, but reading the encoder I can't work out whether it's compressing each WARC record with a different dictionary. @chfoo how do you approach this?

No, it doesn't compress each record with a different dictionary. It follows the proposed spec as described, where headers and frames are manually parsed or processed using low level functions along with a state machine to ensure there is only one dictionary frame at the start of the file. The CLI interface currently doesn't expose a way for users to supply a dictionary, so encoding is done without a dictionary.

chfoo avatar Sep 10 '25 13:09 chfoo

Worth noting, there's a recently adopted standard for 'shared dictionary' compression and dictionary-based content negotiation: https://developer.chrome.com/blog/shared-dictionary-compression and https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Compression_dictionary_transport. So far it is only adopted in Chromium, but there may be some interesting ideas here that could be applied to WARCs. Perhaps semantics like Dictionary-ID and Available-Dictionary could be used, with WARC records storing dictionaries.

This also brings up the idea of a possible 'shared dictionary' that could be used for WARC headers, and possibly HTTP headers, since those generally all look the same. Revisiting compression is perhaps an opportunity to evaluate the current gzip approach of compressing WARC headers + HTTP headers + payload together.

For example, there could be a 'well-known' standardized dictionary for WARC + HTTP headers, while the WARC record HTTP payload dictionary can be specified in the headers, similar to the mechanism above. This could also address the long standing issue for not being able to seek directly into larger, already compressed payloads (like video/audio), without decompressing from the beginning, and could offer improvements for access as well as storage.

Very rough ideas on how this could work:

  1. WARC headers + HTTP headers are decompressed using a 'well-known' dictionary (we would need to figure out how to train it, but there's lots of data out there for this)
  2. The WARC headers contain a WARC-Payload-Encoding: zstd and an optional WARC-Use-Dictionary header carrying the hash of the dictionary to use for the payload, or WARC-Payload-Encoding: identity (or the header absent?) to indicate an uncompressed payload (e.g. this might be used for video to allow fast seeks).
  3. If WARC-Payload-Encoding: zstd is provided, decode the payload with zstd. If WARC-Use-Dictionary is provided, the dictionary with that hash must be used.
  4. The dictionary can be provided with a dictionary record that is just a special case of a resource record containing the binary dictionary, and could be indexed as a resource record. Since it won't have a URI, perhaps its hash is used as the URI in the CDX (or some other workaround), or perhaps something like Dictionary-ID is allowed for arbitrary IDs, etc.
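The reader side of the steps above could be sketched as follows. The header names (WARC-Payload-Encoding, WARC-Use-Dictionary) are the hypothetical ones from the proposal, and gzip stands in for the zstd path so the sketch stays self-contained:

```python
import gzip

def decode_payload(headers, payload, dictionaries):
    # headers: dict of WARC header name -> value
    # dictionaries: hash -> dictionary bytes, from indexed dictionary records
    encoding = headers.get("WARC-Payload-Encoding", "identity")
    if encoding == "identity":
        return payload                       # stored uncompressed
    if encoding == "gzip":
        return gzip.decompress(payload)
    if encoding == "zstd":
        dict_hash = headers.get("WARC-Use-Dictionary")
        dict_data = dictionaries[dict_hash] if dict_hash else None
        raise NotImplementedError("zstd path omitted in this sketch")
    raise ValueError("unknown encoding: %s" % encoding)

body = b"<html>example</html>"
assert decode_payload({"WARC-Payload-Encoding": "gzip"},
                      gzip.compress(body), {}) == body
assert decode_payload({}, body, {}) == body  # absent header = identity
```

Treating an absent header as identity keeps the new semantics opt-in, per the ambiguity point made later in the thread.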

ikreymer avatar Sep 13 '25 23:09 ikreymer

This could also address the long standing issue for not being able to seek directly into larger, already compressed payloads (like video/audio), without decompressing from the beginning, and could offer improvements for access as well as storage.

This is a good point, and we could address this in general by allowing multiple compressed blobs for every compression method that supports concatenated blobs. This change will break readers, though, because it's often the case that the code that reads compressed blobs needs to loop. Python's gzip library doesn't do that by default, you have to add a loop. But it would be cool to be able to efficiently seek deep into a video during playback.
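As a stdlib-only sketch of the loop being described: reading concatenated gzip members at the zlib level means restarting a decompressor on each member's unused_data.

```python
# Reading a stream of back-to-back gzip members requires looping,
# restarting the decompressor on whatever bytes remain after each member.
import gzip
import zlib

# Two independently compressed blobs, concatenated.
stream = gzip.compress(b"first member ") + gzip.compress(b"second member")

def read_all_members(data):
    out = []
    while data:
        d = zlib.decompressobj(wbits=31)  # wbits=31 selects gzip framing
        out.append(d.decompress(data))
        data = d.unused_data              # bytes after this member's end
    return b"".join(out)

assert read_all_members(stream) == b"first member second member"
```

Because member boundaries are self-delimiting, a reader that knows a member's offset can also start decompressing mid-file, which is what would enable seeking into a large payload.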

wumpus avatar Sep 14 '25 04:09 wumpus

This is a good point, and we could address this in general by allowing multiple compressed blobs for every compression method that supports concatenated blobs. This change will break readers, though, because it's often the case that the code that reads compressed blobs needs to loop. Python's gzip library doesn't do that by default, you have to add a loop. But it would be cool to be able to efficiently seek deep into a video during playback.

Yeah, that's why I was suggesting this general approach along with the changes to support zstd, since readers would of course need to be updated to handle zstd anyway. The reader would need to peek at the bytes to determine which encoding is used, and could then allow a WARC-Payload-Encoding: gzip or WARC-Payload-Encoding: identity even without using zstd. We would want to require this header to indicate that the payload is encoded separately, to avoid any ambiguity; e.g. the current behavior (entire WARC record in one gzip block) is still assumed if the new header is not present.

ikreymer avatar Sep 14 '25 07:09 ikreymer