
Discussion of HTTP/2 use in WARC 1.1

Open CorentinB opened this issue 3 months ago • 14 comments

This discussion about a (perfectly valid) use of Zstandard for WARCs made me reflect on some longstanding problems in the web archiving community. For years, many web crawlers have failed to comply with the WARC specifications. If we want to build a stronger, more reliable foundation for web archiving, it is time we address these issues openly.

This post focuses on a particularly troubling practice: web crawlers that modify WARC records to "support" HTTP/2 archiving.


Example 1: Common Crawl’s Nutch

Common Crawl’s Nutch crawler rewrites HTTP/2 responses as if they were HTTP/1.1, thereby falsifying the captured records.

  • Commit introducing the falsification mechanism: https://github.com/commoncrawl/nutch/commit/5f4369298cbedf85572514d5f97a346935e338f0
  • Public discussion of this approach: https://github.com/commoncrawl/nutch/issues/29

The result? Petabytes of falsified WARCs. (unless I am mistaken and Common Crawl uses another, spec-compliant, crawler)
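To make concrete what this rewriting looks like, here is a hypothetical sketch in Python (this is not Nutch's actual code; the function name and header choices are purely illustrative) of the kind of translation these crawlers perform before writing the response record:

```python
# Hypothetical sketch of the HTTP/2 -> HTTP/1.1 rewriting under discussion.
# NOT Nutch's actual code; it only illustrates the practice.

def h2_response_to_http11_bytes(headers, body):
    """headers: (name, value) pairs as decoded from the HTTP/2 frames,
    including the ':status' pseudo-header; body: raw payload bytes."""
    status = dict(headers)[":status"]
    # HTTP/2 has no textual status line and no reason phrase, so both
    # are fabricated here -- bytes that were never on the wire.
    lines = [f"HTTP/1.1 {status} OK"]
    for name, value in headers:
        if name.startswith(":"):  # drop HTTP/2 pseudo-headers
            continue
        lines.append(f"{name}: {value}")
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("iso-8859-1") + body

# The stored record block claims HTTP/1.1 although the server spoke h2:
block = h2_response_to_http11_bytes(
    [(":status", "200"), ("content-type", "text/html")],
    b"<html>...</html>",
)
```

Whatever one thinks of the trade-off, the bytes in the record are no longer the bytes the server sent.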


Example 2: Heritrix's FetchHTTP2 module

This module enables essentially the same behaviour as the Nutch code. I know for a fact that it was not used by the Wayback Team when I was there (it was actually created after I left); someone from there could confirm whether it's actually used or not. (AFAIK this module is, luckily, not enabled by default.)


Example 3: Storm Crawler

Storm Crawler is much smaller in scope than Nutch, so I hope it has not been widely adopted by the archiving community. However, it too produces bad WARC records.

As their own README states:

“The WARC file format is derived from the HTTP message format (RFC 2616) and the WARC format as well as WARC readers require that HTTP requests and responses are recorded as HTTP/1.1 or HTTP/1.0. Therefore, the WARC WARCHdfsBolt writes binary HTTP formats (eg. HTTP/2) as if they were HTTP/1.1. There is no need to limit the supported HTTP protocol versions to HTTP/1.0 or HTTP/1.1.”

In other words: instead of preserving the response as-is, the crawler rewrites it into something it never was. Worse, Storm Crawler also deletes certain HTTP headers outright and modifies responses because it cannot handle them faithfully. (That broader problem deserves its own discussion.)


I hope this summary helps spark a serious conversation within the web archiving community. Some may argue that aspects of the specification should evolve—and of course MAYBE they should! All conversations are welcome.

But please, stop creating falsified HTTP records. Stop claiming compliance with the WARC specification when it is clear you are not. The credibility of our archives depends on it, and if we want a reliable future for web archiving, we must start by respecting the specifications we already have. Let's work together to properly support HTTP/2 (and 3) in the next version of the WARC spec.

CorentinB avatar Sep 05 '25 22:09 CorentinB

Appreciate your politeness.

wumpus avatar Sep 05 '25 23:09 wumpus

Appreciate your politeness.

You’re welcome. Could you please share your point of view on the issues I’ve raised here?

CorentinB avatar Sep 05 '25 23:09 CorentinB

I did share my point of view.

wumpus avatar Sep 05 '25 23:09 wumpus

My goal when web archiving is to preserve web resources, not network messages. Therefore I consider translating or changing transport-level message headers for implementation practicality acceptable provided the semantics needed for replay are preserved. I do think it's important to record that such a transformation has happened though, hence the WARC-Protocol proposal (#42).
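(For illustration, under that proposal a record might declare the transport like this; the values shown here are illustrative and the authoritative syntax and value registry are in #42:)

```
WARC/1.1
WARC-Type: response
WARC-Target-URI: https://example.com/
WARC-Protocol: h2
WARC-Protocol: tls/1.3
Content-Type: application/http;msgtype=response
```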

HTTP/2 was intentionally designed to have the same semantics as HTTP/1.1 and I haven't seen any problems from the translation but would like to hear about them if anyone has. It's also possible that will change at some point as more features are added to HTTP/2 and 3.

FetchHTTP2 is explicitly documented as not recording the original wire message, and as far as I know it and Heritrix more broadly don't make any particular claims about WARC compliance. Nonetheless, I don't want to give a false expectation, by omission, to users who find exact preservation of network messages important. I'll update the FetchHTTP2 documentation to note that recording of HTTP/2 is not currently defined by the base WARC standard and recommend sticking to the original FetchHTTP module if you require network message preservation or strictly adhering to the base standard without extension.

The WARC specification itself uses language like "should", "where possible" and "implementation limits" when referring to the recording of network protocol information so it does seem not so clear cut to me that this practice should be considered non-compliant, but I do think it's important to be transparent.

I would be happy to see an effort to standardize the recording of HTTP/2 binary protocol by those who are interested in it. I guess the reason nobody (to my knowledge) has so far is that it seems like it would make both recording and replay more difficult and break compatibility with existing tools for limited practical benefit. Perhaps there are solutions for that though.

ato avatar Sep 06 '25 00:09 ato

@ato first and foremost thanks for the thoughtful explanation and for emphasizing transparency. I really appreciate you taking the time to discuss this, I really do!


My goal when web archiving is to preserve web resources, not network messages. Therefore I consider translating or changing transport-level message headers for implementation practicality acceptable provided the semantics needed for replay are preserved.

WARC’s request/response records are meant to hold the actual HTTP messages "received over the network, including headers" (for responses) and "sent over the network, including headers" (for requests). That part of the specification is very clear.

When you (and others!) rewrite the content of the response, it is not compliant with the spec. Removing something that was there (like the HTTP/2 status line), adding something new (an HTTP/1.1 status line), and in some cases, like Nutch, adding new HTTP headers, or Webrecorder's tool completely mangling headers: all of this falls well outside the "a 'response' record block should contain the full HTTP response received over the network, including headers" part of the spec.

That’s the thing users expect to be preserved "where possible." Rewriting an HTTP/2 exchange into HTTP/1.1 syntax changes the message representation, which makes it a transformation rather than the original capture.


The WARC specification itself uses language like "should", "where possible" and "implementation limits" when referring to the recording of network protocol information so it does seem not so clear cut to me that this practice should be considered non-compliant, but I do think it's important to be transparent.

When the spec's "where possible" and "implementation limits" apply (e.g., software bugs or network issues), writers may operate within "best effort" boundaries, but the target remains the full HTTP message as received/sent over the network, including headers. Protocol is not such a limit: you can easily choose to only speak HTTP/1.1.


I'll update the FetchHTTP2 documentation to note that recording of HTTP/2 is not currently defined by the base WARC standard and recommend sticking to the original FetchHTTP module if you require network message preservation

I really appreciate that!


I would be happy to see an effort to standardize the recording of HTTP/2 binary protocol by those who are interested in it.

I know for a fact that multiple people would gladly take an active interest in making this standard evolve. As some of you may know, I maintain, with many others, the Zeno crawler, which leverages gowarc, a library for writing WARC files from HTTP traffic. I know for a fact that every active contributor to Zeno would 100% like to take an active part in a discussion about evolving the standard for HTTP/2; we all need it. We're tired of sticking to HTTP/1.1! But we want to respect the specifications.

I don't want to speak for them of course, but I would assume that the maintainers of wget-at would also love to see that happen.

They and we (the Zeno maintainers) try as much as we can to stick to the spec, which means we love the spec... which means we want to see it evolve for the modern world. :)

And just to be clear, this is all me talking; I do not speak on behalf of anyone else.

CorentinB avatar Sep 06 '25 09:09 CorentinB

@wumpus sorry, I am confused, what do you mean? I do not see any point of view shared here, only two messages, as follows:

Appreciate your politeness.

I did share my point of view.

Maybe you forgot to send the one where you share your point of view? It would be very interesting to hear considering your crawler is at the heart of the discussion.

CorentinB avatar Sep 06 '25 09:09 CorentinB

Good argument. I concede that if one interprets 'should' as a strict requirement (which is most likely correct given the other usages in the spec) then indeed WARC 1.1 prohibits storing anything other than a network-received, HTTP/1.1-compatible message in a response record with an 'http' or 'https' URI. Recording an HTTP/2 binary message would also not comply, and so for strict compliance you must indeed either stick to HTTP/1.1 or use a 'resource' record.

Now that you've convinced me I'm a recalcitrant rapscallion who doesn't truly love the spec, how do you think it should be updated to support HTTP/2? :-)

ato avatar Sep 06 '25 11:09 ato

Now that you've convinced me I'm a recalcitrant rapscallion who doesn't truly love the spec, how do you think it should be updated to support HTTP/2? :-)

Haha! I'm actually 100% sure you love the spec, else you wouldn't even be having this conversation with me!

I don't have a clear idea of how the spec should evolve yet. I have had many discussions about it with many smart people over the years, but in the end I am absolutely not saying that I am the most qualified to find a good path forward.

But what if we list the active actors (people who write and use WARC at scale) and figure out a way to get together and start a movement to actually work on this?

CorentinB avatar Sep 06 '25 15:09 CorentinB

@CorentinB, I wouldn't call Common Crawl's WARC files "falsified". The semantics of the HTTP message are preserved, while the literal bytes are modified. This can be the case even for HTTP/1.1 WARC response records, because content and transfer encodings are decoded to make the HTTP messages easier to consume. This can be a challenge for WARC parsers: some do not support every encoding (e.g. Brotli), others do not support any encoding at all. Ease of use and backward compatibility were the main motivations (details here) for how HTTP/2 captures are currently stored in Nutch and StormCrawler. The decision was a very pragmatic one: keep the changes minimal in order not to break the code of WARC consumers, but improve the performance of the crawler. To emphasize it again: it's only about the form, not the semantics. Otherwise a reprint of a book would already be called a "falsification".

The WARC format is inspired by the HTTP/1.x message format. HTTP/2 does not change the overall semantics: HTTP headers are still a list of key-value pairs. The changes are about performance. Storing HTTP/2 literally in WARC records poses a major challenge: because of HPACK header compression, decompressing the headers of one request/response pair requires the headers of the preceding requests and responses to/from the same server.
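A minimal sketch with the Python hpack package shows the dependency (the header name is made up; behaviour assumes hpack's default incremental indexing):

```python
# Demonstrates that HPACK decoding is stateful: a header block from later
# in a connection may be undecodable without the earlier blocks.
from hpack import Encoder, Decoder

enc = Encoder()
# Two responses on the same connection; the second reuses a dynamic-table
# entry created while encoding the first.
block1 = enc.encode([(":status", "200"), ("x-example", "first-value")])
block2 = enc.encode([(":status", "200"), ("x-example", "first-value")])

dec = Decoder()
print(dec.decode(block1))  # ok: table updates are replayed in order
print(dec.decode(block2))  # ok: the indexed reference resolves

fresh = Decoder()
try:
    # Decoding block2 alone fails: it points at a dynamic-table entry
    # that only exists after block1 has been decoded.
    fresh.decode(block2)
except Exception as err:
    print("cannot decode in isolation:", err)
```

So a WARC record holding a raw HTTP/2 header block would not be self-contained the way an HTTP/1.1 record is.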

But the major headache is backward-compatibility: there are many WARC parsers. Adapting any significant part of it to a new format would require a lot of work.

sebastian-nagel avatar Sep 09 '25 17:09 sebastian-nagel

To emphasize it again: it's only about the form, not the semantics. Otherwise a reprint of a book would already be called a "falsification".

@sebastian-nagel thanks for your answer. I do agree that the word is strong; I am not sure I know an English word that would perfectly reflect what I want to say while being softer.

BUT, Nutch also rewrites HTTP headers and adds new ones, like Content-Length or all the X- stuff it adds. So... if you start changing words in a book and selling it as the original, then yes, for me, it's falsification.

I am open to using a better word if you have one, though!

Adapting any significant part of it to a new format would require a lot of work.

Yes, but I'm not sure any of us is afraid of doing some work for that, no?

CorentinB avatar Sep 09 '25 17:09 CorentinB

Storing HTTP/2 literally in WARC records poses a major challenge: because of HPACK header compression, decompressing the headers of one request/response pair requires the headers of the preceding requests and responses to/from the same server.

I guess you could record the header block with the HPACK references to previous messages resolved, although Corentin would presumably call that 'falsified'. And if we're going to do that then why not record it fully decoded, perhaps even in... HTTP/1.1 syntax. :-)

I guess you could record the HPACK-encoded header block but also separately record enough information (i.e. the relevant dynamic table entries) to allow decoding. But this means every reader not only has to implement HPACK but also needs a way to preload the table, which may preclude using off-the-shelf implementations that don't expose a way to do this.

My personal preference for the next revision of the standard is to make recording translated to HTTP/1.1 syntax with transfer-encoding stripped not just allowed but recommended. And then have ways to record any transport details that matter separately. For example, if you think preserving the original HTTP/2 header block is important, you could link it as transport metadata. That way readers who don't speak HTTP/2 can still access the data, but extra details can still be recorded for those readers who do understand it.
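Sketching that shape with standard WARC fields plus the proposed WARC-Protocol (a purely hypothetical layout, not a worked-out proposal; identifiers are made up):

```
WARC/1.1
WARC-Type: response
WARC-Record-ID: <urn:uuid:11111111-1111-1111-1111-111111111111>
WARC-Target-URI: https://example.com/
WARC-Protocol: h2
Content-Type: application/http;msgtype=response

HTTP/1.1 200 OK          <- translated from the HTTP/2 exchange
...

WARC/1.1
WARC-Type: metadata
WARC-Concurrent-To: <urn:uuid:11111111-1111-1111-1111-111111111111>
Content-Type: application/octet-stream

<original HTTP/2 header block, for readers that understand it>
```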

One similar extra detail I do think would be useful to be able to record is the timing of when individual data chunks are received. For example, imagine a page that uses server-sent events (a long-running HTTP response that slowly trickles new data over time) to apply real-time updates, such as the live-updating score of a sports match or the chat associated with a video live-stream. Without more detailed timing information you can't accurately replay this.

ato avatar Sep 10 '25 00:09 ato

although Corentin would presumably call that 'falsified'.

Ok, just to be clear once and for all, because I can see my use of the word touched some nerves, which I can understand: what I call "falsified" (still open to a better word, since I'm ESL) is when we do something to the data that is CLEARLY changing the representation of what it is. Like rewriting an HTTP/2 response as HTTP/1.1. Or adding HTTP headers, or removing them, or changing their value. (Nutch does that a lot.)

I have NOTHING against a clear solution where we actually make some modifications to the way we store data so that it's easier to use, but then the what, why, and how just need to be clearly explained in the spec, and I personally only find it acceptable if we do not lose any meaningful information. (like headers, or "which HTTP protocol was used by the server")

My personal preference for the next revision of the standard is to make recording translated to HTTP/1.1 syntax

My problem with that is that we lose the information about which HTTP protocol was used by the site. And we start touching the HTTP response, which, until now, we had no reason to touch. A spec-compliant WARC writer right now can simply write the data as it is received from the pipe. I understand HTTP/2 and 3 add many challenges, but I think we should stay as close as possible to that behaviour.

I don't think that needing more computing power to play back the records is such an issue. The WARC spec was made like 20 years ago, and we have so much more compute power nowadays that I think we can afford to make playback more resource-intensive for HTTP/2 resources, so that we can still archive the web as close as possible to what it is.

One similar extra detail I do think would be useful to be able to record is the timing of when individual data chunks are received. For example, imagine a page that uses server-sent events (a long-running HTTP response that slowly trickles new data over time) to apply real-time updates, such as the live-updating score of a sports match or the chat associated with a video live-stream. Without more detailed timing information you can't accurately replay this.

100%

CorentinB avatar Sep 10 '25 10:09 CorentinB

My personal preference for the next revision of the standard is to make recording translated to HTTP/1.1 syntax

I'd second this. I see no chance that a substantial share of WARC readers will adapt quickly to a new storage format for HTTP/2 or HTTP/3 captures.

Adapting any significant part of it to a new format would require a lot of work.

Yes, but I'm not sure any of us is afraid of doing some work for that, no?

Some work: yes. However, there are other features (WARC zstd compression, for example) which may yield more benefit for the same investment of work. And again: it's not only about the spec but also about uplifting multiple WARC readers and writers.

with transfer-encoding stripped not just allowed but recommended.

Luckily, HTTP/2 has dropped chunked transfer-encoding (RFC 7540, section 8.1), and "gzip" (and other) transfer-encodings are rarely used.

only find it acceptable if we do not lose any meaningful information. (like headers, or "which HTTP protocol was used by the server")

And that's how it's done: The WARC-Protocol field holds the HTTP version. The original headers (Content-Length, Content-Encoding and Transfer-Encoding) are preserved and prefixed by X-Crawler-. If the payload length has changed, a new Content-Length header is added.
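For illustration, the stored response record then carries headers like these (a made-up example of the convention just described, not actual Nutch output; values are invented):

```
WARC/1.1
WARC-Type: response
WARC-Protocol: h2
Content-Type: application/http;msgtype=response

HTTP/1.1 200 OK
Content-Type: text/html
X-Crawler-Content-Encoding: gzip
X-Crawler-Content-Length: 3120
Content-Length: 10240
```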

sebastian-nagel avatar Sep 10 '25 13:09 sebastian-nagel

And that's how it's done: The WARC-Protocol field holds the HTTP version. The original headers (Content-Length, Content-Encoding and Transfer-Encoding) are preserved and prefixed by X-Crawler-. If the payload length has changed, a new Content-Length header is added.

So you modify the HTTP headers, which means you lose the original headers. What happens if the server sent one of the headers you add to its response? You overwrite it, I suppose?

I don't understand how it's possible to defend that modifying headers and adding new ones is a good idea.

The record is supposed to represent what was sent to you by the server, you shouldn't modify it.

CorentinB avatar Sep 10 '25 13:09 CorentinB