tower-http icon indicating copy to clipboard operation
tower-http copied to clipboard

Hash-based etag layer

Open casey opened this issue 2 years ago • 16 comments

I'm thinking about writing a hash-based etag layer.

My plan is to:

  • When a response is going to go out, hash the body with blake3 or xxhash
  • Add a strong etag header with the digest
  • If the request contains an If-Match with a matching etag value and the response code is 200, set the code to 304 and remove the body

I probably can't contribute what I wrote to tower-http, since I'm very busy with the project that this is for, but if I wind up writing it I can definitely post or link to the code, in case a tower-http contributor is interested in getting it into tower-http. Also, even if it doesn't get into tower-http, it might give some insight into the viability of the approach.

Some reasons why this might not be a good idea:

  • Calculating response body digests adds too much CPU load for not enough benefit
  • Replacing the body of responses interacts badly with content encoding, or other layers

Any thoughts on the viability of this approach are welcome, especially if it's been considered for tower-http. I've been looking around for other applications that automatically add etags using a hash function, and I haven't found many. This makes me wonder if the idea is actually wrothwhile.

casey avatar Jan 25 '23 21:01 casey

Yeah we could have something like that. The issue is that it requires buffering the whole response body in the middleware which doesn't work for streaming responses but for responses made using Full it should be fine.

davidpdrsn avatar Jan 27 '23 08:01 davidpdrsn

Our axum web server is now behind cloudflare, so this is basically at the bottom of my priorities list. I started working on it, but I want to make sure I get the most out of cloudflare before this, since there are downsides, like the CPU needed to hash and the buffering.

By the way, huge thank you to everyone working on tokio, tower, and axum, and the massive long tail of dependencies which enabled us to ship something great.

We are up to 13 million requests per day, around 150 qps, and we absolutely could not have done this without tower. Everything we do is open source and CC0, so everywhere it makes sense we want to upstream.

Thank you! ❤️

casey avatar Feb 08 '23 00:02 casey

We are up to 13 million requests per day, around 150 qps, and we absolutely could not have done this without tower. Everything we do is open source and CC0, so everywhere it makes sense we want to upstream.

I'm curious, who is 'we'?

jplatte avatar Feb 08 '23 06:02 jplatte

We are me and the small cadre of lunatics behind casey/ord, which is the block explorer and wallet for ordinals, a Bitcoin NFT protocol that has been getting popular recently. The main feed of new NFTs is here. Things are amazingly rough, but the response has been incredible.

casey avatar Feb 08 '23 08:02 casey

Oh 😐

jplatte avatar Feb 08 '23 08:02 jplatte

I think it could make sense to implement this for ServeDir, where the ETag could be computed without reading everything into memory by reading the file twice, cached or even precomputed like can be done for compression.

SvizelPritula avatar Jun 17 '23 20:06 SvizelPritula

Rather than hashing whole file, I would suggest to consider computing etag from last modified timestamp, length and optionally hash of file name

DoumanAsh avatar Jul 03 '23 05:07 DoumanAsh

Hashing the filename is useless, since renaming the file will result in it having a different URL and cache key anyway. Hashing the length doesn't help much either, as there are types of files whose length never changes, such as the index.html built by build tools that just insert a hash of scripts into it. Hashing the last modified time has the opposite issue: It can easily change without the file itself changing.

---------- Původní e-mail ---------- Od: Douman @.> Komu: tower-rs/tower-http @.> Kopie: SvizelPritula @.>, Comment @. github.com> Datum: 3. 7. 2023 7:24:13 Předmět: Re: [tower-rs/tower-http] Hash-based etag layer (Issue #324) "

Rather than hashing whole file, I would suggest to consider computing etag from last modified timestamp, length and optionally hash of file name

— Reply to this email directly, view it on GitHub (https://github.com/tower-rs/tower-http/issues/324#issuecomment-1617373309), or unsubscribe (https://github.com/notifications/unsubscribe-auth/AHYWLAODUGSC3UE62HPL3QDXOJJPRANCNFSM6AAAAAAUGZWSVA) . You are receiving this because you commented. Message ID: <tower-rs/tower- @.***> "

SvizelPritula avatar Jul 03 '23 06:07 SvizelPritula

Related: https://github.com/tower-rs/tower/issues/692. This seems like a pretty often requested feature

billythedummy avatar Sep 20 '23 08:09 billythedummy

I've started on an implementation at https://github.com/billythedummy/tower-etag-cache/tree/master/tower-etag-cache. The implemented ConstLruProvider probably won't work well for streaming responses due to https://github.com/tower-rs/tower-http/issues/324#issuecomment-1406199880 but the generic traits exported by the crate (hopefully) allows for better implementations. It currently seems to work ok with the examples/simple example axum server in the repo

billythedummy avatar Oct 04 '23 08:10 billythedummy

@billythedummy I am also interested in this feature, so I had a look at your implementation and example. Thank you for your work.

I tested the caching behaviour with the index.css file of your example app:

  1. Request the file index.css => 200 OK (Etag is returned)
  2. Request the file again (including if-none-match with matching Etag) => 304 Not Modified
  3. Modify the css file on the server
  4. Request the file again (including if-none-match with same Etag as in step 2) => 304 Not Modified

But in step 4, the server should return the new file (as it has been modified). Doesn't this defeat the purpose of the Etag middleware?

Weltpilot avatar Oct 14 '23 10:10 Weltpilot

@Weltpilot thank you for the review. I wanted to just do a basic proof-of-concept with the ConstLruProvider so it has a very simple model that assumes all resources are static and don't change throughout the lifetime of the server. You can build your own cache provider with different behavior - perhaps it invalidates certain entries when files in a specific directory change or perhaps it always reruns the request and recalculates the etag

billythedummy avatar Oct 15 '23 09:10 billythedummy

If you assume that all resources are static and don't change, why not simply always return "304 Not Modified", when the request contains a "If-None-Match" or "If-Modified-Since" header?

Weltpilot avatar Oct 16 '23 10:10 Weltpilot

@Weltpilot does not change throughout the lifetime of the server. What this enables:

  • user GETs index.html, server responds with document + etag abc
  • i modify the contents of index.html and push the update, restarting my server. The etag of index.html is now xyz due to the different content
  • user GETs index.html again with if-none-match abc. The new server does not see a index.html: abc entry in its cache and responds with the new index.html + etag xyz

billythedummy avatar Oct 17 '23 03:10 billythedummy

@billythedummy So, when you want to change a static file, you need to restart the server? Sorry, that doesn't make any sense to me. @jplatte, @davidpdrsn Maybe one of the maintainers can make a proposal how to implement an ETag middleware that has a chance of being accepted as pull request? I would opt for the suggestion by @SvizelPritula to extend ServeDir by an option to read the ETag from the file system, that has been generated in advance (similar to the compressed files).

Weltpilot avatar Oct 18 '23 09:10 Weltpilot

I think the best path is to develop it as a standalone crate, unless it specifically needs private API for tower-http, that we can’t just make public.

tower-http doesn’t get much maintenance these days (personally I’m busy with work and axum) so don’t wanna add more stuff to tower-http which further burdens maintainers.

davidpdrsn avatar Oct 18 '23 11:10 davidpdrsn