drf-extensions
drf-extensions copied to clipboard
Correct implementation of ETAGs in drf-extensions
In reference to the discussion in #7, I would like to open a new ticket based on the discussion started there by @mbox:
@chibisov first of all thank you for the amazing work you guys have done with drf-extensions! Especially I find the implementation done for cache key constructors & bits very elegant!
Even though on the surface it looks like using the cache key constructor and bits logic can be reused for generating ETAGS, I have to fully agree with what @mbox discusses in #7: the ETAG implementation within drf-extensions is currently incorrect. Let me elaborate further:
Current Issues
Looking at the example given in the documentation (http://chibisov.github.io/drf-extensions/docs/#http-etag) there are a two issues which need to be addressed:
1. ETAG calculation
In the example an etag is calculated based only on information provided by the request. In this case the number of args and kwargs that the view method received. If any new city is added to the database then regardless of this fact the same etag is generated and a 304 not modified will be returned. Also when any cache is invalidated still this same etag will be used and a 304 returned. This is not correct.
The ETAG here should have been generated here for example by generating a hash over the city names returned by the query, something like: hashlib.md5(str(City.objects.all().values_list('name', flat=True))).hexdigest()
or more easily (as a default) just a hash of the entire response body.
2. Use of strong ETAG's
A strong ETAG is generated currently, this allows a client to use the generated etag to do a byte-range request (https://tools.ietf.org/html/rfc7233) since a strong ETAG indicated byte-by-byte equivalence. The query used in the example in the documentation doesn't explicitly define any order: City.objects.all().values_list('name', flat=True)
. In this case a strong ETAG would be possible when the query is changed to City.objects.all().order_by('name').values_list('name', flat=True)
or of course when the model itself defines a default ordering. In more complex API's IMHO it is only possible to return a strong ETAG when the ETAG is calculated over the entire response body.
If the calculated ETAG is not calculated over the entire response a weak ETAG should be returned (etag prefixed by W/) this indicates semantic equivalence but not byte-by-byte equivalence. And thus only allows the ETAG to be used in conditional get requests (https://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.3.3). Which is the intention of drf-extensions etag implementation anyway.
Proposal
1. Change the default implementation of generating ETAGs in drf-extensions to calculate the ETAG over the response body.
This means that a response always needs to be generated. But is the only correct default implementation possible that works in all cases. And still there is the added advantage of being able to return a 304 Not Modified and not transferring all data to the client.
Also update the documentation to mention that it is important for entities to have a default ordering.
Since this default implementation calculates the hash over the entire response body, a strong ETAG is returned in this case.
Nice to have: use a fast non cryptographic hash function instead of MD5 like Google's cityhash.
2. The default behaviour can be short circuited when an implementation per entity is done explicitly. This uses the current logic.
In this case there are a number of possible ways to generate an ETAG that correctly returns a different ETAG when anything has changed. An ETAG in this case could be generated by using the Django ORM to query the db for a subset of data normally used. For example generate an ETAG based on all id's and updated_at fields in the database. Or generate the ETAG on a value cached in a django cache. This effectively prevents having to render the entire response.
In this case a Weak ETAG should be returned (ETAG prefixed by W/). We cannot guarentee any byte-by-byte equivalence, but we can state that the entity representation is still semantically equivalent.
I have already done an implementation of this in another project. I would be happy to do an implementation in a PR of the idea proposed in this ticket. But before I spend serious time on this I want to be sure that @chibisov agrees with this line of thought.
Thanks!
In the example an etag is calculated based only on information provided by the request If any new city is added to the database then regardless of this fact the same etag is generated and a 304 not modified will be returned
That's the case when you need to add some custom key bit like this
The ETAG here should have been generated here for example by generating a hash over the city names returned by the query
What about the ListSqlQueryKeyBit combined with the bit described above?
For example generate an ETAG based on all id's and updated_at fields in the database
I don't like the idea with perfoming the request by the all entities in the database.
I would be happy to do an implementation in a PR of the idea proposed in this ticket
Let's do this. I will be glad to review it.
Hey man, sorry for getting back to you only now. Our project has moved away from using drf-extensions in favor of using an adapted version of Django's own caching/etag machinery. So I won't be having time to really dive into this. Closing the issue therefore and wishing you all the best!
I agree with the author, the current ETAG implementation is totally broken. The only way for it to not be broken is to override the default key constructor and to construct a key that will change whenever the content changes, which is not mentioned in the documentation.
There is really no reason to use this implementation alongside view caching, despite the documentation alluding to this approach, since as @shanx mentions any reasonable ETAG generation will have to do some kind of processing to determine if the contents have changed.
IMO the best implementation would be for the default to check the cache for the generated key and use that to determine if it's valid or not. The only benefit here is that you then don't need to transfer unchanged data back to the client again.