Use canonical HTTP requests instead of URLs as db indexes
This is a follow-up to #1, which covers two different subjects. This particular issue is about replacing the URL as a handle for cached content with a richer object that also includes information from request headers, i.e. a simplified or canonical version of the HTTP request.
As indicated in #1:
Actually, since different request headers may cause different responses and documents, we may want to use not the URL as an index, but rather the hash of the request itself after putting it into some "canonical" form. […] The injector injects [hash of canonical request]. When requesting a URL, the client constructs the canonical request again, hashes it, and looks up [the document]. […] This storage format also avoids enumerating the URLs stored by ipfs-cache, unless the client or injector also uploads QmBLAHBLAH… to IPFS, of course.
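A minimal sketch of that flow (in Python, with an in-memory dict standing in for the distributed db; the canonical form and the choice of SHA-256 are illustrative assumptions, not a settled scheme):

```python
import hashlib

def canonical_request(method, url, language):
    # Hypothetical canonical form: just a few normalized fields.
    return f"{method.upper()}:{url}:{language.lower()}"

def index_key(method, url, language):
    # The hash of the canonical request serves as the db index, so the
    # index alone does not enumerate the stored URLs.
    req = canonical_request(method, url, language)
    return hashlib.sha256(req.encode()).hexdigest()

# Injector side: store the IPFS hash of the document under the key.
db = {}
db[index_key("GET", "http://example.com/foo.html?bar=baz", "en")] = "QmBLAHBLAH…"

# Client side: rebuild the same canonical request, hash it, look it up.
doc = db.get(index_key("GET", "http://example.com/foo.html?bar=baz", "en"))
```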
From @inetic:
About [what to include in the key], it's probably a very good idea to support multiple languages, but I think the number of variables in the key should be limited as much as possible. This is because with each such variable the number of keys per URL grows exponentially, which would (a) make the database huge and (b) (also exponentially) decrease the number of peers in a swarm corresponding to any particular key. […] Does it make sense to store that the requester asked for HTTP/1.1? Are there modern browsers that don't support compression? Do we care about the order of the requester's language preferences? Do we want two separate swarms for en-US and en with k and l peers respectively, or do we prefer one big swarm with k+l peers? Do we care about the 'q' parameters? Given that we know that example.com/foo.html has MIME type text/html, do we need to store that the client would have accepted other types as well?
Lastly, I think the main reason to hash the keys would be to obfuscate the content, so that it wouldn't be trivially possible to see what's stored in the database. On the other hand, it would still be possible just by fetching the values from IPFS, or by guessing. I'm not totally convinced we need that, but I'm not against it either; perhaps we need to list more pros and cons and reach a consensus in the team. Also, there is still the chance that we'll be able to persuade the guys from IPFS to add salt to their mutable DHT data as BitTorrent does. In such a case we wouldn't even need the database.
In the meantime, we could encode the keys in a way similar to what you suggested: concatenating all the important variables into a string, separated by colons. E.g.: GET:http://example.com/foo.html?bar=baz:en
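As a sketch of that encoding (assuming, for illustration, that only the method, URL and a single reduced language tag go into the key):

```python
def canonical_key(method, url, headers):
    # Keep only the primary language tag of the first Accept-Language
    # entry, dropping 'q' parameters and country subtags, so that
    # "en-US,en;q=0.9" reduces to "en".  This is one possible reduction,
    # chosen to keep the number of keys per URL small.
    value = headers.get("Accept-Language", "")
    first = value.split(",")[0].split(";")[0].strip()
    lang = first.split("-")[0].lower() or "*"
    return f"{method.upper()}:{url}:{lang}"

key = canonical_key("GET", "http://example.com/foo.html?bar=baz",
                    {"Accept-Language": "en-US,en;q=0.9,fr;q=0.8"})
# key == "GET:http://example.com/foo.html?bar=baz:en"
```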
From @ivilata:
Regarding [what to include in the key], I acknowledge that the devil is in the details and we should go over HTTP request headers to choose which ones to include and how to preprocess their values to avoid an explosion of keys while not discriminating against some users (e.g. language-wise). I just kept the three headers which I think may affect the actual content returned by the origin server, but careful review is needed. We cannot skip headers like [Accept] (or their values), since the client needs to know the canonical request before getting the answer from the server (e.g. to get content from the cache). […]
Regarding [hashing the keys], hashing is especially useful in this specific proposal, since using the whole request as an index would make the db way bigger. Yes, it practically obfuscates the index of the db, but if the owner of an injector would like to know what it is storing, the injector could as well store the request itself (locally or in IPFS, which should ideally map to the key which appears in the index).
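A sketch of that idea (the table names are invented): the injector keeps a local side table from each hashed key back to the canonical request, so hashing obfuscates the public index without hiding anything from the injector's own operator:

```python
import hashlib

index = {}     # public db: hashed key -> IPFS hash of the document
requests = {}  # injector-local: hashed key -> canonical request

def inject(canonical_req, ipfs_hash):
    key = hashlib.sha256(canonical_req.encode()).hexdigest()
    index[key] = ipfs_hash
    requests[key] = canonical_req  # lets the operator audit what is stored
```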
[…] if Accept-Language includes (say) French and English, we really cannot know what the Content-Language of the response will be until we have the actual response from the server. Thus, the only way to reduce Accept-Language in the canonical request to the actual value of Content-Language from the response would be for the injector to compute it post facto.
Now imagine that the server returned a page in English. If the same or a different client wanted to retrieve the page (with the same FR-EN preference) and it wasn't able to reach the origin (or the injector), then, when canonicalizing the request on its own, if the process just kept French (the first language preference) in Accept-Language, its pre facto version of the request wouldn't match the injector's post facto version, and the client wouldn't be able to retrieve a page which was actually in the distributed cache.
One solution to this is to have a clear canonicalization process which happens pre facto at the client side, so that an injector just checks that its format is ok and forwards it to the origin.
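One way that deterministic pre facto canonicalization might look (a sketch only, assuming the whole normalized preference list is kept in the key instead of a single language, so client and injector derive the same key no matter what the origin later returns):

```python
def canonical_accept_language(value):
    # Normalize deterministically on the client: drop 'q' parameters and
    # country subtags, deduplicate, keep the stated order.  A client
    # sending "fr-FR,fr;q=0.9,en;q=0.8" gets the same result as one
    # sending "fr, en", whatever language the origin answers with.
    langs = []
    for part in value.split(","):
        tag = part.split(";")[0].strip().split("-")[0].lower()
        if tag and tag not in langs:
            langs.append(tag)
    return ",".join(langs)

assert canonical_accept_language("fr-FR,fr;q=0.9,en;q=0.8") == "fr,en"
assert canonical_accept_language("fr, en") == "fr,en"
```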
[…] That's the point where we must strike a balance between diversity (pushing for more/richer headers, e.g. keeping multiple entries in Accept-Language, possibly with country hints) and swarmability/privacy (pushing for less/simpler headers, e.g. having a single, language-only Accept-Language or even none). Maybe there could be a configurable "privacy level" (or its inverse) where a user could progressively toggle content customization options (language, encoding, etc.) to get different levels of privacy, customization or swarmability. It would affect which headers would be included in the request and their richness, but in any case the rules used to canonicalize these headers should be clear.
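A sketch of how such a privacy level might map onto the headers kept in the canonical request (the levels and header sets below are invented for illustration):

```python
# Higher level = fewer/simpler headers in the key: more privacy and
# bigger swarms, but less content customization.
PRIVACY_LEVELS = {
    0: ["Accept", "Accept-Encoding", "Accept-Language"],  # richest keys
    1: ["Accept-Language"],                               # language only
    2: [],                                                # method and URL only
}

def headers_for_key(level, request_headers):
    kept = PRIVACY_LEVELS[level]
    return {name: request_headers[name] for name in kept if name in request_headers}
```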
From @inetic:
If we don't hash the canonicalized requests, then the client could apply its own logic for choosing a language.
E.g. say that the database contained entries:
GET:http://example.com/foo.html?bar=baz:en
GET:http://example.com/foo.html?bar=baz:fr
GET:http://example.com/foo.html?bar=baz:es
and the user sent a request with Accept-Language listing fr first and then en. The client would in such a case be able to sort these entries and return the fr version first. Granted, this could get more complicated if we started requiring sorting by multiple parameters, though I'd say it's still preferable to spend CPU cycles on the user's device than to reduce swarm sizes.
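A sketch of that client-side selection logic, assuming unhashed keys of the form shown above and a simple prefix scan over the db:

```python
def best_entry(db_keys, method, url, preferred_langs):
    # Collect the languages available for this method and URL...
    prefix = f"{method.upper()}:{url}:"
    available = {k.rsplit(":", 1)[1]: k for k in db_keys if k.startswith(prefix)}
    # ...and return the first one matching the user's preference order.
    for lang in preferred_langs:
        if lang in available:
            return available[lang]
    return None

keys = ["GET:http://example.com/foo.html?bar=baz:" + l for l in ("en", "fr", "es")]
assert best_entry(keys, "GET", "http://example.com/foo.html?bar=baz",
                  ["fr", "en"]).endswith(":fr")
```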
As for the argument of hashing the canonicalized requests to compress the keys, I think compressing the whole database before it's put into IPFS may be a better approach (or perhaps IPFS already does so?).
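For scale, that could be as simple as gzip-compressing the serialized index before adding it to IPFS (a sketch; the JSON serialization is just an assumption):

```python
import gzip, json

def pack_db(db):
    # Plain-text keys compress well, since they share long URL prefixes.
    return gzip.compress(json.dumps(db).encode())

blob = pack_db({"GET:http://example.com/foo.html?bar=baz:en": "QmBLAHBLAH…"})
# `blob` is what would be published to IPFS instead of the raw index.
```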