
QUERY: Query body normalization is a bad idea

Open nicowilliams opened this issue 1 year ago • 13 comments

See discussion in HN thread.

nicowilliams avatar Sep 16 '24 21:09 nicowilliams

It might be a good idea to have a NORMALIZE method for requesting that the server normalize a QUERY body.

Basically, client-side normalization is going to be heavily dependent on the query MIME type. Do MIME registrations even provide normalization functions? Well, the registrations are fairly free-form, so that question is probably not relevant.

When JSON normalization has been discussed before, it's proven to be an extremely tricky topic.

Server-side normalization, however, is much more likely to be possible, especially if the server parses a query into an AST, as then a canonical representation of the AST should be possible without having to specify any details in any RFCs.
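A minimal sketch of that idea, using JSON as a stand-in query language (the function name and query shape are made up for illustration): the server parses the body into a tree and re-serializes it deterministically, so any two bodies with the same AST collapse to one canonical form.

```python
import json

def canonicalize_query(body: str) -> str:
    """Parse the query into a tree (here plain JSON stands in for the
    query language) and re-serialize it deterministically, so any two
    bodies with the same AST collapse to one canonical form."""
    tree = json.loads(body)  # the "AST" for this toy query language
    return json.dumps(tree, separators=(",", ":"), sort_keys=True,
                      ensure_ascii=False)

# Two differently formatted but semantically identical queries:
a = canonicalize_query('{"select": ["name"], "where": {"id": 7}}')
b = canonicalize_query('{ "where": {"id": 7},\n  "select": ["name"] }')
assert a == b == '{"select":["name"],"where":{"id":7}}'
```

Note that no RFC needs to specify the canonical form: it only has to be stable within one server implementation.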

nicowilliams avatar Sep 16 '24 21:09 nicowilliams

Are you suggesting to add text about normalization? Precisely where?

reschke avatar Sep 17 '24 05:09 reschke

OLD:

The cache key for a query (see Section 2 of [HTTP-CACHING]) MUST incorporate the request content. When doing so, caches SHOULD first normalize request content to remove semantically insignificant differences, thereby improving cache efficiency, by:

  • Removing content encoding(s)
  • Normalizing based upon knowledge of format conventions, as indicated by the media type suffix in the request's Content-Type field (e.g., "+json")
  • Normalizing based upon knowledge of the semantics of the content itself, as indicated by the request's Content-Type field.

Note that any such normalization is performed solely for the purpose of generating a cache key; it does not change the request itself.

NEW:

The cache key for a query (see Section 2 of [HTTP-CACHING]) MUST incorporate the request body.

Caches SHOULD, when possible, perform query normalization when constructing cache keys. The extent to which query normalization can be performed, and how, is either a local implementation detail or a detail of the request body's MIME type, but note that at the time of this writing MIME type specifications and registrations do not typically include any normalization functions.

nicowilliams avatar Sep 17 '24 11:09 nicowilliams

I disagree.

This is an implementation detail. A server can normalize based on how it uses the media type. No public spec is required here.

reschke avatar Sep 18 '24 07:09 reschke

Yes, the server can normalize however it wants and it's a local detail for it. But caches need not be located in the server, and caches may not be able to normalize the same way as the servers.

nicowilliams avatar Sep 18 '24 10:09 nicowilliams

I’d like to expand on the semantic nature of QUERY. Given the name, it implies querying an endpoint, but how do we semantically define what constitutes a "query"? Without clear guidelines, there’s potential for abuse, as QUERY could be used for anything.

Unlike GET, POST, PUT, or PATCH, which are generic, QUERY sounds specific, but the ambiguity opens the door to misuse. Additionally, with many possible query languages (e.g., SQL, GraphQL), if the protocol doesn’t define this, as it does with URIs (which have a well-defined query syntax), maybe QUERY isn’t the best name. Maybe: FETCH, RETRIEVE, REQUEST

And although I don't like the idea of doing a GET with a body, the body could be kept very strict, like the query params of a URL.

rafageist avatar Sep 30 '24 00:09 rafageist

It is generic. It's a safe method that, contrary to GET, can use a body.

What kind of "abuse" do you fear? If it's about not using URIs for stuff that could have a URI, that would indeed be sub-optimal, and we can add a reminder that this is not the point of QUERY. Ultimately, it's up to the people building stuff on top of HTTP.

Specific semantics will be defined based on media types of the request body.

reschke avatar Sep 30 '24 05:09 reschke

If the body content behaves similarly to a POST, a more neutral name might be more fitting, as it wouldn't limit its usage to strict queries. This would help avoid confusion about its actual purpose and align the functionality with the name. In this case, QUERY (formerly known as SEARCH) wouldn't be an enhanced GET, but rather an idempotent POST.

rafageist avatar Sep 30 '24 20:09 rafageist

@rafageist, if you disagree with the name, please open a separate issue. That's not the topic of this issue.

MikeBishop avatar Oct 01 '24 14:10 MikeBishop

@rafageist, if you disagree with the name, please open a separate issue. That's not the topic of this issue.

Ok, but the title says "Query body normalization is a bad idea", and maybe "Fetch body normalization is a good idea".

rafageist avatar Oct 01 '24 17:10 rafageist

The ticket is about what is currently in the spec, and that is QUERY. Can we please stop the nitpicking?

As Mike said, if you want to discuss the method name, open a separate ticket. Before you do so, please have a look at https://github.com/httpwg/http-extensions/issues/1614.

reschke avatar Oct 02 '24 04:10 reschke

@nicowilliams , I'd like to get this discussion resolved.

First of all, once a server responds with a Location field, a cache can simply cache this like a GET to that URI, and the issue goes away. (Right?)

I also don't think it would be good to discourage caches from normalizing. The three types of normalization seem pretty safe to me. If a server treats those types of normalization differently, that would essentially be a very stupid thing for the server to do.

reschke avatar Oct 17 '24 15:10 reschke

Also, would it be correct to say that this is not about normalization in general, but specifically about normalization outside the origin server?

reschke avatar Oct 17 '24 15:10 reschke

@nicowilliams it would be helpful to understand the motivation for the change. Pointing to a huge HN thread requires the reader to do an unreasonable amount of work - can you summarize?

mnot avatar Nov 07 '24 10:11 mnot

Discussed offline with @reschke. Absent more information, we think the best path forward is to add a security consideration along the lines of:

Caches that normalize QUERY content incorrectly or in ways that are significantly different than how the server processes the content can return the incorrect response if normalization results in a false positive.

mnot avatar Nov 17 '24 10:11 mnot

Should “server” be “resource”?

martinthomson avatar Nov 17 '24 20:11 martinthomson

Sorry for the super delayed answers!

Caches that normalize QUERY content incorrectly or in ways that are significantly different than how the server processes the content can return the incorrect response if normalization results in a false positive.

If the cache is shared this can cause cache poisoning.

Note that any such normalization is performed solely for the purpose of generating a cache key; it does not change the request itself.

If the cache normalizes incorrectly then this can yield aliasing issues.

What I don't understand is: how shall caches even normalize?

Normalizing based upon knowledge of the semantics of the content itself, as indicated by the request's Content-Type field.

Hmm, so do MIME type registrations indicate how to normalize documents? Must they now do so for content types that are suitable for use in QUERY request bodies?

nicowilliams avatar Mar 21 '25 22:03 nicowilliams

Caches that normalize QUERY content incorrectly or in ways that are significantly different than how the server processes the content can return the incorrect response if normalization results in a false positive.

If the cache is shared this can cause cache poisoning.

So clarify that this would be specifically bad for shared caches?

Note that any such normalization is performed solely for the purpose of generating a cache key; it does not change the request itself.

If the cache normalizes incorrectly then this can yield aliasing issues.

Please explain "aliasing" in this context.

Normalizing based upon knowledge of the semantics of the content itself, as indicated by the request's Content-Type field.

Hmm, so do MIME type registrations indicate how to normalize documents? Must they now do so for content types that are suitable for use in QUERY request bodies?

Some. No.

reschke avatar Mar 22 '25 06:03 reschke

Caches that normalize QUERY content incorrectly or in ways that are significantly different than how the server processes the content can return the incorrect response if normalization results in a false positive.

If the cache is shared this can cause cache poisoning.

So clarify that this would be specifically bad for shared caches?

If two different queries with different semantics get normalized to the same query then one user-agent can poison the cache for another. That's what I meant by aliasing. I suppose this seems unlikely.

Maybe I can rephrase my question entirely in terms of MIME types:

How shall a cache implementor normalize a QUERY request body? What references shall they look at? The MIME type registry's registration entries appear to be completely mum on this topic. JSON will commonly be used I imagine, but neither JSON nor I-JSON specify a canonical or deterministic encoding option. CBOR does specify a deterministic encoding option (which used to be referred to as canonical), so presumably MIME types that make use of CBOR can have QUERY bodies normalized by caches.

Perhaps JSON can be normalized to a degree: remove interstitial whitespace, canonicalize strings so that only characters that must be escaped are escaped, and... do something about number representations. Normalization is a private detail of the cache implementation, I suppose. As long as the implementors do not implement normalization incorrectly, I guess it's Ok. Fine.
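That sort of partial normalization is nearly a one-liner with an off-the-shelf JSON library (a sketch; the number problem is deliberately left unsolved here):

```python
import json

def partial_normalize(body: str) -> str:
    """Partial JSON normalization along those lines: a load/dump round
    trip drops interstitial whitespace and reduces string escapes to
    the minimum, but number representations can still change (e.g.
    1e2 becomes 100.0), so this is not a full canonical form."""
    return json.dumps(json.loads(body), separators=(",", ":"),
                      ensure_ascii=False)

assert partial_normalize('[ 1, 2, "3", 4 ]') == '[1,2,"3",4]'
assert partial_normalize('"\\u00e9"') == '"é"'  # minimal escaping kept
```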

nicowilliams avatar Mar 23 '25 00:03 nicowilliams

(Note: I didn't read the HN thread, as the link doesn't work for me. Perhaps I'm just repeating ideas that were already discussed there. Also, apologies for resurrecting a closed issue – I hope it's better than creating a new one for what's essentially the same problem.)

I think the spec should contain some advisory text for how to opt out of such normalization. Basically any service that parses user requests and returns detailed error information in the (fairly common) form of “line X, column Y: Incorrect ABC, expected DEF” would be negatively impacted by it, because any normalization is all but guaranteed to change these positions, even if it doesn't change the semantics of the processing.

This goes double for services which are specifically supposed to process potentially incorrect data, such as validators, or services which depend on the usually insignificant details, such as code/data formatters, formatting checkers or linters.

What is the best way of opting out? I can see three options:

  1. Use POST instead of QUERY, with all the downsides that QUERY is supposed to avoid.
  2. Specify a different content type, such as application/octet-stream and pass the actual type in a different manner, e.g. an X-Actual-Content-Type header (ugly!).
  3. Specify the Cache-Control: no-transform header in the request. I'm not sure whether this falls under the intended usage of the header, so if you believe it does, I believe the query-method spec should specifically point out that proxies MUST honor the header also for purposes of caching.
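Option 3, spelled out as a raw HTTP/1.1 message (purely illustrative construction; whether caches must honor no-transform when building cache keys is exactly the open question):

```python
def build_query_request(host: str, target: str, content_type: str,
                        body: bytes) -> bytes:
    """Compose a QUERY request carrying Cache-Control: no-transform
    (option 3 above). Illustrative message construction only."""
    headers = (
        f"QUERY {target} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(body)}\r\n"
        f"Cache-Control: no-transform\r\n"
        "\r\n"
    )
    return headers.encode("ascii") + body

req = build_query_request("example.com", "/sum", "application/json",
                          b'[ 1, 2, "3", 4 ]')
assert req.startswith(b"QUERY /sum HTTP/1.1\r\n")
assert b"Cache-Control: no-transform\r\n" in req
```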

vidraj avatar Jun 05 '25 06:06 vidraj

I think the spec should contain some advisory text for how to opt out of such normalization. Basically any service that parses user requests and returns detailed error information in the (fairly common) form of “line X, column Y: Incorrect ABC, expected DEF” would be negatively impacted by it, because any normalization is all but guaranteed to change these positions, even if it doesn't change the semantics of the processing.

Hm, no. Normalization is about caches. It does not happen in the origin server (well, unless the server opts to do it).

For instance, a server that consumes XML (such as an XSLT processor) could, upon parse errors, return whatever the XML parser makes available.

reschke avatar Jun 05 '25 16:06 reschke

I'll try to explain the issue with an example, hopefully that'll get my point across better than an abstract description. O:-)


Let's assume that two clients, A and B, are querying a server through a common cache. The server has an endpoint, /sum, which calculates a sum of a JSON array of numbers.

  1. Client A sends a query
    [ 1, 2, "3", 4 ]
    
    The cache passes it on as-is (no transformation takes place, therefore Cache-Control: no-transform is not needed).
  2. The server replies with a 400-class response saying Error: string found on line 1, number expected, with Cache-Control: max-age=3600. This is simplistic, but reasonable error handling. The cache stores the response under a key containing (among other details) the transformed body [1,2,"3",4] with all whitespace removed.
  3. Client B sends a query
    [
        1,
        2,
        "3",
        4
    ]
    
    The cache looks it up internally under key [1,2,"3",4]. Since there is a fresh cached response, it doesn't pass the request further.
  4. The cache replies with Error: string found on line 1, number expected, a wrong message for this request – the actual error is on line 4.
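The failure mode above reduces to a key collision, which is easy to reproduce (a sketch of the hypothetical cache's normalization):

```python
import json

def cache_key(body: str) -> str:
    # The hypothetical cache's normalization: strip all insignificant
    # whitespace before keying.
    return json.dumps(json.loads(body), separators=(",", ":"))

body_a = '[ 1, 2, "3", 4 ]'                        # client A: one line
body_b = '[\n    1,\n    2,\n    "3",\n    4\n]'   # client B: five lines
assert cache_key(body_a) == cache_key(body_b)
# Same key, so B is served A's cached "line 1" error message.
```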

vidraj avatar Jun 06 '25 04:06 vidraj

(This doesn't even require two clients: A single client who edits the first query, reformatting it on multiple lines to get better error information, would run into the same problem.)

vidraj avatar Jun 06 '25 04:06 vidraj

So you're saying the cache caches the 4xx?

reschke avatar Jun 06 '25 04:06 reschke

Yes, I'm assuming the presence of a working, HTTP-semantics-compliant shared cache.

vidraj avatar Jun 06 '25 06:06 vidraj

https://greenbytes.de/tech/webdav/rfc9110.html#overview.of.status.codes

So no, a 400 response by default is not cacheable.

reschke avatar Jun 06 '25 08:06 reschke

It's not heuristic caching here, the response has an explicit Cache-Control header.

From RFC 9111 (excerpt):

A cache MUST NOT store a response to a request unless:

  • the request method is understood by the cache;
  • the response status code is final […];
  • […] the cache understands the response status code;
  • the no-store cache directive is not present in the response […];
  • if the cache is shared: the private response directive is […] not present […];
  • if the cache is shared: the Authorization header field is not present […];
  • the response contains at least one of the following: […] a max-age response directive […]

All of these are fulfilled, so the cache may store the response. This hypothetical cache does.
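Spelling the quoted clauses out as a predicate makes the point mechanical (abridged to the conditions relevant here; names are illustrative):

```python
def may_store(shared: bool, method_understood: bool, status: int,
              status_understood: bool, resp_cc: set[str],
              has_authorization: bool) -> bool:
    """Abridged sketch of the RFC 9111 storage conditions quoted above."""
    if not (method_understood and status_understood):
        return False
    if status < 200:                        # must be a final status code
        return False
    if "no-store" in resp_cc:
        return False
    if shared and ("private" in resp_cc or has_authorization):
        return False
    # Needs at least one explicit freshness signal (abridged to max-age).
    return any(d.startswith("max-age") for d in resp_cc)

# The hypothetical /sum 400 response with Cache-Control: max-age=3600:
assert may_store(True, True, 400, True, {"max-age=3600"}, False)
```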

vidraj avatar Jun 06 '25 08:06 vidraj

And this is not just a hypothetical – real-world implementations behave like this, at least according to the excellent https://cache-tests.fyi/ page (under "An optimal HTTP cache reuses a fresh 400 response with explicit freshness").

vidraj avatar Jun 06 '25 08:06 vidraj

OK.

So why would the origin server want to make the response cacheable?

Also, what's the probability that two clients send the same problematic request and actually have some way to process the response? IMHO that requires a human to understand and act on it.

This really seems like an edge case, which, FWIW, would not be specific to QUERY.

reschke avatar Jun 07 '25 04:06 reschke

So why would the origin server want to make the response cacheable?

I have three answers to that:

  1. Why wouldn't it? If you send the same request, you get the same response, so cache it. Allowing caching should be the default, rather than something exceptional.
  2. It allows the user-agent to cache the response. Caching in HTTP is uniform in the sense that you can't target specific caches (except for the private/public distinction, which is useless here), so if you want to alleviate the load on your servers by allowing browsers to just give the user back the previous response if they mash the submit button, or if they have the page open for months while closing and restarting the browser, you have to allow caching at all levels.
  3. I honestly thought the whole point of the QUERY proposal is to make more responses cacheable – it's a niche method, and the primary (niche) use is to allow clients to make cacheable structured, long queries. Unstructured and short queries fit into the URLs of GET requests, uncacheable queries can be made using POSTs. Structured short queries can be made using GETs, but it requires app-specific handling (e.g. /sum?q=%5B1%2C2%2C%223%22%2C4%5D&qtype=text%2Fjson).
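As a quick check, the percent-encoded query in that last example does round-trip to the array from the earlier comment:

```python
from urllib.parse import quote, unquote

# The q parameter from the /sum example above:
encoded = quote('[1,2,"3",4]', safe="")
assert encoded == "%5B1%2C2%2C%223%22%2C4%5D"
assert unquote(encoded) == '[1,2,"3",4]'
```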

Also, what's the probability that two clients send the same problematic request and actually have some way to process the response? IMHO that requires a human to understand and act on it.

Yes, a human makes the request and is then misled by the response. Is that surprising? Error messages are often meant to be shown to humans (perhaps through a syslog of some kind), even when the happy path (with successful response) leads to automatic processing.

As for the probability: I already gave you an example yesterday that doesn't depend on there being multiple clients at all. One client makes a request, realizes there's an error in it, submits a reformatted query to get a more informative message, gets the same message regardless.

If this client is a “hacker” (can open browser devtools, knows how to work with them), they might be able to force their user-agent to send the request with Cache-Control: no-cache to make the cache reload the response from the server. But you can't really bake this into the application outright, otherwise you lose all other desirable effects of caching, and you can't expect ordinary users to do this.

This really seems like an edge case, which, FWIW, would not be specific to QUERY.

Can you give me an example of another HTTP method that allows caches to serve mismatched responses? Also, yes, error handling is often considered only as an afterthought, if at all. I believe we should be trying to do better than that.

vidraj avatar Jun 07 '25 11:06 vidraj