lapis
lapis copied to clipboard
Lapis is using the wrong request uri from ngx
https://github.com/leafo/lapis/blob/d3a16d142f6127e8612178bf57209185b14e2d4e/lapis/nginx.moon#L77-L89
Lapis is checking ngx.var.request_uri for the request's uri, but this variable only points to the main request and uses a raw uri string which has some potential downsides (see below SO article). Instead, Lapis should use ngx.var.uri which is the current request uri, including any sub-requests. It is also normalized so any funky issues with the uri, including any encoding, is removed.
As of right now, using ngx.location.capture() to make sub-requests is wholly broken and unusable as it just loops the main request until the server throws a 500 error.
https://stackoverflow.com/questions/48708361/nginx-request-uri-vs-uri
I seem to remember request_uri being chosen intentionally here, I don't think I anticipated the issue with using capture on the same app though.
From what that SO article notes, it would seem that the two major differences between request_uri and uri are that request_uri will contain query strings whereas uri does not (the code following the variable assignment immediately cuts that off anyway) and that uri is normalized, which is probably more useful than the raw request string for almost every case.
Doing a bit more research, it appears that nginx as a whole prefers normalized URIs, citing security reasons among others. request_uri seems to be provided for cases such as redirecting rather than direct use, the opposite of Lapis' current behaviour.
While this would be a breaking change, I think switching to uri is the correct and expected behaviour here, regardless of the sub-request case. I think it is also fair to say that any form of user input, whether it be a web address or form data is subject to scrutiny and /asdf///////%20qwerty being rewritten as /asdf/ qwerty is the expected result for 99.99% of applications. For applications that want explicit/horrible routes, there are nginx configuration variables that allow you to disable the normaization of uri.
- https://nginx.org/en/docs/http/ngx_http_core_module.html#location
- https://nginx.org/en/docs/http/ngx_http_core_module.html#merge_slashes
Going beyond nginx and reading the specification for URIs (https://tools.ietf.org/html/rfc3986) in general, I think there has been some misunderstandings on what a URI actually is based on our conversations on Discord over the past few days.
What is a URI?
https://tools.ietf.org/html/rfc3986#section-1
A Uniform Resource Identifier (URI) provides a simple and extensible
means for identifying a resource. This specification of URI syntax
and semantics is derived from concepts introduced by the World Wide
Web global information initiative, whose use of these identifiers
dates from 1990 and is described in "Universal Resource Identifiers
in WWW" [RFC1630].
The spec defines a URI as a "simple and extensible means of identifying a resource", also noting that it has specific syntax and semantics that must be followed. Because it has syntax and semantics, a URI is not just a [unique] string such as the key in a key/value pair, but an actual path to a resource, relative or absolute.
https://tools.ietf.org/html/rfc3986#section-3
foo://example.com:8042/over/there?name=ferret#nose
\_/ \______________/\_________/ \_________/ \__/
| | | | |
scheme authority path query fragment
Most of what I'll be writing about and quoting will be about the path section of a fully qualified URI, as that seems to be the area of contention.
What is a URI's Path?
https://tools.ietf.org/html/rfc3986#section-3.3
The path component contains data, usually organized in hierarchical
form, that, along with data in the non-hierarchical query component
(Section 3.4), serves to identify a resource within the scope of the
URI's scheme and naming authority (if any).
This definition notes specifically that a URI's path is usually organized in a "hierarchical form". This is analoguous to (but not necessarily equivalent to) a structured filesystem. Each segment in a path is typically though of as a nested layer of directories where the left-most segment encapsulates all segments right of it. Similar to a filesystem, we'd expect something akin to:
example.com
/path
/to
/my
file.html
If we look at a typical blog-style path, we may see something like /articles/2020/10/06/this-is-my-article where each segment encapsulates all the segments to the right of itself. All articles are namespaced in the articles segment, and that can be broken down by years, and each of those years broken down by months, and each of those months by days, and each of those days will contain the articles.
The path segments "." and "..", also known as dot-segments, are
defined for relative reference within the path name hierarchy. They
are intended for use at the beginning of a relative-path reference
(Section 4.2) to indicate relative position within the hierarchical
tree of names. This is similar to their role within some operating
systems' file directory structures to indicate the current directory
and parent directory, respectively. However, unlike in a file
system, these dot-segments are only interpreted within the URI path
hierarchy and are removed as part of the resolution process (Section
5.2).
Relative path segments such as . and .. are required to be correctly resolved before being processed. Unlike a filesystem, these are meant to be resolved within the confines of the relative URI and cannot break out of the base URI.
URI Equivalency
A significant part of our discussion was based around the notion that one URI should point to one resource, and aliases should not exist at all. This isn't necessarily wrong, but I believe the misunderstanding here is focused on what constitutes an alias.
https://tools.ietf.org/html/rfc3986#section-6.1
In testing for equivalence, applications should not directly compare
relative references; the references should be converted to their
respective target URIs before comparison.
https://tools.ietf.org/html/rfc3986#section-6.2.1
False negatives are caused by the production and use of URI aliases.
Unnecessary aliases can be reduced, regardless of the comparison
method, by consistently providing URI references in an already-
normalized form (i.e., a form identical to what would be produced
after normalization is applied, as described below).
https://tools.ietf.org/html/rfc3986#section-6.2.2
Implementations may use logic based on the definitions provided by
this specification to reduce the probability of false negatives.
This processing is moderately higher in cost than character-for-
character string comparison. For example, an application using this
approach could reasonably consider the following two URIs equivalent:
example://a/b/c/%7Bfoo%7D
eXAMPLE://a/./b/../b/%63/%7bfoo%7d
Web user agents, such as browsers, typically apply this type of URI
normalization when determining whether a cached response is
available. Syntax-based normalization includes such techniques as
case normalization, percent-encoding normalization, and removal of
dot-segments.
https://tools.ietf.org/html/rfc3986#section-6.2.2.2
The percent-encoding mechanism (Section 2.1) is a frequent source of
variance among otherwise identical URIs. In addition to the case
normalization issue noted above, some URI producers percent-encode
octets that do not require percent-encoding, resulting in URIs that
are equivalent to their non-encoded counterparts. These URIs should
be normalized by decoding any percent-encoded octet that corresponds
to an unreserved character, as described in Section 2.3.
https://tools.ietf.org/html/rfc3986#section-6.2.2.3
The complete path segments "." and ".." are intended only for use
within relative references (Section 4.1) and are removed as part of
the reference resolution process (Section 5.2). However, some
deployed implementations incorrectly assume that reference resolution
is not necessary when the reference is already a URI and thus fail to
remove dot-segments when they occur in non-relative paths. URI
normalizers should remove dot-segments by applying the
remove_dot_segments algorithm to the path, as described in
Section 5.2.4.
https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Generic_syntax
A path component, consisting of a sequence of path segments separated by a slash (/). A path is always defined for a URI, though the defined path may be empty (zero length). A segment may also be empty, resulting in two consecutive slashes (//) in the path component.
All of the rules above talk about URI references and normalization. URI references, as far as I can determine, are the raw URIs that come into a system and must be normalized to produce the URI target. This normalization process should be applied to every URI reference at its destination. We can think of this similar to encoding/decoding JSON: the requesting client encodes JSON as the final step before sending the request, and the responding server decodes it as the first step after it is received. Because of this, reference URIs are not aliases, they are just raw data waiting to be normalized. It is only after normalization that we can determine if they are aliases.
Though I cannot find it specifically in the spec document, Wikipedia states that a path segment can be empty and thus many slashes can be adjacent to each other. Because of the semantics of path segments, these segments provide no information, hierarchical or otherwise, and can safely be compressed down to a single slash.
For URI normalization, the correct steps are as follows:
- Compress empty segments
- Resolve any dot-segments that may exist
- Decode percent-encoded data
Note that these are the rules that are applied to the ngx.var.uri variable.
ngx.var.uri normalization is stupid and wrong. It does not take in account the reserved characters.