amppackager
Figure out proper URL-escaping once and for all
Neither url.Parse() nor url.String() sanitizes the query component. See e.g. https://play.golang.org/p/8hsD2WeMVYD
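A minimal repro of the pass-through behavior (roughly what the playground link shows; the `roundTrip` helper name is mine): the query string is stored verbatim in `RawQuery` and re-emitted without escaping.

```go
package main

import (
	"fmt"
	"net/url"
)

// roundTrip parses a URL and re-serializes it. net/url stores the query
// verbatim in RawQuery, and String() emits it without re-escaping, so
// unsafe characters like spaces and double quotes survive untouched.
func roundTrip(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u.String()
}

func main() {
	// The space and double quote in the query come back unescaped.
	fmt.Println(roundTrip(`https://example.com/path?q=a b"c`))
}
```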
#190 and #192 fix some cases resulting from this, but there may be other edge cases that fall out from this quick fix. Investigate https://golang.org/src/net/url/url.go?s=7976:8008#L96 and see how it compares to what we should be doing per https://tools.ietf.org/html/rfc3986.
https://tools.ietf.org/html/rfc3986#section-3.4 says query = *( pchar / "/" / "?" ). Encoding a query string as a path component would extraneously escape / and ?. This seems fine.
In addition, per https://tools.ietf.org/html/rfc3986#section-3.3, pchar may or may not contain :, depending on context. Golang acts conservatively and always escapes it. This seems fine, too.
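Both claims above can be checked directly against the stdlib escapers (a quick sketch, not amppackager code; the wrapper names are mine):

```go
package main

import (
	"fmt"
	"net/url"
)

// Thin wrappers (hypothetical names) around the stdlib escapers,
// just to make the two claims checkable.
func asPathSegment(s string) string    { return url.PathEscape(s) }
func asQueryComponent(s string) string { return url.QueryEscape(s) }

func main() {
	// Encoding query data as a path segment extraneously escapes "/" and "?";
	// that is lossless, so it seems fine.
	fmt.Println(asPathSegment("a/b?c")) // a%2Fb%3Fc

	// Go conservatively escapes ":" in query components, even though pchar
	// can allow it depending on context.
	fmt.Println(asQueryComponent("a:b")) // a%3Ab
}
```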
AFAICT, nothing remains unescaped that should be escaped.
https://play.golang.org/p/aVV6I6q-mPV demonstrates that > in the host goes unescaped. This is definitely in violation of the ABNF for host at https://tools.ietf.org/html/rfc3986#section-3.2.2, and will likely break Link header parsers.
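A sketch of that repro (the `hostRoundTrip` helper name is mine):

```go
package main

import (
	"fmt"
	"net/url"
)

// hostRoundTrip parses a URL and re-serializes it.
func hostRoundTrip(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u.String()
}

func main() {
	// ">" in the host parses without error and is emitted unescaped,
	// violating the host ABNF in RFC 3986 section 3.2.2.
	fmt.Println(hostRoundTrip("https://exa>mple.com/"))
}
```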
The Go parser explicitly allows "<" and ">": https://golang.org/src/net/url/url.go?s=2590:2650#L100
I wonder if a simple post-processor that globally replaces '<' and '>' in the result of url.String() would suffice? That would escape them throughout (host, path, query, fragment).
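Such a post-processor might look roughly like this (a sketch; the `escapeAngles` helper name is mine). Since net/url never emits a raw "<" or ">" with any structural meaning, a blind global replace on the serialized URL should be safe:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// escapeAngles is a hypothetical post-processor for the output of
// url.String(): it pct-encodes "<" and ">" everywhere they appear
// (host, path, query, fragment).
func escapeAngles(s string) string {
	s = strings.ReplaceAll(s, "<", "%3C")
	return strings.ReplaceAll(s, ">", "%3E")
}

func main() {
	u, err := url.Parse("https://exa>mple.com/?<x>")
	if err != nil {
		panic(err)
	}
	fmt.Println(escapeAngles(u.String()))
}
```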
We can't escape the host value directly on the URL struct between Parse and String, because the pct-encoding itself gets encoded yet again, e.g. https://play.golang.org/p/hCNEKynIvXl
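For reference, the double-encoding in that playground looks roughly like this (a sketch; the `preEscapedHost` helper name is mine):

```go
package main

import (
	"fmt"
	"net/url"
)

// preEscapedHost sets an already-pct-encoded host on a parsed URL and
// re-serializes it, demonstrating that String() escapes the "%" again.
func preEscapedHost() string {
	u, err := url.Parse("https://example.com/")
	if err != nil {
		panic(err)
	}
	u.Host = "exa%3Emple.com" // our own escaping of ">"
	return u.String()         // the "%" itself gets escaped to %25
}

func main() {
	fmt.Println(preEscapedHost())
}
```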
Wow, there's no way to represent a pct-encoded host in Go's url package? That's frustrating. Yeah, I guess a post-processor would work. Are there any other magic characters we need to worry about?
URLs get sanitized in a few different places:
- https://github.com/ampproject/amppackager/blob/73dec5d4648395b1ce47deb9d0065c2c719c8986/packager/signer/validation.go#L16
- https://github.com/ampproject/amppackager/blob/73dec5d4648395b1ce47deb9d0065c2c719c8986/packager/signer/signer.go#L384
- https://github.com/ampproject/amppackager/blob/73dec5d4648395b1ce47deb9d0065c2c719c8986/transformer/internal/amphtml/urls.go#L54
I need to do an audit of all of these, and hopefully unify some of the code where requirements are aligned.
Unfortunately, I'm really far away from being able to prioritize this. My WIP is at https://github.com/twifkak/amppackager/tree/url_escaping. I think the way forward is to implement our own URL parser (for the subset of the spec that we care about, e.g. http/https).
Some notes from my attempt:
- Using the uriparser library for fuzz testing worked really well. Anecdotally, the lib seems accurate. Any mismatches I found so far were Go's fault.
- The random sample generator could use improvement to better exercise all the code paths in the URL parser (like afl does). That said, the current approach of adding hacks to minimize the test space when we ran out of bugs at the higher level seems to work okay.
- We could use uriparser in prod, but AFAIK cgo limits portability, so I lean against this. That said, maybe it's sufficient if it works in Alpine?
- My attempts to hack around the Go lib with input/output mods were getting more and more difficult. One unsolved problem I recall is that the Go lib doesn't store whether the URL ends in a ?, so we would need to keep some side state.
Just a random drive-by comment without understanding the full scope: it might be possible to port https://github.com/ampproject/amphtml/blob/master/validator/js/engine/parse-url.js fairly easily (it's minimal and doesn't have any JavaScript-specific bits or dependencies).
Thanks for the tip. I'm not familiar with parse-url.js, but its test set looks pretty light compared to the branching factor of RFC 3986, so I fear it may have some edge-case bugs, too. I'd prefer to start from RFC 3986 (or a library that hews closely to it) as a base. Note that the SXG inner = outer requirement drives, I think, stricter requirements than are needed by the AMP JS validator. (Edit: also the content sniffing concern.)
I think I would avoid the whatwg URL spec because that tries to map to what browsers do (including all sorts of semantic-breaking transformations that better meet expectations of users typing non-compliant URLs).
My recollection is that libcurl is somewhere in-between those two specs.
Fair enough, drive-by comment retracted. :)