Double decoding path parameters
Hi, I have a question about decoding path params. I have a URL that contains encoded symbols in both its static and parametric parts. To match the route, I decode the URL using the decodeURI function, but the parameter contains one of the symbols (# $ & + , / : ; = ? @) that decodeURI does not decode; it only decodes with the decodeURIComponent function. If someone encodes one of these symbols twice, it will also be decoded twice. I'm wondering whether there would be a double-decoding problem in this case? https://owasp.org/www-community/Double_Encoding
'use strict'
const { equal } = require('assert')
const { match } = require('path-to-regexp')

function doubleEncode (str) {
  return encodeURIComponent(encodeURIComponent(str))
}

function singleEncode (str) {
  return encodeURIComponent(str)
}

const handler = match('/🍌/:id', { decode: decodeURIComponent })

// decodeURI leaves %23 alone, so decode turns it into '#'.
equal(handler(decodeURI(`/%F0%9F%8D%8C/${singleEncode('#')}`)).params.id, '#') // ok
// decodeURI turns %2523 into %23, then decode turns that into '#': two decodes.
equal(handler(decodeURI(`/%F0%9F%8D%8C/${doubleEncode('#')}`)).params.id, '#') // ok
equal(handler(decodeURI(`/%F0%9F%8D%8C/${doubleEncode('#')}`)).params.id, singleEncode('#')) // fails
There definitely would be if you're using both decode and decoding the input before handling it. In that case you're probably better off encoding the input string to match using encodeURI and keeping the decode logic the same. I haven't thought it through 100%, but it seems logical enough.
Have you considered using { encode: encodeURI, decode: decodeURIComponent }?
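For illustration, a minimal sketch of that combination, assuming a path-to-regexp version (e.g. v6) where match accepts both options: encode is applied to the static parts of the pattern, decode to the matched parameters, and the decodeURI step on the input is skipped entirely.

'use strict'
const { equal } = require('assert')
const { match } = require('path-to-regexp')

// encode turns the literal 🍌 into %F0%9F%8D%8C inside the generated regex,
// so the raw (still-encoded) URL can be matched directly.
const handler = match('/🍌/:id', { encode: encodeURI, decode: decodeURIComponent })

equal(handler('/%F0%9F%8D%8C/%23').params.id, '#')     // single-encoded '#'
equal(handler('/%F0%9F%8D%8C/%2523').params.id, '%23') // double-encoded '#' is decoded only once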
I guess that can help. Thanks.
Hi, I found a tricky case. Just want to ask what the correct way to deal with it is. How should I set up this route?
Pattern route: /~:param
Input route: /%7E%2523
Expected param: %23
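A small sketch of why the plain pattern fails here, assuming the same decode option as above:

'use strict'
const { match } = require('path-to-regexp')

const handler = match('/~:param', { decode: decodeURIComponent })

// The literal ~ in the pattern never matches its encoded form %7E,
// so the match fails before the parameter is ever decoded.
handler('/%7E%2523')            // => false
handler('/~%2523').params.param // => '%23', the expected value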
This is definitely a very tricky problem. Every character can technically have two (or more) representations, due to the different ways to encode it. I can think of a few ways to work with this. Two initial thoughts:
- A configuration option that encodes, (char: string) => string[], so every character can be represented in the regex
- Guidance/utility that normalizes before passing it into path-to-regexp
Option 1 is simplest, but I'm leaning toward option 2 for performance reasons. I'll mull this over this week, but feel free to add your own thoughts on the topic. Option 2 also helps with other routing libraries, since normalization is a common problem between them all. I also think that, conveniently, the same library could be used to normalize both the input to path-to-regexp and the URL itself.
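To make option 1 concrete, here is a hypothetical sketch (no such option exists in path-to-regexp today) of expanding each literal character into an alternation of its representations:

'use strict'

// Hypothetical (char: string) => string[] mapping; handles only
// single-byte characters, for brevity.
const representations = (char) =>
  [char, '%' + char.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0')]

const escapeRegex = (str) => str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')

// Every literal character becomes an alternation group in the regex.
const literalToSource = (literal) =>
  literal.split('').map((ch) =>
    '(?:' + representations(ch).map(escapeRegex).join('|') + ')').join('')

literalToSource('~')                                 // => '(?:~|%7E)'
new RegExp('^/' + literalToSource('~')).test('/%7E') // => true

The cost is easy to see: every literal character of every route doubles into an alternation group.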
About option 1: every symbol in the path can be represented in ASCII encoding. If you add this option, then to be sure my path matches in 100% of cases, I would need to supply this (char: string) => string[] function for every symbol of every route. And you're right, it would be a significant performance drop.
About option 2: can you give me an example of this "normalization"? Or an example of another routing lib that does it?
nodejs:
new URL('/%7E', 'http://test.c').pathname // /%7E
chrome:
new URL('/%7E', 'http://test.c').pathname // /~
Oh wow, I didn't expect those to act differently; I would have expected both to return ~. I think https://web.dev/urlpattern/ will likely attempt to resolve these differences, but I'm not familiar with any other node.js path matching or router libraries working on this problem.
For option 2, it'd likely be written by hand. Something that has consistent rules for everything, and does the kind of normalization URL is meant to do, plus some extras like collapsing repeated //// into a single slash.
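As a rough sketch of what such a hand-written normalizer might do (an assumed design, not an existing library): decode the triplets for RFC 3986 unreserved characters, uppercase the remaining triplets, and collapse repeated slashes.

'use strict'

// Unreserved characters per RFC 3986: safe to store decoded.
const UNRESERVED = /^[A-Za-z0-9\-._~]$/

function normalizePath (path) {
  return path
    // %7E -> ~ for unreserved characters; %2f -> %2F for everything else.
    .replace(/%[0-9a-fA-F]{2}/g, (triplet) => {
      const char = String.fromCharCode(parseInt(triplet.slice(1), 16))
      return UNRESERVED.test(char) ? char : triplet.toUpperCase()
    })
    // Collapse repeated //// into a single slash.
    .replace(/\/{2,}/g, '/')
}

normalizePath('/%7E%2523') // => '/~%2523'
normalizePath('//a///b')   // => '/a/b'

Run through this first, the tricky case above works: '/~%2523' matches /~:param and the parameter decodes exactly once, to '%23'.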
About URLPattern: you can join the conversation at https://github.com/kenchris/urlpattern-polyfill/issues/93
About writing our own normalization lib: it should normalize but not decode the URL. I'm not sure that it is a good idea.
Yep, exactly. To avoid these issues we'd need to focus on normalizing the format to something that won't be confused. E.g. always go to the encoded format.
To clarify your example, the issue is mostly just the ~ not being encoded by default. It'd work as expected using /%7E:param. So what we'd want to build is just a kind of superset of encodeURI. That way, when you use it both for path-to-regexp and for URL inputs, ~ and %7E end up as %7E and match.
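Concretely, with the pre-encoded literal (same decode option as before):

'use strict'
const { match } = require('path-to-regexp')

// The encoded literal %7E now matches the encoded input,
// and the parameter still decodes exactly once.
const handler = match('/%7E:param', { decode: decodeURIComponent })
handler('/%7E%2523').params.param // => '%23'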
I think that "~" should be the normalized path.
Encoding all non-encoded chars for each request would be expensive.
If you run both inputs through the same "encoder/normalization" it technically shouldn't matter. Trivial sample code would be x.replace(/[^%a-z0-9]/g, x => '%' + x.charCodeAt(0).toString(16)) (pretty sure this'll break, but the example stands).
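For what it's worth, a version of that one-liner with the obvious breakages patched (uppercase hex, zero padding, multi-byte characters) might look like this sketch:

'use strict'

function normalize (str) {
  return str
    // Uppercase existing triplets so %7e and %7E compare equal.
    .replace(/%[0-9a-fA-F]{2}/g, (t) => t.toUpperCase())
    // Percent-encode everything else outside a small safe set,
    // byte by byte so multi-byte UTF-8 characters survive.
    .replace(/[^%A-Za-z0-9]/gu, (ch) =>
      Array.from(new TextEncoder().encode(ch),
        (b) => '%' + b.toString(16).toUpperCase().padStart(2, '0')).join(''))
}

normalize('/~')   // => '%2F%7E'
normalize('/%7E') // => '%2F%7E', the same canonical form
normalize('🍌')   // => '%F0%9F%8D%8C'

It still can't tell a literal % from the start of a triplet, so it inherits that ambiguity from the original sketch.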
> Encoding all non-encoded chars for each request would be expensive.
Not really, but there's also no other solution to what you're asking for.
Another trivial example is the handling of a space: some servers expect + to work, others %20, and this is the kind of thing that could be normalized by a library. Alternatively, you can decide it's not worth handling as-is and just require the client to use ~, which is what all the browsers encode it as.
For example, it's not as if GitHub supports URL-encoding all the random characters in this URL and having it still work. There are only certain normalizations they apply before trying to route. If we're worried about performance, it's better to leave that decision up to people's frameworks or applications.