Double decoding path parameters
Hi, I have a question about decoding path params. I have a URL that contains encoded symbols in both its static and parametric parts. To match the route, I decode the URL using the decodeURI function, but the parameter contains one of the symbols (# $ & + , / : ; = ? @) that decodeURI does not decode; it only decodes with the decodeURIComponent function. If someone encodes one of these symbols twice, it will also be decoded twice. I'm wondering whether there would be a double-decoding problem in this case? https://owasp.org/www-community/Double_Encoding
'use strict'
const { equal } = require('assert')
const { match } = require('path-to-regexp')

function doubleEncode (str) {
  return encodeURIComponent(encodeURIComponent(str))
}

function singleEncode (str) {
  return encodeURIComponent(str)
}

const handler = match('/🍌/:id', { decode: decodeURIComponent })

// decodeURI leaves %23 alone, so decode turns it into '#'.
equal(handler(decodeURI(`/%F0%9F%8D%8C/${singleEncode('#')}`)).params.id, '#') // ok
// decodeURI turns %2523 into %23, then decode turns that into '#': two decodes.
equal(handler(decodeURI(`/%F0%9F%8D%8C/${doubleEncode('#')}`)).params.id, '#') // ok
equal(handler(decodeURI(`/%F0%9F%8D%8C/${doubleEncode('#')}`)).params.id, singleEncode('#')) // fails
There definitely would be if you're using both decode and decoding the input before handling it. In that case you're probably better off encoding the input string to match using encodeURI and keeping the decode logic the same. I haven't thought it through 100%, but it seems logical enough.
Have you considered using { encode: encodeURI, decode: decodeURIComponent }?
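For illustration, a minimal sketch of that combination, assuming a path-to-regexp version (e.g. v6) where match accepts both options: encode is applied to the static parts of the pattern, decode to the matched parameters, and the decodeURI step on the input is skipped entirely.

'use strict'
const { equal } = require('assert')
const { match } = require('path-to-regexp')

// encode turns the literal 🍌 into %F0%9F%8D%8C inside the generated regex,
// so the raw (still-encoded) URL can be matched directly.
const handler = match('/🍌/:id', { encode: encodeURI, decode: decodeURIComponent })

equal(handler('/%F0%9F%8D%8C/%23').params.id, '#')     // single-encoded '#'
equal(handler('/%F0%9F%8D%8C/%2523').params.id, '%23') // double-encoded '#' is decoded only once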
I guess that can help. Thanks.
Hi, I found a tricky case. Just want to ask what the correct way to deal with it is. How should I set up this route?
Pattern route: /~:param
Input route: /%7E%2523
Expected param: %23
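A small sketch of why the plain pattern fails here, assuming the same decode option as above:

'use strict'
const { match } = require('path-to-regexp')

const handler = match('/~:param', { decode: decodeURIComponent })

// The literal ~ in the pattern never matches its encoded form %7E,
// so the match fails before the parameter is ever decoded.
handler('/%7E%2523')            // => false
handler('/~%2523').params.param // => '%23', the expected value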
This is definitely a very tricky problem. Every character can technically have two (or more) representations, due to the different ways to encode it. I can think of a few ways to work with this. Two initial thoughts:
- A configuration option that encodes, (char: string) => string[], so every character can be represented in the regex
- Guidance/utility that normalizes before passing it into path-to-regexp
Option 1 is simplest, but I'm leaning toward option 2 for performance reasons. I'll mull this over this week, but feel free to add your own thoughts on the topic. Option 2 also helps with other routing libraries, since normalization is a common problem between them all. I also think that, conveniently, the same library could be used to normalize both the input to path-to-regexp and the URL itself.
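To make option 1 concrete, here is a hypothetical sketch (no such option exists in path-to-regexp today) of expanding each literal character into an alternation of its representations:

'use strict'

// Hypothetical (char: string) => string[] mapping; handles only
// single-byte characters, for brevity.
const representations = (char) =>
  [char, '%' + char.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0')]

const escapeRegex = (str) => str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')

// Every literal character becomes an alternation group in the regex.
const literalToSource = (literal) =>
  literal.split('').map((ch) =>
    '(?:' + representations(ch).map(escapeRegex).join('|') + ')').join('')

literalToSource('~')                                 // => '(?:~|%7E)'
new RegExp('^/' + literalToSource('~')).test('/%7E') // => true

The cost is easy to see: every literal character of every route doubles into an alternation group.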
About option 1: every symbol in the path can be represented in ASCII encoding. If you add this option, then to be sure my path matches in 100% of cases, I would need to supply this (char: string) => string[] function for every symbol of every route. And you're right, it would be a significant performance drop.
About option 2: can you give me an example of this "normalization"? Or an example of another routing lib that does it?
nodejs:
new URL('/%7E', 'http://test.c').pathname // /%7E
chrome:
new URL('/%7E', 'http://test.c').pathname // /~
Oh wow, I didn't expect those to act differently; I would have expected both to return ~. I think https://web.dev/urlpattern/ will likely attempt to resolve these differences, but I'm not familiar with any other node.js path matching or router libraries working on this problem.
For option 2, it'd likely be written by hand. Something that has consistent rules for everything, and does the kind of normalization URL is meant to do, plus some extras like collapsing repeated //// into a single slash.
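As a rough sketch of what such a hand-written normalizer might do (an assumed design, not an existing library): decode the triplets for RFC 3986 unreserved characters, uppercase the remaining triplets, and collapse repeated slashes.

'use strict'

// Unreserved characters per RFC 3986: safe to store decoded.
const UNRESERVED = /^[A-Za-z0-9\-._~]$/

function normalizePath (path) {
  return path
    // %7E -> ~ for unreserved characters; %2f -> %2F for everything else.
    .replace(/%[0-9a-fA-F]{2}/g, (triplet) => {
      const char = String.fromCharCode(parseInt(triplet.slice(1), 16))
      return UNRESERVED.test(char) ? char : triplet.toUpperCase()
    })
    // Collapse repeated //// into a single slash.
    .replace(/\/{2,}/g, '/')
}

normalizePath('/%7E%2523') // => '/~%2523'
normalizePath('//a///b')   // => '/a/b'

Run through this first, the tricky case above works: '/~%2523' matches /~:param and the parameter decodes exactly once, to '%23'.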
About URLPattern: you can join the conversation at https://github.com/kenchris/urlpattern-polyfill/issues/93
About writing our own normalization lib: it should normalize but not decode the URL. I'm not sure that it is a good idea.
Yep, exactly. To avoid these issues we'd need to focus on normalizing the format to something that won't be confused. E.g. always go to the encoded format.
To clarify your example, the issue is mostly just the ~ not being encoded by default. It'd work as expected using /%7E:param. So what we'd want to build is just a kind of superset of encodeURI. That way, when you use it both for path-to-regexp and for URL inputs, ~ and %7E end up as %7E and match.
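Concretely, with the pre-encoded literal (same decode option as before):

'use strict'
const { match } = require('path-to-regexp')

// The encoded literal %7E now matches the encoded input,
// and the parameter still decodes exactly once.
const handler = match('/%7E:param', { decode: decodeURIComponent })
handler('/%7E%2523').params.param // => '%23'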
I think that "~" should be the normalized path.
Encoding all non-encoded chars for each request would be expensive.
If you run both inputs through the same "encoder/normalization" it technically shouldn't matter. Trivial sample code would be x.replace(/[^%a-z0-9]/g, x => '%' + x.charCodeAt(0).toString(16)) (pretty sure this'll break, but the example stands).
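For what it's worth, a version of that one-liner with the obvious breakages patched (uppercase hex, zero padding, multi-byte characters) might look like this sketch:

'use strict'

function normalize (str) {
  return str
    // Uppercase existing triplets so %7e and %7E compare equal.
    .replace(/%[0-9a-fA-F]{2}/g, (t) => t.toUpperCase())
    // Percent-encode everything else outside a small safe set,
    // byte by byte so multi-byte UTF-8 characters survive.
    .replace(/[^%A-Za-z0-9]/gu, (ch) =>
      Array.from(new TextEncoder().encode(ch),
        (b) => '%' + b.toString(16).toUpperCase().padStart(2, '0')).join(''))
}

normalize('/~')   // => '%2F%7E'
normalize('/%7E') // => '%2F%7E', the same canonical form
normalize('🍌')   // => '%F0%9F%8D%8C'

It still can't tell a literal % from the start of a triplet, so it inherits that ambiguity from the original sketch.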
> Encoding all non-encoded chars for each request would be expensive.
Not really, but there's also no other solution to what you're asking for.
Another trivial example is the handling of a space: some servers expect + to work, others %20, and this is the kind of thing that could be normalized by a library. Alternatively, you can decide it's not worth handling as-is and just require the client to use ~, which is what all the browsers encode it as.
For example, it's not as if GitHub supports URL-encoding all the random characters in this URL and having it still work. There are only certain normalizations they apply before trying to route. If we're worried about performance, it's better to leave that decision up to people's frameworks or applications.