styx Urls with %% characters cause Styx to respond with an HTTP 400

The problem

When Styx encounters requests with urls that contain % characters in them it responds with an HTTP 400. Addings % to unwise chars configuration makes it so that the encoder ends up double encoding already encoded chars.

I have created a commit to add details and context around the issue we are facing, but its unclear which direction should be taken to resolve this issue.

Detailed description

Test cases and comments adding context can be found here: https://github.com/yashvyas/styx/commit/97089f6bf40be1dea231b001aa1b6950db24f274

Styx is able to handle unencoded urls that have % in them appropriately, as shown in the test cases.

In our use case, it is acceptable to not encode the %% encountered in the url (and seems like other Url libraries like the one discussed below do the same thing). However, this is a problem for the Url class in Styx since it uses java.net.uri. The URI syntax specification states the following with regards to % data characters:

Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.

reference: https://tools.ietf.org/html/rfc3986#section-2.4

Styx seems to depend on the configuration and encode any URL it encounters based on the configuration even it was partially encoded.

One potential fix could be to utilize a URL parsing library that can handle modern URL data characters by itself rather than needing the configuration, like https://square.github.io/okhttp/3.x/okhttp/okhttp3/HttpUrl.html which seems to not encode the %% but parse the URL appropriately as seen below:

        String encoded = "https://g.com/abc?%%=XYZ(%27blah%27)=%%";
        String unencoded = "https://g.com/abc?%%=XYZ('blah')=%%";
        HttpUrl encodedUrl = HttpUrl.parse(encoded);
        HttpUrl unencodedUrl = HttpUrl.parse(unencoded);

        System.out.println("\n\nfor encoded url: " + encoded);
        System.out.println("encoded queryStr is:" + encodedUrl.encodedQuery());
        System.out.println("queryStr is:" + encodedUrl.query());
        System.out.println("url str is: "+ encodedUrl.toString());

        System.out.println("\n\nfor unencoded url: " + unencoded);
        System.out.println("encoded queryStr is:" + unencodedUrl.encodedQuery());
        System.out.println("queryStr is:" + unencodedUrl.query());
        System.out.println("url str is: "+ unencodedUrl.toString());

Output:

for encoded url: https://g.com/abc?%%=XYZ(%27blah%27)=%%
encoded queryStr is:%%=XYZ(%27blah%27)=%%
queryStr is:%%=XYZ('blah')=%%
url str is: https://g.com/abc?%%=XYZ(%27blah%27)=%%


for unencoded url: https://g.com/abc?%%=XYZ('blah')=%%
encoded queryStr is:%%=XYZ(%27blah%27)=%%
queryStr is:%%=XYZ('blah')=%%
url str is: https://g.com/abc?%%=XYZ(%27blah%27)=%%

Not sure if the dependency on this library would be desirable. Additionally styx.api.Url seems to be somewhat analogous to this class. However, the okhttp implementation does not handle the case where a url has no authority which the Styx implementation does.

Another way would be to let users supply the encoding implementation, or Styx use an implementation that can handle these cases. Finally, its not clear from the RFC or the HttpUrl implementation above whether the %% in the query should actually be encoded. What we know is that the url with query of type "%%=XYZ(%27blah%27)=%%" does not cause issues with upstream consumers. Our current fix involes a combination of java URI and HttpUrl. To handle the cases where no authority is present and URI does not handle the parsing we add a localhost authority and use HttpUrl to parse the string url.

Acceptance criteria

Styx is able to accept urls with %% in them

Mar 23 '19 00:03 yashvyas

As a short summary, Styx Url.Builder uses java.net.URI to parse URLs to their subcomponents (scheme, path, query, etc). It is the URI.create() method that rejects URLs with double percentages (%%).

A request is rejected in NettyToStyxRequestDecoder when it attempts to convert a Netty HTTP object to a Styx representation in makeAStyxRequestFrom method:

    Url url = url(unwiseCharEncoder.encode(request.uri()))
            .build();

Which in turn calls:

    public static Builder url(String value) {
        URI uri = URI.create(value);     <---- The "%%"s are rejected here.
        return new Builder()
                .scheme(uri.getScheme())
                .authority(uri.getAuthority())
                .path(uri.getRawPath())
                .rawQuery(uri.getRawQuery())
                .fragment(uri.getFragment());
    }

Mar 25 '19 10:03 mikkokar

My impression is that the spec (chapter 2.4) prohibits double percentages (%%) unless encoded to %25%25. In this regard Styx behaviour is correct, although I understand your requirement for allowing them.

Additionally, Styx must assume an URLs are already percent-encoded. Otherwise it would be impossible to parse them in their constituent components. For this reason Styx couldn't just encode URLs again: encoding would be applied twice.

Maybe we could introduce an "unwise character sequences" configuration where the encoding is applied for the full character sequence instead of individual characters? Or even better, we could introduce a regexp-based URL replacement facility that is applied before Styx HttpRequest object is built. This would allow you to replace %% to %25%25s before URI.create is called. Would this work for you @yashvyas ?

Mar 26 '19 16:03 mikkokar

In this particular case it seems like the url comes in partially encoded, so "'" is already encoded and not an unwise char that we configure but %% is not. Styx doesn't check whether URL is encoded or not before attempting the unwise char encoding right? And the char based encoding results in the double encoding if urls are partially encoded, but works fine if urls are not encoded it seems like. For instance:

ConfigurableUnwiseCharsEncoder encoder = new ConfigurableUnwiseCharsEncoder(newStyxConfig("%,|"), logger);
        String urlWithUnwiseChars = "/user/abc?%%=XYZ(|blah|)=%%";
        assertThat(encoder.encode(urlWithUnwiseChars), is("/user/abc?%25%25=XYZ(%7Cblah%7C)=%25%25"));

works fine with the config, but the below failes because the %27 becomes %25%27.

      ConfigurableUnwiseCharsEncoder encoder = new ConfigurableUnwiseCharsEncoder(newStyxConfig("%"), logger);
        String urlWithUnwiseChars = "/user/abc?%%=XYZ(%27blah%27)=%%";
        assertThat(encoder.encode(urlWithUnwiseChars), is("/user/abc?%%=XYZ(%27blah%27)=%%"));
        verify(logger).warn("Value contains unwise chars. you should fix this. raw={}, escaped={}: ",
                urlWithUnwiseChars,
                "/user/abc?%%=XYZ(%27blah%27)=%%");

Seems like we are also dealing with partially encoded Urls though so second happens to be the case we encounter. Regex based seems like it would work well for the specific case of %%, but if partially encoded urls come in with a % then would that be handled as well? Because encoded chars would match and i'm not sure if there is a valid regex to match only the hex encoded chars, for example, what happens when there is a url with a %, %2Abacus (%2A is the encoding for *), and %27?

if we could make the encoder pluggable, something that is provided but can also be supplied then maybe we could use the regex based since our case is only for %%, but in general not sure if that would work for everyone. Also, then we could just implement ours and optionally submit to OSS Styx as an option if it seems like a useful alternative in general.

Mar 27 '19 14:03 yashvyas

Hi @yashvyas

Styx doesn't check whether URL is encoded or not before attempting the unwise char encoding right?

Browsers and other clients must send URLs in percent-encoded form. Otherwise the receiver (styx) wouldn't know how to parse ambiguous URLs into their constituent components. The same URL, if not percent encoded, could have multiple different ways of parsing it. For this reason Styx can only assume URLs are already encoded.

The specific problem in this case is %% character sequence. The spec doesn't seem to allow that. Understandably you need to pass these URLs through. Perhaps the applications are using %% to escape the percent character itself?

Regex based seems like it would work well for the specific case of %%, but if partially encoded urls come in with a % then would that be handled as well?

Clients/senders should encode URLs to the extent that Styx (and URL.create) can parse them properly. As long as we deal with %% then perhaps URL.create will work just fine? Do you have specific examples where this would fall apart?

if we could make the encoder pluggable, something that is provided but can also be supplied then maybe we could use the regex based since our case is only for %%, but in general not sure if that would work for everyone. Also, then we could just implement ours and optionally submit to OSS Styx as an option if it seems like a useful alternative in general.

Support and bug fixing are my particular concerns about pluggable model. Investigating and interoperability issues can get tricky when there is 3rd party code involved.

I would start with simpler and less intrusive model like regex based, and introduce additional interfaces as needed basis. Like possibly opening a new builder interface in NettyToStyxRequestDecoder for injecting an URL encoder class, etc. But ideally avoiding a new SPI interface if possible.

Mar 29 '19 09:03 mikkokar

Hi @mikkokar this got deprioritized since we tabled the migration for a bit. As we come close to using Styx 1.0 we will take a look at this. T Thank you for your inputs, I agree that a simpler/less intrusive approach like the one you suggested might be best to resolve this particular issue.

With the pluggable model, I meant that Styx could pick up the UnwiseCharEncoder from an impl. provided by apps using Styx (if needed) and the impl would be defined and maintained in the app code base rather than Styx.

Regarding URL.create, the issue seems to be that the library is too outdated and doesn't support newer valid URL char sequences. Hence Styx provides the UnwiseCharEncoder. While %% is indeed an invalid sequence it should be encoded as %25%25 as per the spec, (referencing item 2.4):

Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI.

Regarding examples, I can't come up with any that will fall apart given a regex based approach. In fact after revisiting it seems like an impl like RegexBasedUnwiseCharacterEncoder (impl. of ConfigurableUnwiseCharEncoder) might be the best way without needed to depend on OkHttp and other third party libs.

Thank you! We will engage you guys further as we get close to executing the upgrade.

Jul 10 '19 20:07 exp-yashvyas

This was deprioritized as part of the original migration, we are still tracking it though and will tackle it when we are ready to move to styx1.0 without using a fork.

Feb 18 '20 16:02 exp-yashvyas

Closing all issues over 3 years old. New issues can be created if problems are still occurring.

Jan 11 '24 11:01 kvosper