esapi-java-legacy PercentCodec Doesn't Handle UTF-8 percent encoding

//Place in EncoderTest
    public void testESAPIPercentEncoding() {
        String input = "%E2%84%A2";
        String expected = "™";
        Encoder e = ESAPI.encoder();
        assertEquals(expected, e.canonicalize(input));
    }

It seems that the uri encoder is defaulting to a non-UTF-8 decoding, and I don't see anyway to control this behavior when searching ESAPI.properties

(This migrated from an issue I accidently opened on the newer 3.0 project.)

May 20 '16 16:05 xeno6696

@kwwall -- I think to solve this problem, the only solution would be to create a fork of the PercentCodec that is specialized to UTF-8, and then have users specify the UTF-8 codec on demand.

The only problem is that UTF-8 would probably be the better one to use by default since it's a web standard.

Thoughts?

Jul 17 '17 16:07 xeno6696

Why not make a new class, say DefaultUTF8Encoder. If people want to use that one, then can just specify something like:

ESAPI.Encoder=org.owasp.esapi.reference.DefaultUTF8Encoder

in their ESAPI.properties file. That's more or less how it's designed to work. Also, if you change the CTOR(s) in DefaultEncoder from private to protected, you probably can subclass DefaultEncoder and inherit a lot of its desired behavior.

Make sense? If not, it's probably because I'm misunderstanding something.

Note that also if you think it makes more sense to make the UTF8 encoder the default one, we can change the ESAPI.properties files to reflect that as well.

-kevin

On Mon, Jul 17, 2017 at 1:04 PM, Matt Seil [email protected] wrote:

@kwwall https://github.com/kwwall -- I think to solve this problem, the only solution would be to create a fork of the PercentCodec that is specialized to UTF-8, and then have users specify the UTF-8 codec on demand.

The only problem is that UTF-8 would probably be the better one to use by default since it's a web standard.

Thoughts?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ESAPI/esapi-java-legacy/issues/377#issuecomment-315814214, or mute the thread https://github.com/notifications/unsubscribe-auth/AB3nmwlx3UxMGNNDWmdgU3V2RYdYAl3Zks5sO5JzgaJpZM4IjW-r .

-- Blog: http://off-the-wall-security.blogspot.com/ | Twitter: @KevinWWall NSA: All your crypto bit are belong to us.

Jul 18 '17 05:07 kwwall

My concern about UTF-8 would be how many people would complain on the upgrade.

That said, it would help spread the word that folks ought to be "all utf-8" at least as far as web app encodings.

On Mon, Jul 17, 2017, 22:13 Kevin W. Wall [email protected] wrote:

Why not make a new class, say DefaultUTF8Encoder. If people want to use that one, then can just specify something like:

ESAPI.Encoder=org.owasp.esapi.reference.DefaultUTF8Encoder

in their ESAPI.properties file. That's more or less how it's designed to work. Also, if you change the CTOR(s) in DefaultEncoder from private to protected, you probably can subclass DefaultEncoder and inherit a lot of its desired behavior.

Make sense? If not, it's probably because I'm misunderstanding something.

Note that also if you think it makes more sense to make the UTF8 encoder the default one, we can change the ESAPI.properties files to reflect that as well.

-kevin

On Mon, Jul 17, 2017 at 1:04 PM, Matt Seil [email protected] wrote:

@kwwall https://github.com/kwwall -- I think to solve this problem, the only solution would be to create a fork of the PercentCodec that is specialized to UTF-8, and then have users specify the UTF-8 codec on demand.

The only problem is that UTF-8 would probably be the better one to use by default since it's a web standard.

Thoughts?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub < https://github.com/ESAPI/esapi-java-legacy/issues/377#issuecomment-315814214 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AB3nmwlx3UxMGNNDWmdgU3V2RYdYAl3Zks5sO5JzgaJpZM4IjW-r

.

-- Blog: http://off-the-wall-security.blogspot.com/ | Twitter: @KevinWWall NSA: All your crypto bit are belong to us.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/ESAPI/esapi-java-legacy/issues/377#issuecomment-315960047, or mute the thread https://github.com/notifications/unsubscribe-auth/AJEAQSAnvXkKdCaEUrCaTSl1JSID6PGgks5sPD8UgaJpZM4IjW-r .

Jul 18 '17 05:07 xeno6696

Oh yeah, and a new encoder is exactly what I had in mind, with that kind of config.

Jul 19 '17 16:07 xeno6696

If concerned about people complaining about the upgrade to the UTF-8 encoder, we could leave the default as is (DefaultEncoder) and just mention the alternative (DefaultUTF8Encoder, or whatever you want to name it)in a comment in ESAPI.properties just before the ESAPI.Encoder property.

Jul 20 '17 01:07 kwwall

@kwwall turns out this is non-trivial as well. @jeremiahjstacey I'm having a mental block right now. I'm not sure how to properly differentiate UTF-8 encoded strings... like %E2 is legal in the ASCII range, but how to tell the difference between %E2 and %E2%84%A2?

Sep 01 '19 16:09 xeno6696

Isn't the usual approach to go with the longest match?

-kevin

Blog: http://off-the-wall-security.blogspot.com/ | Twitter: @KevinWWall NSA: All your crypto bit are belong to us.

On Sun, Sep 1, 2019, 12:49 Matt Seil [email protected] wrote:

@kwwall https://github.com/kwwall turns out this is non-trivial as well. @jeremiahjstacey https://github.com/jeremiahjstacey I'm having a mental block right now. I'm not sure how to properly differentiate UTF-8 encoded strings... like %E2 is legal in the ASCII range, but how to tell the difference between %E2 and %E2%84%A2?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ESAPI/esapi-java-legacy/issues/377?email_source=notifications&email_token=AAO6PG3XYFGREY4UVJF4D33QHPXAHA5CNFSM4CENN6V2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5UGMCY#issuecomment-526935563, or mute the thread https://github.com/notifications/unsubscribe-auth/AAO6PGZPJAPT2G74GRS3AGTQHPXAHANCNFSM4CENN6VQ .

Sep 01 '19 16:09 kwwall

@xeno6696 without digging into it too much I'd suggest checking what the PushbackString would give you in a test case. I have it in my head that it defaults to the shortest group, but it's been a while since I've looked at it.

Sep 01 '19 17:09 jeremiahjstacey

@jeremiahjstacey is correct, the PBS defaults to the shortest match.

Sep 01 '19 18:09 xeno6696

Works for me. I think I was thinking of regex matching.

-kevin

Blog: http://off-the-wall-security.blogspot.com/ | Twitter: @KevinWWall NSA: All your crypto bit are belong to us.

On Sun, Sep 1, 2019, 14:30 Matt Seil [email protected] wrote:

@jeremiahjstacey https://github.com/jeremiahjstacey is correct, the PBS defaults to the shortest match.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ESAPI/esapi-java-legacy/issues/377?email_source=notifications&email_token=AAO6PGYJG3NG6XJ55CWNGY3QHQC3JA5CNFSM4CENN6V2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5UICIQ#issuecomment-526942498, or mute the thread https://github.com/notifications/unsubscribe-auth/AAO6PG443ADEUOMSEIQCMGDQHQC3JANCNFSM4CENN6VQ .

Sep 01 '19 18:09 kwwall

I could have sworn I saw a post in here asking about the percent codec...

Oct 15 '20 05:10 xeno6696

@xeno6696 - You did. I still have it in my inbox. It was from Alessandro Giannone and said "Any chance the percent encoding is working with UTF-8?"

Perhaps he deleted it when he added comment to (closed) issue #303?? See https://github.com/ESAPI/esapi-java-legacy/issues/303#issuecomment-708700394 for details. Or perhaps he wasn't using a patched version of ESAPI to test it???

@agiannone -- Is there a reason you deleted your comment from this issue (Issue #377)?

Oct 15 '20 05:10 kwwall

@xeno6696 @kwwall - Thanks for the super prompt response!

After reading through the issues list and doing a bit of studying on the subject, I actually ended up moving my comment to #303 referencing #300 both of which seemed more applicable.

Effectively non-BMP characters use 3 pairs of hex digits to store the character. The PercentCodec evaluates them 1 pair at a time.

Oct 15 '20 12:10 agiannone

Since this is still open, I wanted to toss out what seems like a useful resource, at least to me:

UTR#36: Unicode Security Considerations

Sep 02 '22 21:09 kwwall

esapi-java-legacy esapi-java-legacy copied to clipboard

PercentCodec Doesn't Handle UTF-8 percent encoding

-kevin

-kevin

esapi-java-legacy
esapi-java-legacy copied to clipboard