antisamy fix(#456): fix default charset problem

the fix of this issue

#456

May 31 '24 07:05 GodMeowIceSun

@spassarop - Have you looked at this issue/proposed fix yet? If there is a real problem here it would be nice to fix this and include it in the release we are close to getting done related to the neko-html fix.

Jun 03 '24 15:06 davewichers

ISO-8859-1 is the default encoding for Windows, but just about every other modern OS uses UTF-8. It can especially bite you in the butt when using String.getBytes(). I recommend changing it to use UTF-8.

See https://stackoverflow.com/questions/7048745/what-is-the-difference-between-utf-8-and-iso-8859-1#7048774 for details of how they are different.

-kevin

On Wed, Jun 5, 2024, 11:52 PM GodMeowIceSun @.***> wrote:

@GodMeowIceSun https://github.com/GodMeowIceSun requested your review on: #457 https://github.com/nahsra/antisamy/pull/457 fix(#456 https://github.com/nahsra/antisamy/issues/456): fix default charset problem.

— Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/pull/457#event-13060505901, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAO6PG4LLTCSVCS2OOT3FNTZF7MG3AVCNFSM6AAAAABISIQENCVHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJTGA3DANJQGU4TAMI . You are receiving this because your review was requested.Message ID: @.***>

Jun 06 '24 05:06 kwwall

I agree with Kevin. Probably the default Windows encoding was used on policy XML because the content of that file was unlikely to have characters beyond that encoding.

I am not sure why there were error messages in Chinese but were not tested too see if the text would show correctly. I mean, if the display was tested it would look obviously broken.

Probably if an environment does not support UTF-8, AntiSamy usage should not be the main issue.

Then, I share Kevin suggestion to update the encoding to UTF-8 on this code change and test it.

On Thu, 6 Jun 2024 at 02:23 Kevin W. Wall @.***> wrote:

ISO-8859-1 is the default encoding for Windows, but just about every other modern OS uses UTF-8. It can especially bite you in the butt when using String.getBytes(). I recommend changing it to use UTF-8.

See

https://stackoverflow.com/questions/7048745/what-is-the-difference-between-utf-8-and-iso-8859-1#7048774 for details of how they are different.

-kevin

On Wed, Jun 5, 2024, 11:52 PM GodMeowIceSun @.***> wrote:

@GodMeowIceSun https://github.com/GodMeowIceSun requested your review on: #457 https://github.com/nahsra/antisamy/pull/457 fix(#456 https://github.com/nahsra/antisamy/issues/456): fix default charset problem.

— Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/pull/457#event-13060505901, or unsubscribe < https://github.com/notifications/unsubscribe-auth/AAO6PG4LLTCSVCS2OOT3FNTZF7MG3AVCNFSM6AAAAABISIQENCVHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJTGA3DANJQGU4TAMI>

. You are receiving this because your review was requested.Message ID: @.***>

— Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/pull/457#issuecomment-2151435228, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHL3BMPHIOCU4ZAWHBAEX73ZF7W35AVCNFSM6AAAAABISIQENCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNJRGQZTKMRSHA . You are receiving this because you were mentioned.Message ID: @.***>

Jun 06 '24 11:06 spassarop

I agree with Kevin. Probably the default Windows encoding was used on policy XML because the content of that file was unlikely to have characters beyond that encoding.

I am not sure why there were error messages in Chinese but were not tested too see if the text would show correctly. I mean, if the display was tested it would look obviously broken.

Probably if an environment does not support UTF-8, AntiSamy usage should not be the main issue.

Then, I share Kevin suggestion to update the encoding to UTF-8 on this code change and test it.

On Thu, 6 Jun 2024 at 02:23 Kevin W. Wall @.***> wrote:

ISO-8859-1 is the default encoding for Windows, but just about every other modern OS uses UTF-8. It can especially bite you in the butt when using String.getBytes(). I recommend changing it to use UTF-8.

See

https://stackoverflow.com/questions/7048745/what-is-the-difference-between-utf-8-and-iso-8859-1#7048774 for details of how they are different.

-kevin

On Wed, Jun 5, 2024, 11:52 PM GodMeowIceSun @.***> wrote:

@GodMeowIceSun https://github.com/GodMeowIceSun requested your review on: #457 https://github.com/nahsra/antisamy/pull/457 fix(#456 https://github.com/nahsra/antisamy/issues/456): fix default charset problem.

— Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/pull/457#event-13060505901, or unsubscribe < https://github.com/notifications/unsubscribe-auth/AAO6PG4LLTCSVCS2OOT3FNTZF7MG3AVCNFSM6AAAAABISIQENCVHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJTGA3DANJQGU4TAMI>

. You are receiving this because your review was requested.Message ID: @.***>

— Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/pull/457#issuecomment-2151435228, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHL3BMPHIOCU4ZAWHBAEX73ZF7W35AVCNFSM6AAAAABISIQENCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNJRGQZTKMRSHA . You are receiving this because you were mentioned.Message ID: @.***>

Perhaps you can take a look at the document I referenced, which proves in reverse that the default encoding of ResourceBundle in the original implementation of Jdk8 was ISO-8859-1, while our project is compatible with Java 8+, and the internationalization method used is ResourceBundle. Perhaps this is the core reason for the problem. I don't think directly modifying it to UTF-8 can solve the problem. Perhaps different versions of JDK need to be treated differently or the Java version should use ISO-8859-1 as the default encoding, which is also a temporary solution

oracle jdk9 doc

Jun 06 '24 11:06 GodMeowIceSun

@GodMeowIceSun wrote:

Perhaps you can take a look at the document I referenced, which proves in reverse that the default encoding of ResourceBundle in the original implementation of Jdk8 was ISO-8859-1, while our project is compatible with Java 8+, and the internationalization method used is ResourceBundle. Perhaps this is the core reason for the problem. I don't think directly modifying it to UTF-8 can solve the problem. Perhaps different versions of JDK need to be treated differently or the Java version should use ISO-8859-1 as the default encoding, which is also a temporary solution

oracle jdk9 doc

I did read through this. I think the worst case scenario here is that someone may have to convert the ResourceBundle from ISO-8859-1 to UTF-8. But every Java instance that I've ever worked with, dating back to JDK 1.1, has supported UTF-8, so I don't think this will be a problem, and it's only a 1 line change to check it. If it's a problem, we can convert the encoding for the ResourceBundle. But if it's all in ASCII, we shouldn't even need to do that. I certainly don't think a different minimal JDK version needs to be updated to handle this though.

Jun 06 '24 16:06 kwwall

I'm not recommending it, but I suppose if you really wanted to, you could tweak the code to use a character set based on a system property and just have it default to UTF-8 if not specified. The one thing I do know if UTF-8 will always be available, regardless of whatever language preferences. I suspect that ISO-8859-1 usually would be, but I'm not sure that is 100% the case if you were using Chinese or Kanji, etc. But UTF-8 is used for other internal encodings. E.g., it's often used with cryptography related code, otherwise going between Windows and *nix often causes problems. For example if shows up in these classes:

$ cd $JAVA_HOME
$ grep -r '"UTF-8"' src | grep /crypto/
com/sun/crypto/provider/PBKDF2KeyImpl.java: *    to bytes using UTF-8 character encoding.
com/sun/crypto/provider/PBKDF2KeyImpl.java:        Charset utf8 = Charset.forName("UTF-8");
sun/security/krb5/internal/crypto/dk/AesDkCrypto.java:            saltUtf8 = salt.getBytes("UTF-8");
sun/security/krb5/internal/crypto/dk/DkCrypto.java:// String.getBytes("UTF-8");
sun/security/krb5/internal/crypto/dk/DkCrypto.java:        Charset utf8 = Charset.forName("UTF-8");
javax/crypto/CryptoPermissions.java:        parser.read(new BufferedReader(new InputStreamReader(in, "UTF-8")));

I think it is used when stuff needs to be serialized / deserialized so it will work across operating systems, but I can't swear to that as it's been 15-20 years since I looked at that code.

That said, ISO-8859-1 would almost certainly work, but it's just the natural choice that UTF-8 is. Just my $.02 though. In ESAPI, we use UTF-8 for everything, but this is your choice. I just think UTF-8 is more neutral whereas ISO-8859-1 has a Western European slant.

Jun 16 '24 21:06 kwwall

@kwwall - Do you suggest I close this as won't fix?

Nov 18 '24 18:11 davewichers

@kwwall - Do you suggest I close this as won't fix?

I suppose the correct thing to do is for someone to correct all the AntiSamy properties files and XML files as per the Oracle JDK 9 document that @GodMeowIceSun referenced, and use UTF-8 instead of ISO-8859-1. (And if that doesn't work, use a system property like I suggested earlier.)

Ideally, that someone who would change this would be the one who has submitted the issue, but that doesn't seem to be happening here. However, it seems likely that others are impacted. If I were you, I'd give the PR submitter another chance to make these changes. Set a deadline and if it's not met close this PR. But I don't think I would close issue #456 as the issue seems legitimate. If you end up closing this PR because some date is past, then I probably would slap a 'Help Wanted' label and/or 'Good First Issue' label on the issue and set the priority to Low and just leave it open.

Nov 19 '24 04:11 kwwall

@GodMeowIceSun - If you can implement this PR per @kwwall description, we can potentially merge those changes in. But as is, we cannot. Do you think you can make the changes as he describes?

Nov 19 '24 16:11 davewichers