monkey2 icon indicating copy to clipboard operation
monkey2 copied to clipboard

Case conversion functions don't work with non-ASCII characters

Open sgg-labs opened this issue 7 years ago • 15 comments

Tested on Windows 10 desktop target:

Print "Привет".ToUpper()
Print "Привет".ToLower()
Print "привет".Capitalize()
Print "ÄÄ".ToLower()
Print "ää".ToUpper()
Print "ää".Capitalize()

output:

Привет
Привет
привет
ÄÄ
ää
ää

sgg-labs avatar Mar 15 '18 02:03 sgg-labs

On Mac OS this is solved by setting codepage to en_US.UTF-8 before calling std::toupper() or std::tolower()

#include <clocale>
...
const char *lcc = std::setlocale(LC_CTYPE, "en_US.UTF-8");
if (lcc) {
	bb_print(bbString(lcc));
} else {
	bb_print(bbString(L"NULL",4));
}

Windows is a different beast: setlocale returns NULL for this codepage value. Any other valid (for Windows) settings like "" or "English" lead to case conversion working only partially - Cyrillic characters are converted to something unrecognizable.

sgg-labs avatar Mar 20 '18 03:03 sgg-labs

This seems to work on windows (msvc x64 anyway...):

bbString bbString::toUpper()const{
std::setlocale( LC_ALL,"English" );   //this seems to be the correct
incantation....
Rep *rep=Rep::alloc( length() );
for( int i=0;i<length();++i ) rep->data[i]=std::towupper( data()[i] );
 //note: using std::towupper.
return rep;
}

You'll also need these at the top of the file:

#include <cwctype>
#include <clocale>

The cyrillic looks OK to me but I can't be sure - here's app output:

ПРИВЕТ
привет
Привет
ää
ÄÄ
Ää

It's OK to use bbString chars directly with towupper (or std::towupper) so you don't need to realloc the string - bbString chars are effectively wchar_t's.

On Tue, Mar 20, 2018 at 4:47 PM, secondgeargames [email protected] wrote:

On Mac OS this is solved by setting codepage to en_US.UTF-8 before calling std::toupper()

const char *lcc = std::setlocale(LC_CTYPE, "en_US.UTF-8"); if (lcc) { bb_print(bbString(lcc)); } else { bb_print(bbString(L"NULL",4)); }

Windows is a different beast: set locale returns NULL for this codepage value. Any other valid (for Windows) settings like "" or "English" lead to case conversion working only partially - Cyrillic characters are converted to something unrecognizable.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/blitz-research/monkey2/issues/355#issuecomment-374464046, or mute the thread https://github.com/notifications/unsubscribe-auth/ADU3QsWP5R_t0yQTJ98lSfNxfwPAUVGLks5tgHu2gaJpZM4SretL .

blitz-research avatar Mar 20 '18 04:03 blitz-research

::towupper works here too.

Sure it isn't your theme affecting cyrillic output? I'm using 'hollow'...

On Tue, Mar 20, 2018 at 5:09 PM, Mark Sibly [email protected] wrote:

This seems to work on windows (msvc x64 anyway...):

bbString bbString::toUpper()const{
std::setlocale( LC_ALL,"English" );   //this seems to be the correct
incantation....
Rep *rep=Rep::alloc( length() );
for( int i=0;i<length();++i ) rep->data[i]=std::towupper( data()[i] );
 //note: using std::towupper.
return rep;
}

You'll also need these at the top of the file:

#include <cwctype>
#include <clocale>

The cyrillic looks OK to me but I can't be sure - here's app output:

ПРИВЕТ
привет
Привет
ää
ÄÄ
Ää

It's OK to use bbString chars directly with towupper (or std::towupper) so you don't need to realloc the string - bbString chars are effectively wchar_t's.

On Tue, Mar 20, 2018 at 4:47 PM, secondgeargames <[email protected]

wrote:

On Mac OS this is solved by setting codepage to en_US.UTF-8 before calling std::toupper()

const char *lcc = std::setlocale(LC_CTYPE, "en_US.UTF-8"); if (lcc) { bb_print(bbString(lcc)); } else { bb_print(bbString(L"NULL",4)); }

Windows is a different beast: set locale returns NULL for this codepage value. Any other valid (for Windows) settings like "" or "English" lead to case conversion working only partially - Cyrillic characters are converted to something unrecognizable.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/blitz-research/monkey2/issues/355#issuecomment-374464046, or mute the thread https://github.com/notifications/unsubscribe-auth/ADU3QsWP5R_t0yQTJ98lSfNxfwPAUVGLks5tgHu2gaJpZM4SretL .

blitz-research avatar Mar 20 '18 04:03 blitz-research

Seems to work with mingw and msvc x86 too!

On Tue, Mar 20, 2018 at 5:11 PM, Mark Sibly [email protected] wrote:

::towupper works here too.

Sure it isn't your theme affecting cyrillic output? I'm using 'hollow'...

On Tue, Mar 20, 2018 at 5:09 PM, Mark Sibly [email protected] wrote:

This seems to work on windows (msvc x64 anyway...):

bbString bbString::toUpper()const{
std::setlocale( LC_ALL,"English" );   //this seems to be the correct
incantation....
Rep *rep=Rep::alloc( length() );
for( int i=0;i<length();++i ) rep->data[i]=std::towupper( data()[i] );
 //note: using std::towupper.
return rep;
}

You'll also need these at the top of the file:

#include <cwctype>
#include <clocale>

The cyrillic looks OK to me but I can't be sure - here's app output:

ПРИВЕТ
привет
Привет
ää
ÄÄ
Ää

It's OK to use bbString chars directly with towupper (or std::towupper) so you don't need to realloc the string - bbString chars are effectively wchar_t's.

On Tue, Mar 20, 2018 at 4:47 PM, secondgeargames < [email protected]> wrote:

On Mac OS this is solved by setting codepage to en_US.UTF-8 before calling std::toupper()

const char *lcc = std::setlocale(LC_CTYPE, "en_US.UTF-8");
if (lcc) {
	bb_print(bbString(lcc));
} else {
	bb_print(bbString(L"NULL",4));
}

Windows is a different beast: set locale returns NULL for this codepage value. Any other valid (for Windows) settings like "" or "English" lead to case conversion working only partially - Cyrillic characters are converted to something unrecognizable.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/blitz-research/monkey2/issues/355#issuecomment-374464046, or mute the thread https://github.com/notifications/unsubscribe-auth/ADU3QsWP5R_t0yQTJ98lSfNxfwPAUVGLks5tgHu2gaJpZM4SretL .

blitz-research avatar Mar 20 '18 04:03 blitz-research

Just checked on Windows: std::towupper with "English" worked (msvc x86)! Thank you Mark! Has nothing to do with theme - I'm using Prime-Blue.

sgg-labs avatar Mar 20 '18 04:03 sgg-labs

2 targets down, feel free to add more!

On Tue, Mar 20, 2018 at 5:21 PM, secondgeargames [email protected] wrote:

Just checked on Windows: std::towupper with "English" worked (msvc x86)! Thank you Mark!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/blitz-research/monkey2/issues/355#issuecomment-374469792, or mute the thread https://github.com/notifications/unsubscribe-auth/ADU3QimL-y2jDj0MVXwV-mmbPNeV0c3Dks5tgIPjgaJpZM4SretL .

blitz-research avatar Mar 20 '18 04:03 blitz-research

On Android (emulator) this doesn't work:

#include <cwctype>
#include <clocale>
...
std::setlocale( LC_CTYPE,"en_US.UTF-8" );
Rep *rep=Rep::alloc( length() );
for( int i=0;i<length();++i ) rep->data[i]=std::towupper( data()[i] );
return rep;
...

although setlocale returns "C.UTF-8". Changing LC_CTYPE to LC_ALL has no effect either.

sgg-labs avatar Mar 20 '18 05:03 sgg-labs

The above code works on both macOS and iOS (simulator). It didn't work on Android (emulator and device) and I don't have a Linux machine set up to test the Linux target.

I was curious how it is solved in Godot. It seems they just built a lookup table of upper and lower case characters (some 650 or so of them) and use it for case conversion. Guaranteed to be cross-platform :)

Are there any functions that would work with your utf8 encoded CStrings?

sgg-labs avatar Mar 20 '18 15:03 sgg-labs

According to this, we'll need to use Java code for ToUpper/ToLower/Capitalize - yuck:

https://issuetracker.google.com/issues/36974962

On Wed, Mar 21, 2018 at 4:39 AM, secondgeargames [email protected] wrote:

The above code works on both macOS and iOS (simulator). It didn't work on Android (emulator on Windows) and I don't have a Linux machine set up tp test Linux target.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/blitz-research/monkey2/issues/355#issuecomment-374644858, or mute the thread https://github.com/notifications/unsubscribe-auth/ADU3QkY_OZNQg7fVVy8896--4prfRLYSks5tgSLBgaJpZM4SretL .

blitz-research avatar Mar 20 '18 21:03 blitz-research

Ok, attempt one at android version complete - please give it a whirl and also check other versions, I'll do linux now.

On Wed, Mar 21, 2018 at 10:05 AM, Mark Sibly [email protected] wrote:

According to this, we'll need to use Java code for ToUpper/ToLower/Capitalize - yuck:

https://issuetracker.google.com/issues/36974962

On Wed, Mar 21, 2018 at 4:39 AM, secondgeargames <[email protected]

wrote:

The above code works on both macOS and iOS (simulator). It didn't work on Android (emulator on Windows) and I don't have a Linux machine set up tp test Linux target.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/blitz-research/monkey2/issues/355#issuecomment-374644858, or mute the thread https://github.com/notifications/unsubscribe-auth/ADU3QkY_OZNQg7fVVy8896--4prfRLYSks5tgSLBgaJpZM4SretL .

blitz-research avatar Mar 20 '18 22:03 blitz-research

It appears to work fine in:

  • Android (compiled on Windows, tested on both emulator and device)
  • Windows 10 (msvc and gcc x86)
  • MacOS and iOS (simulator)

I agree with "yuck". It would be much easier if c++ had a consistent set of cross-platform standard libraries for such essential things. Sorry it turned out more trouble than I expected.

sgg-labs avatar Mar 21 '18 02:03 sgg-labs

Sorry it turned out more trouble than I expected.

No problemo, had to be done and we've got a Monkey2Lang.java file in there now which'll make future problems like this easier to deal with.

I can kind of see the problem - they've really got to provide 2 implementations of this stuff, it's probably not worth it. What I'm mainly surprised by is that ToLower and ToUpper can produce different results in different locale? I would have guessed that, working with unicode at least, they would just do the same thing everywhere...

On Wed, Mar 21, 2018 at 3:28 PM, secondgeargames [email protected] wrote:

It appears to work fine in:

  • Android (compiled on Windows, tested on both emulator and device)
  • Windows 10 (msvc and gcc x86)
  • MacOS

I agree with "yuck". It would be a better world if c++ had a consistent set of cross-platform standard libraries for such essential things. Sorry it turned out more trouble than I expected.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/blitz-research/monkey2/issues/355#issuecomment-374818092, or mute the thread https://github.com/notifications/unsubscribe-auth/ADU3QtVnFiiZuB-gslcckOrm1koAlz2Cks5tgbrMgaJpZM4SretL .

blitz-research avatar Mar 21 '18 21:03 blitz-research

Found one edge case. I don't ask you to fix it, I'm just writing about it here for future reference. German eszett is capitalized differently on different platforms: Windows: Straße becomes STRAßE Android: Straße becomes STRASSE I found the same behavior in Monkey-X, and I just worked around it.

sgg-labs avatar Mar 22 '18 13:03 sgg-labs

Just out of interest, which one of those is correct? Based on my very limited knowledge of german, they are both acceptable, aren't they?

arpie42 avatar Mar 22 '18 17:03 arpie42

AFAIK they are moving towards capital ß instead of SS, but both are still accepted. I personally prefer ß: the length of the word is preserved, and there is no problems with things like "Straße".ToUpper().ToLower() = "strasse".

sgg-labs avatar Mar 22 '18 22:03 sgg-labs