wide char
I think simplecpp should be able to use wide characters
static unsigned char readChar(std::istream &istr, unsigned int bom)
{
    unsigned char ch = (unsigned char)istr.get();

    // For UTF-16 encoded files the BOM is 0xfeff/0xfffe. If the
    // character is non-ASCII character then replace it with 0xff
    if (bom == 0xfeff || bom == 0xfffe) {
        const unsigned char ch2 = (unsigned char)istr.get();
        const int ch16 = (bom == 0xfeff) ? (ch<<8 | ch2) : (ch2<<8 | ch);
        ch = (unsigned char)((ch16 >= 0x80) ? 0xff : ch16);
    }

    // Handling of newlines..
    if (ch == '\r') {
        ch = '\n';
        if (bom == 0 && (char)istr.peek() == '\n')
            (void)istr.get();
        else if (bom == 0xfeff || bom == 0xfffe) {
            int c1 = istr.get();
            int c2 = istr.get();
            int ch16 = (bom == 0xfeff) ? (c1<<8 | c2) : (c2<<8 | c1);
            if (ch16 != '\n') {
                istr.unget();
                istr.unget();
            }
        }
    }

    return ch;
}
The current master HEAD already has such a function, which I think can handle wide characters (UTF-16).
Another way to handle this kind of issue would be to convert the file content to UTF-8 when you detect that the file is UTF-16. Otherwise, the value this function has to return cannot fit in a single byte (unsigned char).
If I remember correctly, GCC and Clang both use UTF-8 internally, so the current approach should be OK. Non-ASCII Unicode characters should only exist in comments.
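To make the UTF-8 suggestion concrete, here is a minimal sketch of such a conversion. Both `appendUtf8` and `utf16ToUtf8` are hypothetical helper names, not existing simplecpp functions, and the sketch assumes the BOM has already been read and is passed in using the same 0xfeff/0xfffe convention as `readChar`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of simplecpp): append one code point as UTF-8.
static void appendUtf8(std::string &out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert raw UTF-16 bytes (BOM already consumed) to UTF-8.
// bom == 0xfeff means big endian, anything else is treated as little endian.
// Unpaired surrogates are not validated; they are encoded as-is.
static std::string utf16ToUtf8(const std::string &bytes, unsigned int bom)
{
    // Read the 16-bit code unit starting at byte offset pos.
    auto unit = [&](std::size_t pos) -> std::uint32_t {
        const unsigned char b1 = static_cast<unsigned char>(bytes[pos]);
        const unsigned char b2 = static_cast<unsigned char>(bytes[pos + 1]);
        return (bom == 0xfeff) ? (b1 << 8 | b2) : (b2 << 8 | b1);
    };

    std::string out;
    std::size_t i = 0;
    while (i + 1 < bytes.size()) {
        std::uint32_t cp = unit(i);
        i += 2;
        // Combine a high surrogate with a following low surrogate.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < bytes.size()) {
            const std::uint32_t low = unit(i);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            }
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

With something like this in front of the tokenizer, every byte handed on still fits in an unsigned char, and non-ASCII characters survive as multi-byte sequences instead of being collapsed to 0xff.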
Token::str is a plain std::string and cannot hold wide char data.
Try to preprocess such code:
int åäö = 123;
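As a rough illustration of what happens to that identifier when the file is UTF-16, the sketch below restates only the UTF-16 branch of the `readChar` function quoted earlier in this thread. The names `collapseUtf16Unit` and `collapseUtf16` are made up for this example and do not exist in simplecpp; newline handling is left out.

```cpp
#include <cstddef>
#include <string>

// Stand-alone restatement of the UTF-16 branch of readChar (hypothetical name).
// Any code unit >= 0x80 is replaced with 0xff.
static unsigned char collapseUtf16Unit(unsigned char ch, unsigned char ch2, unsigned int bom)
{
    const int ch16 = (bom == 0xfeff) ? (ch << 8 | ch2) : (ch2 << 8 | ch);
    return static_cast<unsigned char>((ch16 >= 0x80) ? 0xff : ch16);
}

// Apply that step to a whole UTF-16 buffer (BOM already consumed).
static std::string collapseUtf16(const std::string &bytes, unsigned int bom)
{
    std::string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        out += static_cast<char>(collapseUtf16Unit(
            static_cast<unsigned char>(bytes[i]),
            static_cast<unsigned char>(bytes[i + 1]),
            bom));
    }
    return out;
}
```

Feeding the UTF-16LE bytes for åäö (E5 00, E4 00, F6 00) through `collapseUtf16` yields three 0xff bytes, so distinct non-ASCII identifiers become indistinguishable once they reach the token string.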
OK, I see. For reference, see "😃 (and other unicode characters) in identifiers not allowed by g++" on Stack Overflow and GCC Bug 67224. It sounds like only GCC does not support this feature.