simplecpp wide char

I think simplecpp should be able to handle wide characters.

Jul 21 '16 14:07 danmar

static unsigned char readChar(std::istream &istr, unsigned int bom)
{
    unsigned char ch = (unsigned char)istr.get();

    // For UTF-16 encoded files the BOM is 0xfeff/0xfffe. If the
    // character is non-ASCII character then replace it with 0xff
    if (bom == 0xfeff || bom == 0xfffe) {
        const unsigned char ch2 = (unsigned char)istr.get();
        const int ch16 = (bom == 0xfeff) ? (ch<<8 | ch2) : (ch2<<8 | ch);
        ch = (unsigned char)((ch16 >= 0x80) ? 0xff : ch16);
    }

    // Handling of newlines..
    if (ch == '\r') {
        ch = '\n';
        if (bom == 0 && (char)istr.peek() == '\n')
            (void)istr.get();
        else if (bom == 0xfeff || bom == 0xfffe) {
            int c1 = istr.get();
            int c2 = istr.get();
            int ch16 = (bom == 0xfeff) ? (c1<<8 | c2) : (c2<<8 | c1);
            if (ch16 != '\n') {
                istr.unget();
                istr.unget();
            }
        }
    }

    return ch;
}

The current master head already has this function, which I think can handle wide characters (UTF-16).

Another way to handle this is to convert the file content to UTF-8 when you detect that the file is UTF-16. Otherwise, the return value from this function can't fit in a single byte (unsigned char).
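A rough sketch of what such a conversion could look like, just to make the idea concrete. The helper names (readUnit, appendUtf8, utf16ToUtf8) are made up for illustration and are not part of simplecpp. The sketch assumes the BOM bytes have already been skipped and reuses the bom convention of readChar (0xfeff means the high byte comes first).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: read one 16-bit code unit starting at 'pos'.
// bom == 0xfeff means high byte first, as in readChar above.
static unsigned long readUnit(const std::string &in, std::size_t pos, unsigned int bom)
{
    const unsigned long b0 = static_cast<unsigned char>(in[pos]);
    const unsigned long b1 = static_cast<unsigned char>(in[pos + 1]);
    return (bom == 0xfeff) ? (b0 << 8 | b1) : (b1 << 8 | b0);
}

// Hypothetical helper: append one Unicode code point as UTF-8 (1 to 4 bytes).
static void appendUtf8(std::string &out, unsigned long cp)
{
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert raw UTF-16 bytes (BOM already removed) to UTF-8.
// A trailing odd byte is ignored; unpaired surrogates become U+FFFD.
static std::string utf16ToUtf8(const std::string &in, unsigned int bom)
{
    std::string out;
    std::size_t i = 0;
    while (i + 1 < in.size()) {
        unsigned long cp = readUnit(in, i, bom);
        i += 2;
        // A high surrogate followed by a low surrogate encodes one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            const unsigned long lo = readUnit(in, i, bom);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            }
        }
        if (cp >= 0xD800 && cp <= 0xDFFF)
            cp = 0xFFFD; // unpaired surrogate
        appendUtf8(out, cp);
    }
    return out;
}
```

With something like this, the rest of the code could keep working on plain bytes, and a non-ASCII character would become a multi-byte sequence in the string instead of being squeezed into one unsigned char.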

If I remember correctly, GCC and Clang both use UTF-8 internally, so the current approach should be OK. Unicode characters should only exist in comments.

Oct 01 '17 14:10 asmwarrior

Token::str is a normal std::string and does not allow wide char data.

Try to preprocess such code:

int åäö = 123;
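To illustrate what happens to that identifier in a UTF-16 file, here is a small sketch. readCharUtf16 is a reduced copy of the UTF-16 branch of the readChar function quoted above (newline handling left out), and readAll is a made-up driver for the example; neither name exists in simplecpp.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reduced copy of the UTF-16 branch of readChar (newline handling omitted).
// Any code unit >= 0x80 is replaced with 0xff.
static unsigned char readCharUtf16(std::istream &istr, unsigned int bom)
{
    const unsigned char ch = (unsigned char)istr.get();
    const unsigned char ch2 = (unsigned char)istr.get();
    const int ch16 = (bom == 0xfeff) ? (ch << 8 | ch2) : (ch2 << 8 | ch);
    return (unsigned char)((ch16 >= 0x80) ? 0xff : ch16);
}

// Hypothetical driver: read a whole UTF-16 byte buffer (BOM already skipped)
// into a std::string, one byte per code unit.
static std::string readAll(const std::string &bytes, unsigned int bom)
{
    std::istringstream istr(bytes);
    std::string out;
    while (istr.peek() != std::char_traits<char>::eof())
        out += static_cast<char>(readCharUtf16(istr, bom));
    return out;
}
```

Feeding the UTF-16 bytes of åäö through readAll gives three 0xff bytes, so the identifier is no longer distinguishable from any other three non-ASCII characters.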

Oct 03 '17 19:10 danmar

OK, I see. For reference, see the Stack Overflow question "c++ - 😃 (and other unicode characters) in identifiers not allowed by g++" and GCC Bug 67224. It sounds like only GCC does not support this feature.

Oct 04 '17 02:10 asmwarrior