node-pty icon indicating copy to clipboard operation
node-pty copied to clipboard

Several strings in win\pty.cc are converting wide strings to narrow strings

Open Tyriar opened this issue 8 years ago • 3 comments
trafficstars

From @rprichard in https://github.com/Tyriar/node-pty/pull/103#discussion_r126276801

I'm not 100% sure, but I think the initializer of msg_ converts the wide string to a narrow string by truncating each wchar_t value to a char. It will only be correct for ASCII characters. I'm guessing node.js and/or v8 want UTF-8 strings? Standard C++ has a way to do that conversion (as of C++11, I think -- I'm thinking of std::codecvt), and maybe node.js / v8 also does? Windows has C-style APIs to do the conversion.

I think the pattern already existed in the code, so it's probably not necessary to fix it in this PR? e.g. see shellpath_'s initializer.

Tyriar avatar Jul 09 '17 19:07 Tyriar

@Tyriar Imho the conversion is not needed at all since Windows uses internally a type of UTF-16 for wchar_t and so does Javascript (not sure though if they agree on all parts under all circumstances as there are several flavors of UTF-16).

Untested:

wchar_t *some_wstring;
...
Nan::MaybeLocal<v8::String> error = Nan::New<v8::String>(
    (uint16_t *) some_wstring,
    wcslen(some_wstring));
Nan::ThrowError(error.ToLocalChecked());

jerch avatar Jul 11 '17 18:07 jerch

@jerch I did something like that but I wasn't sure what the best way to test it was.

Tyriar avatar Jul 11 '17 21:07 Tyriar

@Tyriar The official docs state the following:

  • Windows: UTF-16 (see https://msdn.microsoft.com/en-us/library/dd374081.aspx), this source does not say anything about endianess, on a typical x86 Windows uses LE, probably they just use the arch endianess
  • Javascript: UTF-16 or UCS-2 (see http://es5.github.io/x2.html and https://mathiasbynens.be/notes/javascript-encoding), again no endianess is predefined, I guess they simply use the arch endianess again

V8 developer doc mentions UTF-16 for method v8::String::NewFromTwoByte (http://bespin.cz/~ondras/html/classv8_1_1String.html#a876615eb027092a6a71a4e7d69b82d00) - seems to be safe to just cast the wchar_t* to uint16_t*.

And for testing - hmm, to see if all wchar_ts get correctly mapped and understood in JS maybe you can use the codePointAt() method in JS and compare the values with the codepoint number from C++.

Btw the new for (xy in s) and the Array.from(s) in JS act on codepoints and not on charcodes anymore, which is helpful to get an iteration on those running. String.length still maps to the charcode length though.

jerch avatar Jul 12 '17 09:07 jerch