rage-edit
Optimize stdout&stderr decoding for Windows ANSI Code Page
https://github.com/MikeKovarik/rage-edit/compare/master...asinbow:patch-1
This is actually NOT a PR, but a suggestion.
I have read the whole content of issue #9, and the changes in https://github.com/MikeKovarik/rage-edit/blob/6e18ee85437f2b97d493b93de43ed1e9e1e2c117/src/Registry.mjs#L44 .
`cmd.exe /c chcp` has side effects, and I don't think it should be the only choice.
In our own application we have implemented a function `ChangeCodePage` using `node-ffi`:
```javascript
const ChangeCodePage = (buffer, options = {}) => {
    const {
        fromCodePage = 'acp',
        toCodePage = 'utf8',
        toString = true,
        nullTerminated = false,
    } = options;
    // Using the Windows API: MultiByteToWideChar converts the source
    // code page to UTF-16, WideCharToMultiByte re-encodes it as UTF-8.
    return WideCharToMultiByte(MultiByteToWideChar(buffer, fromCodePage), {
        encoding: toCodePage,
        toString,
        nullTerminated,
    });
};
```
So, if there were a `Registry.decodeStdout` option, life would be much easier! 💯
At the moment `Registry.enableUnicode` is undocumented, but it and its side effect are already described in the readme of the next version:
https://github.com/MikeKovarik/rage-edit/tree/2.0.0-beta#options.unicode
In short, the current implementation reads the current code page, changes it to `65001`, reads the registry data, and instantly changes the code page back.
It would make sense to include `ffi` for perfectly safe unicode operations, but:

- `rage-edit` has no dependencies and is only ≈74 KB, while `ffi` is ≈23 MB and requires `node-gyp`-related stuff. It would be silly to include such a huge library just to correct unicode output (which mostly isn't even needed).
- The package depends on the built-in `reg.exe` CLI app, and broken unicode is not `reg.exe`'s fault but `cmd.exe`'s output. If `ffi` were a part of `rage-edit`, it would be more prudent to use the Registry API instead of just fixing `cmd.exe`-related problems.
The only idea I can think of is to internally check for `ffi` availability (without adding it to the dependencies list) and use your `ChangeCodePage` instead of `chcp` when `ffi` is found, but is that worth it? I mean, both `chcp` and `ChangeCodePage` look like hacks. Wouldn't other packages affect this package's output? Or should `ChangeCodePage` be called before each `reg.exe` call? The latter sounds like just a faster version of the current implementation.
Honestly, I don't know how `ffi` works, so I may be wrong.
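One way to probe for `ffi` without declaring it as a dependency is a guarded `require` (a sketch of the idea only; the fallback wiring is hypothetical):

```javascript
// Return the module if the host app happens to have it installed,
// null otherwise — without ever listing it in package.json.
function tryRequire(name) {
    try {
        return require(name)
    } catch (err) {
        return null
    }
}

const ffi = tryRequire('ffi')
// if (ffi) { use a ChangeCodePage-style converter }
// else     { fall back to the chcp round-trip }
```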
Thanks for your reply!
It's not a good idea to make `rage-edit` depend on `ffi`, yes, I agree. I only integrate `ffi` in my own application.
Directly converting the stdout buffer to a string leaves no chance to fix the text encoding manually:
```javascript
proc.stdout.on('data', data => stdout += data.toString())
```
In my opinion, it would be much better and friendlier if there were a `Registry.decodeStdout` option (a function).
What do you think about this?
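The hook I have in mind could be as simple as letting the caller replace the default `toString()` on the collected output (a sketch of the proposal; `decodeStdout` is not an existing rage-edit API):

```javascript
// Hypothetical: rage-edit would collect stdout 'data' events into chunks
// and apply a user-supplied decoder once at the end, instead of the
// default utf-8 toString().
function decodeStdout(chunks, decode) {
    const raw = Buffer.concat(chunks)
    // decode: (Buffer) => string, e.g. buf => iconv.decode(buf, 'cp866')
    return decode ? decode(raw) : raw.toString()
}
```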
Oh, that's what you mean. But `data` still depends on the default code page, at least in my case.
On my current PC the default code page is `866` (Russian) and I can read Cyrillic letters well using your method and a bit of iconv-lite's magic:
```javascript
proc.stdout.on('data', data => stdout += iconv.decode(data, 'cp866'))
```
But when I try to read something else (for example, 谷歌翻译幫助我), it turns into a bunch of nonconvertible data:
```javascript
await Registry.get('HKLM\\SOFTWARE\\aaaa', 'Unicode')
```
==>
```javascript
// No encodings, just turning the Buffer into a hex string
proc.stdout.on('data', data => stdout += data.toString('hex'))
```
==>
```
0d0a484b45595f4c4f43414c5f4d414348494e455c534f4654574152455c616161610d0a20202020556e69636f6465202020205245475f535a202020203f3f3f3f3f3f3f0d0a0d0a
```
==>
```
HKEY_LOCAL_MACHINE\SOFTWARE\aaaa
    Unicode    REG_SZ    ???????
```
In my case 谷歌翻译幫助我 turned into `???????` (`3f3f3f3f3f3f3f` in hex) because `cp866` doesn't support those characters.
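The substitution is easy to verify from the hex dump alone, with no encodings involved:

```javascript
// The suspicious region of the dump is seven 0x3f bytes — seven literal
// '?' characters that the console substituted for characters the current
// code page cannot represent.
const bytes = Buffer.from('3f3f3f3f3f3f3f', 'hex')
console.log(bytes.toString('ascii')) // '???????'
```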
So, the only thing you can do with the raw data is convert it from your system code page to utf-8 (which most probably could be automated). And again, it's just my own research, so I may be wrong.
Now, automating this process is a really great idea!
"Get Windows console codepage" reveals how to use the Windows API functions GetConsoleOutputCP and SetConsoleOutputCP. And in the documentation of SetConsoleOutputCP, Microsoft tells us the current code page is stored in `HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP`. We can query it out and then apply iconv-lite decoding.
In my case (and in yours too, it seems) it's `OEMCP`, not `ACP`.
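Putting the two together, the automated decoding could look roughly like this (a sketch under the assumption that `iconv-lite` is installed in the consuming app; the helper names are illustrative):

```javascript
// OEMCP holds the console code page as a string, e.g. '866' or '437';
// iconv-lite expects an encoding name like 'cp866'.
function oemcpToEncoding(oemcp) {
    return 'cp' + oemcp
}

// Read OEMCP from the registry, then decode raw reg.exe output with it.
async function decodeConsoleOutput(buffer) {
    const iconv = require('iconv-lite')   // loaded lazily, Windows-only path
    const {Registry} = require('rage-edit')
    const oemcp = await Registry.get(
        'HKLM\\SYSTEM\\CurrentControlSet\\Control\\Nls\\CodePage', 'OEMCP')
    return iconv.decode(buffer, oemcpToEncoding(oemcp))
}
```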
Well, we will see what can be done. All of that looks like a good "lite version" of the `unicode` option.
I just did some tests:
- raw `reg.exe` calls take ≈15 ms;
- `chcp 65001` + `reg.exe` + `chcp` (old value) takes ≈50 ms;
- `iconv.decode(data, 'cp866')` takes ≈45 ms.
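Timings like these can be gathered with a tiny harness of roughly this shape (illustrative only, not the exact benchmark used for the numbers above):

```javascript
// Average a few runs of an async operation, in milliseconds.
async function time(label, fn, runs = 20) {
    const start = process.hrtime.bigint()
    for (let i = 0; i < runs; i++) await fn()
    const ms = Number(process.hrtime.bigint() - start) / 1e6 / runs
    console.log(`${label}: ~${ms.toFixed(1)} ms per call`)
    return ms
}
```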
So both `chcp` and `iconv` are almost equally slow compared to raw calls, while `chcp` gives more guarantees (full confidence that data is not corrupted). I doubt that an app would be interrupted after the first `chcp` call and before the second one. Of course there's a tiny chance of that, but even the get-port package has a similar "failure" chance, and that's not the package's fault.
What I mean is, either `Registry.decodeStdout` or `iconv` could be added, but again, would it be worth it? Both of them depend on the system. If an app counts on the `936` code page but the system locale is English (`437`), you won't be able to just decode the input from `cp936` (just like `chcp 936` will always say "Invalid code page"), and decoding from `437` is pointless. There are too many cases where those options would behave differently in different situations. In my opinion, for public usage `iconv`/`decodeStdout` is not safe, and for personal usage `chcp` is already not that bad.
In short, the idea looks good for particular cases, but globally it would bloat the code without any great benefit. I wonder what @MikeKovarik thinks about all of that.