rage-edit
Optimize stdout&stderr decoding for Windows ANSI Code Page
https://github.com/MikeKovarik/rage-edit/compare/master...asinbow:patch-1
This is actually NOT a PR, but a suggestion.
I have read the whole content of issue #9, and the changes in https://github.com/MikeKovarik/rage-edit/blob/6e18ee85437f2b97d493b93de43ed1e9e1e2c117/src/Registry.mjs#L44 .
`cmd.exe /c chcp` has side effects, and I don't think it should be the only choice.
In our own application we have implemented a function `ChangeCodePage` using `node-ffi`:
```javascript
const ChangeCodePage = (buffer, options = {}) => {
    const {
        fromCodePage = 'acp',
        toCodePage = 'utf8',
        toString = true,
        nullTerminated = false,
    } = options;
    // Using the Windows API: MultiByteToWideChar converts the source
    // code page to UTF-16, WideCharToMultiByte re-encodes it as UTF-8.
    return WideCharToMultiByte(MultiByteToWideChar(buffer, fromCodePage), {
        encoding: toCodePage,
        toString,
        nullTerminated,
    });
};
```
So, if there were a `Registry.decodeStdout` option, life would be much easier! 💯
At the moment `Registry.enableUnicode` is undocumented, but it and its side effect are already described in the readme of the next version:
https://github.com/MikeKovarik/rage-edit/tree/2.0.0-beta#options.unicode
In short, the current implementation reads the current code page, changes it to `65001`, reads the registry data, and instantly changes the code page back.
It would make sense to include `ffi` for perfectly safe unicode operations, but:

- `rage-edit` has no dependencies and is only ≈74 KB, while `ffi` is ≈23 MB and requires `node-gyp`-related stuff. It would be silly to include such a huge library just to correct unicode output (which mostly isn't even needed).
- The package depends on the built-in `reg.exe` CLI app, and broken unicode is not `reg.exe`'s fault but `cmd.exe`'s output. If `ffi` were a part of `rage-edit`, it would be more prudent to use the Registry API instead of just fixing `cmd.exe`-related problems.
The only idea I can think of is to internally check for `ffi` availability (without adding it to the dependencies list) and use your `ChangeCodePage` instead of `chcp` when `ffi` is found, but is that worth it? I mean, both `chcp` and `ChangeCodePage` look like hacks. Wouldn't other packages affect this package's output? Or should `ChangeCodePage` be called before each `reg.exe` call? The latter sounds like just a faster version of the current implementation.
Honestly, I don't know how `ffi` works, so I may be wrong.
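One way to probe for `ffi` without declaring it as a dependency is a guarded `require` (a sketch of the idea only; the fallback wiring is hypothetical):

```javascript
// Return the module if the host app happens to have it installed,
// null otherwise — without ever listing it in package.json.
function tryRequire(name) {
    try {
        return require(name)
    } catch (err) {
        return null
    }
}

const ffi = tryRequire('ffi')
// if (ffi) { use a ChangeCodePage-style converter }
// else     { fall back to the chcp round-trip }
```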
Thanks for your reply!
It's not a good idea to make `rage-edit` depend on `ffi`, yes, I agree. I only integrate `ffi` in my own application.
Directly converting the stdout buffer to a string leaves no chance to fix the text encoding manually:
```javascript
proc.stdout.on('data', data => stdout += data.toString())
```
In my opinion, it would be much better and friendlier if there were a `Registry.decodeStdout` option (a function).
What do you think about this?
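The hook I have in mind could be as simple as letting the caller replace the default `toString()` on the collected output (a sketch of the proposal; `decodeStdout` is not an existing rage-edit API):

```javascript
// Hypothetical: rage-edit would collect stdout 'data' events into chunks
// and apply a user-supplied decoder once at the end, instead of the
// default utf-8 toString().
function decodeStdout(chunks, decode) {
    const raw = Buffer.concat(chunks)
    // decode: (Buffer) => string, e.g. buf => iconv.decode(buf, 'cp866')
    return decode ? decode(raw) : raw.toString()
}
```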
Oh, that's what you mean. But `data` still depends on the default code page, at least in my case.
On my current PC the default code page is `866` (Russian) and I can read Cyrillic letters well using your method and a bit of iconv-lite's magic:
```javascript
proc.stdout.on('data', data => stdout += iconv.decode(data, 'cp866'))
```
But when I try to read something else (for example, 谷歌翻译幫助我), it turns into a bunch of nonconvertible data:
```javascript
await Registry.get('HKLM\\SOFTWARE\\aaaa', 'Unicode')
```
==>
```javascript
// No encodings, just turning the Buffer into a hex string
proc.stdout.on('data', data => stdout += data.toString('hex'))
```
==>
```
0d0a484b45595f4c4f43414c5f4d414348494e455c534f4654574152455c616161610d0a20202020556e69636f6465202020205245475f535a202020203f3f3f3f3f3f3f0d0a0d0a
```
==>
```
HKEY_LOCAL_MACHINE\SOFTWARE\aaaa
    Unicode    REG_SZ    ???????
```
In my case 谷歌翻译幫助我 turned into `???????` (`3f3f3f3f3f3f3f` in hex) because `cp866` doesn't support those characters.
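The substitution is easy to verify from the hex dump alone, with no encodings involved:

```javascript
// The suspicious region of the dump is seven 0x3f bytes — seven literal
// '?' characters that the console substituted for characters the current
// code page cannot represent.
const bytes = Buffer.from('3f3f3f3f3f3f3f', 'hex')
console.log(bytes.toString('ascii')) // '???????'
```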
So, the only thing you can do with the raw data is convert it from your system code page to utf-8 (which most probably could be automated). And again, it's just my own research, so I may be wrong.
Now, automating this process is a really great idea!
"Get Windows console codepage" reveals how to use the Windows API functions GetConsoleOutputCP and SetConsoleOutputCP. And in the documentation of SetConsoleOutputCP, Microsoft tells us the current code page is stored in `HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP`. We can query it out and then apply iconv-lite decoding.
In my case (and in yours too, it seems) it's `OEMCP`, not `ACP`.
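Putting the two together, the automated decoding could look roughly like this (a sketch under the assumption that `iconv-lite` is installed in the consuming app; the helper names are illustrative):

```javascript
// OEMCP holds the console code page as a string, e.g. '866' or '437';
// iconv-lite expects an encoding name like 'cp866'.
function oemcpToEncoding(oemcp) {
    return 'cp' + oemcp
}

// Read OEMCP from the registry, then decode raw reg.exe output with it.
async function decodeConsoleOutput(buffer) {
    const iconv = require('iconv-lite')   // loaded lazily, Windows-only path
    const {Registry} = require('rage-edit')
    const oemcp = await Registry.get(
        'HKLM\\SYSTEM\\CurrentControlSet\\Control\\Nls\\CodePage', 'OEMCP')
    return iconv.decode(buffer, oemcpToEncoding(oemcp))
}
```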
Well, we will see what can be done. All of that looks like a good "lite version" of the `unicode` option.
I just did some tests:
- raw `reg.exe` calls take ≈15 ms;
- `chcp 65001` + `reg.exe` + `chcp` (old value) takes ≈50 ms;
- `iconv.decode(data, 'cp866')` takes ≈45 ms.
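Timings like these can be gathered with a tiny harness of roughly this shape (illustrative only, not the exact benchmark used for the numbers above):

```javascript
// Average a few runs of an async operation, in milliseconds.
async function time(label, fn, runs = 20) {
    const start = process.hrtime.bigint()
    for (let i = 0; i < runs; i++) await fn()
    const ms = Number(process.hrtime.bigint() - start) / 1e6 / runs
    console.log(`${label}: ~${ms.toFixed(1)} ms per call`)
    return ms
}
```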
So both `chcp` and `iconv` are almost equally slow compared to raw calls, while `chcp` gives more guarantees (full confidence that data is not corrupted). I doubt that an app would be interrupted after the first `chcp` call and before the second one. Of course there's a tiny chance of that, but even the get-port package has a similar "failure" chance, and that's not the package's fault.
What I mean is, either `Registry.decodeStdout` or `iconv` could be added, but again, would it be worth it? Both of them depend on the system. If an app counts on the `936` code page but the system locale is English (`437`), you won't be able to just decode the input from `cp936` (just like `chcp 936` will always say "Invalid code page"), and decoding from `437` is pointless. There are too many cases where those options would behave differently in different situations. In my opinion, for public usage `iconv`/`decodeStdout` is not safe, and for personal usage `chcp` is already not that bad.
In short, the idea looks good for particular cases, but globally it would bloat the code without any great benefit. I wonder what @MikeKovarik thinks about all of that.