[Platform win32] Fix crash when pipe encoding is set to None
If stdout or stderr argument is of type io.stringIO, the function crash because stringIO has it's encoding property set to None. This issue has been introduced by this commit in version 4.7.0
Contributor Checklist:
- [ ] I have created a new test or updated the unit tests to cover the new/changed functionality.
- [x] I have updated
CHANGES.txt(and read theREADME.rst) - [ ] I have updated the appropriate documentation
Can you add a test? And a blurb to RELEASE.txt and CHANGES.txt ?
What's causing stdout.encoding to be None?
What's causing stdout.encoding to be None?
That's just the way it is (default) - from Python, not from us.
Can you add a test? And a blurb to RELEASE.txt and CHANGES.txt ?
I updated RELEASE.txt and CHANGES.txt. For the test is it not working yet and I am going in holiday for 3 weeks .
The proposed changes in this PR are similar to an issue that arose in the Godot project. The Godot issue was based on an implementation proposed in a SCons PR that was closed and subsequently implemented in Godot.
For reference:
- https://github.com/godotengine/godot/issues/91883#issue-2291482086
- https://github.com/godotengine/godot/pull/91890#issuecomment-2107368041
Excerpt from the Godot issue [1] proposed implementation which may apply here as well:
The binary read of the temporary stdout file followed by a decoded write to the stderr stream will likely contain extra newline characters for each line.
The binary read of the temporary stdout file contains CRLF (i.e.,
\r\n) line endings. The CRLF sequence remains after decoding. This effectively causes an extra new line for each line of content written.There may be unnecessary unicode replacement characters introduced in the written stderr content due to the implemented encoding.
The encoding selected is that of python's stdout stream. However, the temporary stdout file is populated by the output of the msvc compiler and linker commands. SCons uses encoding "oem" in some cases when processing msvc command output.
While the above applied specifically to msvc compiler and linker invocations, the general win32 redirected output is from the invoked program's stdout/stderr redirected to temporary files. The console encoding likely should be used for decoding the temporary output file reads.
It might be worth testing the following:
if stdout is not None and not stdoutRedirected:
try:
with open(tmpFileStdoutName, "rb") as tmpFileStdout:
output = tmpFileStdout.read()
- stdout.write(output.decode(stdout.encoding, "replace"))
+ stdout.write(output.decode("oem", "replace").replace("\r\n", "\n"))
os.remove(tmpFileStdoutName)
except OSError:
pass
if stderr is not None and not stderrRedirected:
try:
with open(tmpFileStderrName, "rb") as tmpFileStderr:
errors = tmpFileStderr.read()
- stderr.write(errors.decode(stderr.encoding, "replace"))
+ stderr.write(errors.decode("oem", "replace").replace("\r\n", "\n"))
os.remove(tmpFileStderrName)
except OSError:
pass
It also might be worth checking if multi-line writes to the redirected temporary files actually contain extra newline characters following a decode without the explicit replacement (e.g. .replace("\r\n", "\n")).
I know in some of the tests done when writing the Godot isssue, there were more unicode replacements when using a native Windows language other than English (e.g., German) prior to changing to "oem".
As always, I could be really wrong.
The following example illustrates the CRLF issue when decoding binary reads of the console output.
Note: any differences in the actual decoding method ("utf-8", "oem", etc., ...), if any, will be addressed in a future example.
Source file:
#include <stdio.h>
int main(int argc, char* argv) {
printf("hello");
return 0;
}
SConstruct:
import SCons
import sys
import io
import atexit
mystdout = StringIO()
mystderr = StringIO()
def dump_stringio():
mystderr.seek(0)
mystdout.seek(0)
print()
print("---------STDERR----------")
print(mystderr.read())
print("---------STDOUT----------")
print(mystdout.read())
print("-------------------------")
atexit.register(dump_stringio)
def piped_spawn(sh, escape, cmd, args, env):
from SCons.Platform.win32 import piped_spawn as scons_piped_spawn
rval = scons_piped_spawn(sh, escape, cmd, args, env, mystdout, mystderr)
return rval
class EnvironmentFactory:
program = 'hello'
program_dir = './src'
program_files = ['hello.c']
env_list = []
@classmethod
def make_program(cls, **kwargs):
build_n = len(cls.env_list) + 1
build = '_build{:03d}'.format(build_n)
print('Build:', build, kwargs, file=sys.stdout)
VariantDir(build, cls.program_dir, duplicate=0)
env=Environment(tools=['msvc', 'mslink'], **kwargs)
env["SPAWN"] = piped_spawn
build += '/'
env.Program(build + cls.program, [build + filename for filename in cls.program_files])
cls.env_list.append(env)
return env
for kwargs in [
{'MSVC_VERSION': '14.3', 'CCFLAGS': '/nologo /showIncludes /fakecl', 'LINKFLAGS': '/nologo /fakelink'},
]:
EnvironmentFactory.make_program(**kwargs)
PR decoding:
stdout.write(output.decode(stdout.encoding if stdout.encoding is not None else 'utf-8', "replace"))stderr.write(errors.decode(stderr.encoding if stderr.encoding is not None else 'utf-8', "replace"))
scons: Reading SConscript files ...
Build: _build001 {'MSVC_VERSION': '14.3', 'CCFLAGS': '/nologo /showIncludes /fakecl', 'LINKFLAGS': '/nologo /fakelink'}
scons: done reading SConscript files.
scons: Building targets ...
cl /Fo_build001\hello.obj /c src\hello.c /nologo /showIncludes /fakecl
link /nologo /fakelink /OUT:_build001\hello.exe _build001\hello.obj
scons: done building targets.
---------STDERR----------
cl : Command line warning D9002 : ignoring unknown option '/fakecl'
---------STDOUT----------
hello.c
Note: including file: C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdio.h
Note: including file: C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\corecrt.h
Note: including file: C:\Software\MSVS-2022-143-Com\VC\Tools\MSVC\14.40.33807\include\vcruntime.h
Note: including file: C:\Software\MSVS-2022-143-Com\VC\Tools\MSVC\14.40.33807\include\sal.h
Note: including file: C:\Software\MSVS-2022-143-Com\VC\Tools\MSVC\14.40.33807\include\concurrencysal.h
Note: including file: C:\Software\MSVS-2022-143-Com\VC\Tools\MSVC\14.40.33807\include\vadefs.h
Note: including file: C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\corecrt_wstdio.h
Note: including file: C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\corecrt_stdio_config.h
LINK : warning LNK4044: unrecognized option '/fakelink'; ignored
-------------------------
Suggested decoding:
stdout.write(output.decode("oem", "replace").replace("\r\n", "\n"))stderr.write(errors.decode("oem", "replace").replace("\r\n", "\n"))
scons: Reading SConscript files ...
Build: _build001 {'MSVC_VERSION': '14.3', 'CCFLAGS': '/nologo /showIncludes /fakecl', 'LINKFLAGS': '/nologo /fakelink'}
scons: done reading SConscript files.
scons: Building targets ...
cl /Fo_build001\hello.obj /c src\hello.c /nologo /showIncludes /fakecl
link /nologo /fakelink /OUT:_build001\hello.exe _build001\hello.obj
scons: done building targets.
---------STDERR----------
cl : Command line warning D9002 : ignoring unknown option '/fakecl'
---------STDOUT----------
hello.c
Note: including file: C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdio.h
Note: including file: C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\corecrt.h
Note: including file: C:\Software\MSVS-2022-143-Com\VC\Tools\MSVC\14.40.33807\include\vcruntime.h
Note: including file: C:\Software\MSVS-2022-143-Com\VC\Tools\MSVC\14.40.33807\include\sal.h
Note: including file: C:\Software\MSVS-2022-143-Com\VC\Tools\MSVC\14.40.33807\include\concurrencysal.h
Note: including file: C:\Software\MSVS-2022-143-Com\VC\Tools\MSVC\14.40.33807\include\vadefs.h
Note: including file: C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\corecrt_wstdio.h
Note: including file: C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\corecrt_stdio_config.h
LINK : warning LNK4044: unrecognized option '/fakelink'; ignored
-------------------------
The following example illustrates a decoding issue when using "utf-8" compared to "oem".
When the language is English there are no differences. When the language is German there are differences.
SConstruct changes:
force_encode_decode = False
def dump_stringio():
mystderr.seek(0)
mystdout.seek(0)
errors = mystderr.read()
output = mystdout.read()
if force_encode_decode:
errors = errors.encode(sys.stdout.encoding, errors="replace").decode(sys.stdout.encoding)
output = output.encode(sys.stdout.encoding, errors="replace").decode(sys.stdout.encoding)
print()
print("---------STDERR----------")
print(errors)
print("---------STDOUT----------")
print(output)
print("-------------------------")
atexit.register(dump_stringio)
...
for kwargs in [
{'MSVC_VERSION': '14.3', 'CCFLAGS': '/nologo /showIncludes /fakecl', 'LINKFLAGS': '/nologo /SUBSYSTEM:CONSOLE,4.0 /fakelink'},
]:
EnvironmentFactory.make_program(**kwargs)
PR decoding:
-
English
decode="utf-8", force_encode_decode=FalseLINK : warning LNK4010: invalid subsystem version number 4.0; default subsystem version assumed
-
German:
-
decode="utf-8", force_encode_decode=False (unicode error printing)UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 1047: character maps to <undefined> -
decode="utf-8", force_encode_decode=True (replacement character encode/decode)LINK : warning LNK4010: Ung?ltige Versionsnummer 4.0; Standardversion des Subsystems wird angenommen. -
decode="oem", force_encode_decode=FalseLINK : warning LNK4010: Ungültige Versionsnummer 4.0; Standardversion des Subsystems wird angenommen.
-
Suggested decoding:
-
English:
decode="oem", force_encode_decode=FalseLINK : warning LNK4010: invalid subsystem version number 4.0; default subsystem version assumed
-
German:
- decode="oem", force_encode_decode=False`
LINK : warning LNK4010: Ungültige Versionsnummer 4.0; Standardversion des Subsystems wird angenommen.
- decode="oem", force_encode_decode=False`
Right... and MSCommon uses oem, as we were advised by someone a while back, once we got to Python versions where that was consistently supported (3.6+). So is there a problem with this?
Right... and MSCommon uses oem, as we were advised by someone a while back, once we got to Python versions where that was consistently supported (3.6+). So is there a problem with this?
The first issue is that when passing a StringIO object (e.g., obj = StringIO()) for stdout/stderr in the piped spawn is that the obj.encoding is None.
I believe that decode should just use "oem". The temporary file contents are populated by redirecting the output from the invoked process. I believe that the console encoding should be used rather than the python stream encoding.
Instead of using the stream encoding as in this PR:
stdout.write(output.decode(stdout.encoding if stdout.encoding is not None else 'utf-8', "replace")) and
stderr.write(errors.decode(stderr.encoding if stderr.encoding is not None else 'utf-8', "replace"))
Use the "oem" encoding (like MSCommon):
stdout.write(output.decode("oem", "replace")) and
stderr.write(errors.decode("oem", "replace"))
If the temporary file reads remain in binary, then the explicit CR LF sequence needs to be replaced with a NL to avoid the "extra" blank lines:
stdout.write(output.decode("oem", "replace").replace("\r\n", "\n")) and
stderr.write(errors.decode("oem", "replace").replace("\r\n", "\n"))
The changes for this PR are:
if stdout is not None and not stdoutRedirected:
try:
with open(tmpFileStdoutName, "rb") as tmpFileStdout:
output = tmpFileStdout.read()
- stdout.write(output.decode(stdout.encoding, "replace"))
+ stdout.write(output.decode(stdout.encoding if stdout.encoding is not None else 'utf-8', "replace"))
os.remove(tmpFileStdoutName)
except OSError:
pass
if stderr is not None and not stderrRedirected:
try:
with open(tmpFileStderrName, "rb") as tmpFileStderr:
errors = tmpFileStderr.read()
- stderr.write(errors.decode(stderr.encoding, "replace"))
+ stderr.write(errors.decode(stderr.encoding if stderr.encoding is not None else 'utf-8', "replace"))
os.remove(tmpFileStderrName)
except OSError:
pass
I believe that this may be more robust:
if stdout is not None and not stdoutRedirected:
try:
with open(tmpFileStdoutName, "rb") as tmpFileStdout:
output = tmpFileStdout.read()
- stdout.write(output.decode(stdout.encoding, "replace"))
+ stdout.write(output.decode("oem", "replace").replace("\r\n", "\n"))
os.remove(tmpFileStdoutName)
except OSError:
pass
if stderr is not None and not stderrRedirected:
try:
with open(tmpFileStderrName, "rb") as tmpFileStderr:
errors = tmpFileStderr.read()
- stderr.write(errors.decode(stderr.encoding, "replace"))
+ stderr.write(errors.decode("oem", "replace").replace("\r\n", "\n"))
os.remove(tmpFileStderrName)
except OSError:
pass
I refactored your fix, I'm not a big fan of x if xyz else abc form. This way we're clearly calling out the invalid None value.
The refactored code introduced bugs.
The stream encoding attribute may not be writable:
# Sanitize encoding. None is not a valid encoding.
# Since we're handling a redirected shell command use
# the shells default encoding.
if stdout.encoding is None:
stdout.encoding = 'oem' <==== StringIO encoding not writable
if stderr.encoding is None:
stderr.encoding = 'oem' <==== StringIO encoding not writable
Writing to the encoding attribute of a StringIO object is not allowed:
stdout.encoding = 'oem'
AttributeError: attribute 'encoding' of '_io._TextIOBase' objects is not writable
The original implementation didn't attempt to assign the encoding but rather picked the encoding to pass to the decode call. If one really wants to use the stdout.encoding and stderr.encoding values if they are not None, then temporaries values should be used.
For example:
...
# Since we're handling a redirected shell command use
# the shells default encoding.
if stdout.encoding is None:
stdout_encoding = 'oem'
else:
stdout_encoding = stdout.encoding
if stderr.encoding is None:
stderr_encoding = 'oem'
else:
stderr_encoding = stderr.encoding
...
stdout.write(output.decode(stdout_encoding, "replace").replace("\r\n", "\n"))
...
stderr.write(errors.decode(stderr_encoding, "replace").replace("\r\n", "\n"))
But...
I'm not sure we want to use stdout.encoding and stderr.encoding even if they are not None to decode the temporary files which are the result of redirected output from a (likely) external command.
It may be a good idea to always use "oem" rather than what may be set for the python stream encoding. The temporary file is likely being written by an external process and not necessarily the current python.
Using the python stream encoding versus an external process console encoding seems like an apples versus oranges type of issue.
SCons\Tool\MSCommon\common simply uses "oem" to decode subprocess (not temporary file) output:
# Ongoing problems getting non-corrupted text led to this
# changing to "oem" from "mbcs" - the scripts run presumably
# attached to a console, so some particular rules apply.
OEM = "oem"
if cp.stderr:
# TODO: find something better to do with stderr;
# this at least prevents errors from getting swallowed.
sys.stderr.write(cp.stderr.decode(OEM))
if cp.returncode != 0:
raise OSError(cp.stderr.decode(OEM))
return cp.stdout.decode(OEM)
I suggest removing stdout.encoding and stderr.encoding entirely and always using "oem" for decoding the temporary file content:
- # Sanitize encoding. None is not a valid encoding.
- # Since we're handling a redirected shell command use
- # the shells default encoding.
- if stdout.encoding is None:
- stdout.encoding = 'oem'
- if stderr.encoding is None:
- stderr.encoding = 'oem'
...
- stdout.write(output.decode(stdout.encoding, "replace").replace("\r\n", "\n"))
+ stdout.write(output.decode("oem", "replace").replace("\r\n", "\n"))
...
- stderr.write(errors.decode(stderr.encoding, "replace"))
+ stderr.write(errors.decode("oem", "replace").replace("\r\n", "\n"))
@mwichmann Any thoughts?
@jcbrill as usual your suggested fix and analysis are well thought out! If I'm understanding correctly though, there's no good way to test this unless we fired up a VM where the system default language was german?
If I'm understanding correctly though, there's no good way to test this unless we fired up a VM where the system default language was german?
Exactly.
That is how the output fragments above were produced. A windows 11 VMWare virtual machine with german as the default system language and necessary language packs installed.
German was the first non-English system language that I tried when trying to find any encoding issues with the proposed spawn changes for the godot project and the SCons PR that was eventually rejected due to being implemented in/around the spawn code for win32.
Producing compiler warnings and error messages illustrated the extra newline issue.
The details are a little fuzzy now, but there may have also been reliance on either an english word or a phrase order that changed with the language for the original godot-like proposed changes.
Shown above, one of the german warning messages contains the ü character which was replaced with ? when decoding.
Amusingly (maybe), the last time we had a go-round with Windows and non-ascii characters, I set up the Swedish language support, as it's a native language for me, where German is only a learned language (and largely unused for many many years). But it turned out Swedish is not a language the compiler suite produces i18n messages for, so I had to switch that test setup to German (I no longer have that VM)