PlatformIO STM32 Build Fails on Windows Chinese Version: TMP File Encoding Issue
Describe the bug
When I use PlatformIO to compile an STM32 program, I get a linking failure. After debugging, I found that when the command line becomes too long, the command arguments are written to a temporary file via the TempFileMunge method. However, since I'm using the Chinese version of Windows, the default encoding is cp936: if the temporary file is written with UTF-8 encoding, gcc fails when reading it, whereas if it is written with GBK encoding, everything works fine.
os.write(fd, bytearray(join_char.join(args) + "\n", 'gbk'))
Therefore, the correct encoding should be selected based on the system's default locale when writing the temporary file.
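For illustration, a minimal sketch of that idea (a hypothetical helper, not SCons' actual TempFileMunge code), assuming `locale.getpreferredencoding(False)` reports the system's default encoding:

```python
import locale
import os
import tempfile

def write_response_file(args, join_char=" "):
    """Write command arguments to a temp file using the system's
    preferred encoding (e.g. cp936 on Chinese Windows) rather than
    hard-coded utf-8, so a locale-aware tool can read it back.
    Returns the temp file path and the encoding used."""
    encoding = locale.getpreferredencoding(False)
    fd, path = tempfile.mkstemp(suffix=".tmp")
    os.write(fd, bytearray(join_char.join(args) + "\n", encoding))
    os.close(fd)
    return path, encoding
```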
Required information
- Link to SCons Users thread discussing your issue.
- Version of SCons scons-local-4.8.1
- Version of Python Python 3.11.7 (tags/v3.11.7:fa7a6f2, Dec 4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)] on win32
- Which python distribution if applicable (python.org, cygwin, anaconda, macports, brew, etc.)
- How you installed SCons
- What Platform are you on? (Linux/Windows and which version) Windows 10 Pro 22H2 19045.6093 (Chinese Version)
- How to reproduce your issue? Please include a small self contained reproducer. Likely a SConstruct should do for most issues.
- How you invoke scons (The command line you're using "scons --flags some_arguments")
You should file your issue with PlatformIO! Also 4.8.1 is not the newest version of SCons. Please try with 4.9.1
If 4.9.1 doesn't fix it, let us know and we'll take a look.
I've tried it already. PlatformIO uses the package [email protected], which references [email protected], so I can't simply upgrade it separately. However, I looked at the relevant code in version 4.9.1, and the temporary file is still written using UTF-8 encoding. Therefore, I suspect that compatibility issues still exist.
os.write(fd, bytearray(join_char.join(args) + "\n", encoding="utf-8"))
@yehjf what does the following python code output on your system
import sys
print(sys.getdefaultencoding())
It displays utf-8.
After my debugging, I found that after writing the temporary file (which contains Chinese characters), the command is eventually executed via the spawnve method:
C:\WINDOWS\System32\cmd.exe /C arm-none-eabi-g++ @D:\\Desktop\\测试\\firmware\\.pio\\build\\bluepill_f103c6\\tmpagpc__cg.tmp
The key issue is that the g++ program reads this temporary file using GBK encoding. I suspect that g++ uses the default encoding from the Windows locale settings when opening the tmp file.
I tried running the command chcp 65001 before execution to switch to UTF-8 mode, but it still didn't work. However, when I changed the encoding to GBK when writing the tmp file, the command was executed successfully.
By default, the result of chcp on my system is 936, which corresponds to GBK encoding.
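As a side note, Python itself treats cp936 as an alias for its gbk codec, which can be checked with the standard codecs module:

```python
import codecs

# cp936 and gbk resolve to the same codec in Python's alias table
print(codecs.lookup("cp936").name)  # gbk
print("测试".encode("cp936") == "测试".encode("gbk"))  # True
```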
Yeah, these days Python itself defaults to utf-8 (for the most part), and we're forcing it in that case, that's not necessarily the case for an external command.
Can you share the output of "set" in a cmd shell on your computer?
ALLUSERSPROFILE=C:\ProgramData
ANDROID_SDK_ROOT=D:\android-sdk
APPDATA=C:\Users\Hello\AppData\Roaming
CommonProgramFiles=C:\Program Files\Common Files
CommonProgramFiles(x86)=C:\Program Files (x86)\Common Files
CommonProgramW6432=C:\Program Files\Common Files
COMPUTERNAME=DESKTOP-P357RS7
ComSpec=C:\WINDOWS\system32\cmd.exe
DriverData=C:\Windows\System32\Drivers\DriverData
GHIDRA_INSTALL_DIR=D:\scoop\apps\ghidra\current
GIT_INSTALL_ROOT=D:\scoop\apps\git\current
GRADLE_USER_HOME=D:\scoop\apps\gradle\current\.gradle
HADOOP_HOME=D:\hadoop-3.3.5
HOMEDRIVE=C:
HOMEPATH=\Users\Hello
JAVA_HOME=C:\Program Files\Java\jdk1.8.0_181
LOCALAPPDATA=C:\Users\Hello\AppData\Local
LOGONSERVER=\\DESKTOP-P357RS7
NUMBER_OF_PROCESSORS=6
OneDrive=C:\Users\Hello\OneDrive
OneDriveConsumer=C:\Users\Hello\OneDrive
OS=Windows_NT
Path=D:\scoop\apps\shims;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\OpenSSH\;C:\Program Files\Java\jdk1.8.0_181\bin;C:\Program Files (x86)\Windows Kits\10\Debuggers\x64;C:\Program Files (x86)\ZeroTier\One\;C:\Program Files\dotnet\;E:\metasploit-framework\bin\;C:\Program Files\Docker\Docker\resources\bin;D:\hadoop-3.3.5\bin;D:\scoop\apps\python36\current\Scripts;D:\scoop\apps\python36\current;D:\scoop\apps\nodejs\current\bin;D:\scoop\apps\nodejs\current;D:\scoop\apps\vscode\current\bin;D:\scoop\apps\python39\current\Scripts;D:\scoop\apps\python39\current;D:\scoop\apps\openjdk\current\bin;D:\scoop\apps\python310\current\Scripts;D:\scoop\apps\python310\current;D:\scoop\apps\python\current\Scripts;D:\scoop\apps\python\current;C:\Users\Hello\go\bin;D:\scoop\shims;C:\Users\Hello\AppData\Local\Microsoft\WindowsApps;C:\Users\Hello\.dotnet\tools
PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
PROCESSOR_ARCHITECTURE=AMD64
PROCESSOR_IDENTIFIER=Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
PROCESSOR_LEVEL=6
PROCESSOR_REVISION=9e0d
ProgramData=C:\ProgramData
ProgramFiles=C:\Program Files
ProgramFiles(x86)=C:\Program Files (x86)
ProgramW6432=C:\Program Files
PROMPT=$P$G
PSModulePath=C:\Program Files\WindowsPowerShell\Modules;C:\WINDOWS\system32\WindowsPowerShell\v1.0\Modules
PUBLIC=C:\Users\Public
PYSPARK_DRIVER_PYTHON=D:\scoop\apps\python\current\python.exe
PYSPARK_PYTHON=D:\scoop\apps\python\current\python.exe
SBT_HOME=D:\scoop\apps\sbt\current
SCALA_HOME=D:\scoop\apps\scala2\current
SESSIONNAME=Console
SystemDrive=C:
SystemRoot=C:\WINDOWS
TEMP=C:\Users\Hello\AppData\Local\Temp
TMP=C:\Users\Hello\AppData\Local\Temp
USERDOMAIN=DESKTOP-P357RS7
USERDOMAIN_ROAMINGPROFILE=DESKTOP-P357RS7
USERNAME=Hello
USERPROFILE=C:\Users\Hello
windir=C:\WINDOWS
WIRESHARK_CONFIG_DIR=D:\scoop\apps\wireshark\current\Data
WIRESHARK_DATA_DIR=D:\scoop\apps\wireshark\current\Data
ZES_ENABLE_SYSMAN=1
How about trying this:
import locale
default_encoding = locale.getpreferredencoding(False)
print(f"The Windows default encoding for Python is: {default_encoding}")
The Windows default encoding for Python is: cp936
To some of the above: sys.getdefaultencoding() is not useful; it's defined to return utf-8 these days. That's for Python. locale.getencoding() is far more interesting.
sys.getdefaultencoding(): Return 'utf-8'. This is the name of the default string encoding, used in methods like str.encode.
I wonder if the answer is to just omit the encoding, letting it default? Hard to test for me, I only have English Windows, and on Linux it's hard to force things into non-UTF-8.
I only have English Windows
With a VMWare or VirtualBox Windows VM, it would be possible to change the Windows language/locale for testing.
I did that for some testing a while back using German. I think the language and keyboard were changed, but I don't remember exactly. Not knowing the language made using the keyboard a challenge when launching the tests, though.
I've done that too, but it needs to be a different enough locale to trigger the problem.
Wouldn't setting Windows to a locale that uses cp936 be enough (e.g., simplified Chinese)?
Steps:
- Cloned Base Windows 11 VM
- Installed the simplified Chinese language pack
- Changed the Windows display language to simplified Chinese
- Changed the system locale for non-Unicode programs to simplified Chinese
- Restarted Windows
Script:
import sys
import locale
default_encoding = locale.getpreferredencoding(False)
print(f"The Windows default encoding for Python is: {default_encoding}")
sys_default_encoding = sys.getdefaultencoding()
print(f"The Python default encoding is: {sys_default_encoding}")
Output:
The Windows default encoding for Python is: cp936
The Python default encoding is: utf-8
I may be able to help with testing. I am time limited for the next few weeks, but given instructions, could probably run tests as desired.
A reproducer is not that trivial for us to set up - we need to cause a file to be written in utf-8 encoding that a non-Python program is going to read and fail while using the active (non-utf-8) encoding. The example cited here is a gcc response file and the program is g++ from the ARM cross-toolchain, and the encoding is 'gbk' (which on Windows, is indeed cp936). It's easy enough to simulate creating such a file, but it needs to end up containing characters that won't decode with gbk - which I'm surprised happens in constructing a compile command. My limited understanding is the bottom 127 characters are still the standard ASCII set, which is all I'd expect to be produced in a command line.
You know more about this than I do.
Is it possible that there might be something like a literal string on the command line (e.g., defines, filenames, etc.) that is valid in utf-8 but not in cp936? utf-8 is "much larger" than cp936.
Multibyte is handled differently in the two. So incompatibility is not surprising, but only if something is producing such characters for the command line.
SCons TempFileMunge:
os.write(fd, bytearray(join_char.join(args) + "\n", encoding="utf-8"))
OP's command-line:
C:\WINDOWS\System32\cmd.exe /C arm-none-eabi-g++ @D:\\Desktop\\测试\\firmware\\.pio\\build\\bluepill_f103c6\\tmpagpc__cg.tmp
The key issue is that the g++ program reads this temporary file using GBK encoding. I suspect that g++ uses the default encoding from the Windows locale settings when opening the tmp file.
I suspect the OP is correct: the encoding of the directory name in the middle of the path produces different byte sequences in utf-8 and cp936/gbk.
OP path:
C:\WINDOWS\System32\cmd.exe /C arm-none-eabi-g++ @D:\\Desktop\\测试\\firmware\\.pio\\build\\bluepill_f103c6\\tmpagpc__cg.tmp
Translated:
C:\WINDOWS\System32\cmd.exe /C arm-none-eabi-g++ @D:\\Desktop\\Test\\firmware\\.pio\\build\\bluepill_f103c6\\tmpagpc__cg.tmp
Code fragment for directory name 测试 (test):
dirname_string = "测试"
print(f"utf8: {dirname_string.encode('utf-8')}")
print(f"cp936: {dirname_string.encode('cp936')}")
print(f"gbk: {dirname_string.encode('gbk')}")
Output:
utf8: b'\xe6\xb5\x8b\xe8\xaf\x95'
cp936: b'\xb2\xe2\xca\xd4'
gbk: b'\xb2\xe2\xca\xd4'
It appears to be a problem if the temp file is written in utf-8 while gcc "falls back" to the default locale encoding (cp936/gbk), and the file contains elements that are not compatible between the two encodings.
As always, I could be wrong...
Okay, so there are a few possibilities. I was assuming it was the contents of the response file, but maybe it's the path to the response file? NTFS stores filenames in UTF-16 (at least according to Microsoft itself); FAT filesystems may use the locale setting. But presumably, to work correctly on Windows, the gcc build can handle a command line containing utf-8-encoded pathnames (note that when we issue commands to Windows, we're still using the ancient spawnve method, as already noted earlier in this issue; I don't know quite what that does as far as encodings, but it doesn't give you the control subprocess does). So it feels more likely it's pathnames encoded in the response file that are giving the problems. What do I know.
I was assuming it was the contents of the response file, but maybe it's the path to the response file?
I'm not sure, if it is the path to the response file, the contents of the response file, or both.
Given the example path, if the command generated in the response file contained one or more paths (e.g., source file, lib paths, etc) with file system names that are different when encoded with utf8 and gbk there could be issues depending on what gcc is expecting.
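A small simulation of that failure mode (illustrative only): encode a path containing a Chinese directory name in utf-8, then decode the bytes the way a gbk-reading tool would:

```python
# Hypothetical response-file content with a non-ASCII directory name.
arg = r"D:\Desktop\测试\firmware\main.cpp"
utf8_bytes = arg.encode("utf-8")

# A tool reading the file with the locale encoding (gbk) sees mojibake:
seen_by_gbk_reader = utf8_bytes.decode("gbk", errors="replace")
print(seen_by_gbk_reader)  # the directory name no longer matches the original
```

The round trip does not reproduce the original path, so any path the tool extracts from the file would point at a nonexistent directory.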
If using locale.getpreferredencoding(False) doesn't break utf-8 systems, and fixes the users, I think we can just make the change and run full test suite here and call it a day.
Certainly the path of least energy.
The test suite should probably be run in environments with locales other than English though.
Should that not work, one could test if the original argument string can be round-tripped through the locale preferred encoding.
In a nutshell:
- encode the argument string using utf-8
- decode the bytes to a temporary string using locale.getpreferredencoding(False)
- compare the decoded temporary string to the original argument string:
  - if EQUAL: return the utf-8 encoded bytes
  - if NOT EQUAL: return the locale.getpreferredencoding(False) encoded bytes
Script for exposition purposes:
import locale
import sys

print(f"{locale.getpreferredencoding()=}")

test_strings = ["test"]
if locale.getpreferredencoding(False) == "cp936":
    test_strings.append("测试")

_default_encoding = None

def _windows_encode_string(output_string):
    global _default_encoding
    if _default_encoding is None:
        _default_encoding = locale.getpreferredencoding(False)
    output_encoding = "utf-8"
    output_bytes = output_string.encode(output_encoding, errors="strict")
    roundtrip_string = output_bytes.decode(_default_encoding, errors="ignore")
    if roundtrip_string != output_string:
        try:
            output_bytes = output_string.encode(_default_encoding, errors="strict")
            output_encoding = _default_encoding
        except UnicodeEncodeError:
            pass
    return output_bytes, output_encoding

_is_windows = sys.platform == "win32"

def _encode_string(output_string):
    if _is_windows:
        output_bytes, output_encoding = _windows_encode_string(output_string)
    else:
        output_bytes = output_string.encode("utf-8")
        output_encoding = "utf-8"
    return output_bytes, output_encoding

for test_string in test_strings:
    print()
    output_bytes, output_encoding = _encode_string(test_string)
    print(f"output string: {test_string}")
    print(f"output encoding: {output_encoding}")
    print(f"output bytes: {output_bytes}")
Output on windows/cp1252:
locale.getpreferredencoding()='cp1252'
output string: test
output encoding: utf-8
output bytes: b'test'
Output on windows/cp936:
locale.getpreferredencoding()='cp936'
output string: test
output encoding: utf-8
output bytes: b'test'
output string: 测试
output encoding: cp936
output bytes: b'\xb2\xe2\xca\xd4'
The algorithm would have to be reviewed in the face of unicode encode/decode exceptions.
We've had similar issues in other corners of SCons before, so I think we can avoid multi-locale testing and trust our users to let us know (as just happened) and work with them to resolve. utf-8 seems to cover the vast majority of our users.
Isn't this about changing the temp file contents encoding from utf-8 to the locale default encoding (i.e., moving away from utf-8)?
That would seem to create a potential issue with external tool consumers of the temp file that may be expecting the temp file to be utf-8 which it won't be anymore.
Not arguing, just trying to understand.
On my system (macOS) this yields utf-8; what does it yield on your system? So on my system, there's no change.
Windows: locale.getpreferredencoding()='cp1252'
I think we could allow users to choose the encoding used for writing the tmp file by specifying a parameter or a specific environment variable. Alternatively, we could determine the encoding automatically (e.g., handling gcc as a special case).
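To make that suggestion concrete, a rough sketch of encoding selection with an override knob (the SCONS_TEMPFILE_ENCODING variable name here is hypothetical, not an existing SCons setting):

```python
import locale
import os

def tempfile_encoding(environ=os.environ):
    """Pick the response-file encoding: an explicit user override wins,
    otherwise fall back to the system's preferred encoding."""
    override = environ.get("SCONS_TEMPFILE_ENCODING")  # hypothetical knob
    if override:
        return override
    return locale.getpreferredencoding(False)

# Example: forcing gbk regardless of the host locale.
print(tempfile_encoding({"SCONS_TEMPFILE_ENCODING": "gbk"}))  # gbk
```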
I pushed a WIP PR to see how this change fares on all our CI testing platforms.
Additional research notes:
In a test VM with language pack Chinese(Simplified, China) installed and set as the current locale, the active code page is cp936.
With an example similar to the one provided above, compiling "hello world" via a response file works with a cp936-compatible encoding and fails with a utf-8 encoded file (as reported above).
Enabling "Beta: Use Unicode UTF-8 for worldwide language support" in the settings for "Current language for non-unicode programs" and rebooting changes the active code page to 65001 (from cp936).
The "hello world" program compiles with a utf-8 encoded response file.
Instructions for enabling:
Enabling UTF-8 Support: To enable UTF-8 support for applications, you need to change the system locale settings. This can be done by going to Settings > Time & Language > Language & Region > Administrative language settings > Change system locale and checking the "Beta: Use Unicode UTF-8 for worldwide language support" box, according to Microsoft's documentation.
Windows Settings
-> Time & Language
-> Language & Region
-> Administrative language settings
-> Administrative Tab
-> Current language for non-unicode programs
-> Change system locale button
-> Region Settings pop-up window
-> Check "Beta: Use Unicode UTF-8 for worldwide language support"
restart computer
open cmd window
chcp
This might work for SCons with the embedded arm compiler. Then again, it might not.
Warning: this could break other program usage.
I pushed a WIP PR to see how this change fares on all our CI testing platforms.
Not being argumentative, simply listing some questions to ponder.
Is passing all of the tests tantamount to "failing to reject" the hypothesis that it is working?
For example, does the test suite actually cover the issue at hand with regard to response files, encodings, and paths containing characters outside of the normal ascii range?
For example, paths with directories or source files containing umlaut characters (character codes 228, 246, 252, 196, 214, 220). I believe German also uses code page cp1252 in Windows. Any of these characters would have different encodings in cp1252 and utf-8.
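The umlaut characters mentioned above do encode differently: one byte each in cp1252, two bytes each in utf-8. A quick check:

```python
# ä ö ü Ä Ö Ü: single bytes in cp1252, two-byte sequences in utf-8
for ch in "äöüÄÖÜ":
    cp1252 = ch.encode("cp1252")
    utf8 = ch.encode("utf-8")
    print(f"{ch}: cp1252={cp1252} utf-8={utf8}")
```

So a German-locale path containing any of these would hit the same mismatch if the response file were written in utf-8 and read with the locale code page.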
Are there any currently supported tools known to expect the response file in utf-8?
Do mingw-w64 gcc tools behave differently than the embedded arm gcc tools with regards to response files and characters in the range 128-255?