`<filesystem>`: MSVC's `path` conversion from wide to narrow throws an exception (`fs::path(L"要").string()`)
Describe the bug
MSVC's std::filesystem::path conversion from wide to narrow crashes with the following snippet:
#include <iostream>
#include <filesystem>
int main() {
std::filesystem::path(L"要らない.exe").string();
std::cout << "Hello World!\n";
}
I also tried in a fresh sandbox with a fresh installation of the latest VS Community, and the crash happens on all builds (x86|x64, Debug|Release):
Microsoft C++ exception: std::system_error at memory location
Also tried compiling with and without unicode enabled. Same behavior.
Expected behavior
The conversion should succeed and not crash the program.
STL version
Microsoft Visual Studio Community 2022 (64-bit) - Current Version 17.11.1
The error code seems to be the following.
ERROR_NO_UNICODE_TRANSLATION 1113 (0x459)
No mapping for the Unicode character exists in the target multi-byte code page.
The exception can be avoided when the source file is encoded in UTF-8 and the program is compiled with /utf-8.
Works fine for me on my "Beta: Use UTF-8 for language support" machine, and the same for Compiler Explorer (https://www.godbolt.org/z/oGMbo3YE1). The problem is most likely:
- The compiler and editor have differing notions of the source encoding, so the compiler sees a gibberish string in the source file. Using a pure-ASCII encoding of the string literal (`L"\u8981\u3089\u306a\u3044.exe"`) will avoid this.
- The active codepage (the narrow encoding the Win32 APIs and therefore `path` uses at runtime) can't represent `要らない` so the transcoding in `path::string` fails (this is the error @frederick-vs-ja refers to above).
The error code seems to be the following.
ERROR_NO_UNICODE_TRANSLATION 1113 (0x459)
No mapping for the Unicode character exists in the target multi-byte code page.
The exception can be avoided when the source file is encoded in UTF-8 and the program is compiled with /utf-8.
Yes, I already did that and could still reproduce it. Try disabling the beta feature in your Region system settings and then restart your computer; you'll be able to reproduce it. I tried with the file encoded as UTF-8 both with and without a BOM. I also tried the /utf-8 compiler flag, and even explicitly defining the codepage for Korean characters with /source-charset:utf-8 /execution-charset:.949, based on the docs: https://learn.microsoft.com/en-us/cpp/build/reference/utf-8-set-source-and-executable-character-sets-to-utf-8?view=msvc-170
Works fine for me on my "Beta: Use UTF-8 for language support" machine, and the same for Compiler Explorer (https://www.godbolt.org/z/oGMbo3YE1).
Of course that works, but that's unfortunately not a solution I can instruct the consumers of my application to use. I tried that on Godbolt before submitting the ticket and it indeed worked. I believe (not sure) the reason is that they don't use native Windows machines; it might be some bootstrapped / dockerized system, or they may already have UTF-8 for language support enabled. I remember Matt Godbolt talking about some of these challenges with MSVC in one of his talks. Enabling it can also be done via PowerShell:
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' -Name 'ACP' -Value '65001'
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' -Name 'OEMCP' -Value '65001'
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' -Name 'MACCP' -Value '65001'
Here he talks further about some of the changes they made: https://xania.org/202407/msvc-on-ce Since it was a recent article, who knows, maybe now they actually use wine on Linux.
I'm not sure there's a bug here. If the library is throwing ERROR_NO_UNICODE_TRANSLATION to tell you that there are characters in the path that can't be represented in the active codepage, that's not a crash but expected behavior. If you believe the library is incorrect we need some more information to reproduce the problem. What is int(__std_fs_code_page())? What is the active code page in the console when the program runs?
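One possible way to gather the requested information is sketched below. `__std_fs_code_page()` is an MSVC-internal function and is deliberately not called here; the sketch reports the Win32 values from `GetACP()` and `GetConsoleOutputCP()` instead, on the assumption that the filesystem code page follows the ANSI code page, which may not hold in every configuration. The helper name `describe_code_pages` is hypothetical.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: describes the process's ANSI code page and the
// console output code page. Outside Windows both are reported as "n/a".
std::string describe_code_pages() {
#ifdef _WIN32
    return "ACP=" + std::to_string(GetACP()) +
           " ConsoleOutputCP=" + std::to_string(GetConsoleOutputCP());
#else
    return "ACP=n/a ConsoleOutputCP=n/a";
#endif
}
```

Printing the result of `describe_code_pages()` at the start of the repro would show whether the process runs with code page 65001 (UTF-8) or with a legacy code page.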
We talked about this at the weekly maintainer meeting - I agree with Casey that this sounds by design but we need more info.
The filesystem codepage, the source character set (which is a non-issue if you use universal-character-names for your repro), and the execution character set (the last two are controlled by /utf-8 which we strongly recommend), are relevant here. Casey and I now believe that the console code page is not relevant (the repro doesn't attempt to write Unicode to the console, and if it had to for diagnostic purposes, compiling with /utf-8 and using <print> would write Unicode without introducing wacky questions about the console code page).
Having the same issue trying to iterate over a directory containing files with UTF8 names.
#include <filesystem>
#include <iostream>
int main(int argc, const char** argv) {
const auto* dir = argc > 1 ? argv[1] : ".";
auto it = std::filesystem::directory_iterator{ dir };
for (const auto& entry : it) {
std::cout << entry << "\n";
}
}
> mkdir norepro
> '' > norepro/yellow.txt
> main.exe norepro
norepro\yellow.txt
> mkdir repro
> '' > repro/żółć.txt
> main.exe repro
# crash
Works fine on Linux, also works on Windows when using MINGW (save for corrupted output).
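The crash in the listing above most likely comes from `std::cout << entry`, which streams the path through the same narrow conversion as `path::string()`. A hedged sketch of a workaround is to take the UTF-8 form of each path instead, which does not depend on the active code page. The helper names `to_utf8_bytes` and `list_directory_utf8` are hypothetical.

```cpp
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: the path as UTF-8 bytes in a plain std::string.
// u8string() returns std::string in C++17 and std::u8string in C++20,
// so the bytes are copied through a pointer cast that works for both.
std::string to_utf8_bytes(const fs::path& p) {
    const auto u8 = p.u8string();
    return std::string(reinterpret_cast<const char*>(u8.data()), u8.size());
}

// Hypothetical helper: collects the entries of `dir` as UTF-8 strings
// without going through path::string() or operator<<.
std::vector<std::string> list_directory_utf8(const fs::path& dir) {
    std::vector<std::string> names;
    for (const auto& entry : fs::directory_iterator{dir}) {
        names.push_back(to_utf8_bytes(entry.path()));
    }
    return names;
}
```

Writing these bytes to `std::cout` still only displays correctly when the console interprets its output as UTF-8.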
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' -Name 'ACP' -Value '65001'
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' -Name 'OEMCP' -Value '65001'
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' -Name 'MACCP' -Value '65001'
This fixes the crash but still leaves the output corrupted; fixing that requires calling SetConsoleOutputCP(CP_UTF8) or adding -utf-8 to the compile flags.