`<filesystem>`: MSVC's `path` conversion from wide to narrow throws an exception (`fs::path(L"要").string()`)

Open stevewgr opened this issue 1 year ago • 7 comments

Describe the bug

MSVC's std::filesystem::path conversion from wide to narrow crashes with the following snippet:

#include <iostream>
#include <filesystem>
int main() {
    std::filesystem::path(L"要らない.exe").string();
    std::cout << "Hello World!\n";
}

I also tried in a fresh sandbox with a fresh installation of the latest VS Community, and the crash happens on all builds (x86|x64, Debug|Release): Microsoft C++ exception: std::system_error at memory location

Also tried compiling with and without unicode enabled. Same behavior.

Expected behavior

The conversion should succeed and not crash the program.

STL version

Microsoft Visual Studio Community 2022 (64-bit) - Current Version 17.11.1

stevewgr avatar Nov 17 '24 06:11 stevewgr

The error code seems to be the following.

ERROR_NO_UNICODE_TRANSLATION

1113 (0x459)

No mapping for the Unicode character exists in the target multi-byte code page.

The exception can be avoided when the source file is encoded in UTF-8 and the program is compiled with /utf-8.

frederick-vs-ja avatar Nov 17 '24 15:11 frederick-vs-ja

Works fine for me on my "Beta: Use UTF-8 for language support" machine, and the same for Compiler Explorer (https://www.godbolt.org/z/oGMbo3YE1). The problem is most likely:

  1. The compiler and editor have differing notions of the source encoding, so the compiler sees a gibberish string in the source file. Using a pure-ascii encoding of the string literal (L"\u8981\u3089\u306a\u3044.exe") will avoid this.
  2. The active codepage (the narrow encoding the win32 APIs and therefore path uses at runtime) can't represent 要らない so the transcoding in path::string fails (this is the error @frederick-vs-ja refers to above).
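Both points can be sketched in code. This is a minimal sketch, not the library's prescribed approach: `try_narrow` and `sample_path` are hypothetical names introduced here for illustration. The literal is spelled with universal-character-names so the source encoding no longer matters, and the `std::system_error` thrown by `path::string()` is caught rather than left to terminate the program.

```cpp
#include <filesystem>
#include <optional>
#include <string>
#include <system_error>

// Hypothetical helper (not part of <filesystem>): attempts the narrow
// conversion and reports failure as an empty optional instead of letting
// std::system_error propagate.
std::optional<std::string> try_narrow(const std::filesystem::path& p) {
    try {
        return p.string();
    } catch (const std::system_error&) {
        // On MSVC this is where ERROR_NO_UNICODE_TRANSLATION would surface
        // when the active code page cannot represent the characters.
        return std::nullopt;
    }
}

// The file name from the report, written with universal-character-names so
// the compiler sees the same characters regardless of source encoding.
std::filesystem::path sample_path() {
    return std::filesystem::path{L"\u8981\u3089\u306a\u3044.exe"};
}
```

Calling `try_narrow(sample_path())` then distinguishes "the active code page cannot represent this name" from a genuine crash; whether it returns a value depends on the machine's code page settings.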

CaseyCarter avatar Nov 17 '24 18:11 CaseyCarter

The error code seems to be the following.

ERROR_NO_UNICODE_TRANSLATION 1113 (0x459) No mapping for the Unicode character exists in the target multi-byte code page.

The exception can be avoided when the source file is encoded in UTF-8 and the program is compiled with /utf-8.

Yes, I already did that and could still reproduce it. Try disabling the beta feature in your Region system settings and then restart your computer; you'll be able to reproduce it. I tried with the file encoded as UTF-8 both with and without a BOM. I also tried the /utf-8 compiler flag, and even explicitly defining the code page for Korean characters with /source-charset:utf-8 /execution-charset:.949, based on the docs: https://learn.microsoft.com/en-us/cpp/build/reference/utf-8-set-source-and-executable-character-sets-to-utf-8?view=msvc-170

stevewgr avatar Nov 17 '24 23:11 stevewgr

Works fine for me on my "Beta: Use UTF-8 for language support" machine, and the same for Compiler Explorer (https://www.godbolt.org/z/oGMbo3YE1).

Of course that works, but unfortunately that's not a solution I can instruct the consumers of my application to use. I tried that on Godbolt before submitting the ticket and it indeed worked. I believe (not sure) the reason is that they don't run on native Windows machines; it might be some bootstrapped / dockerized system, or it may already have UTF-8 for language support enabled. I remember Matt Godbolt talking about some of these challenges with MSVC in one of his talks. The setting can also be enabled via PowerShell:

Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' -Name 'ACP' -Value '65001'
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' -Name 'OEMCP' -Value '65001'
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' -Name 'MACCP' -Value '65001'

Here he talks further about some of the changes they made: https://xania.org/202407/msvc-on-ce Since it was a recent article, who knows, maybe now they actually use wine on Linux.

stevewgr avatar Nov 17 '24 23:11 stevewgr

I'm not sure there's a bug here. If the library is throwing ERROR_NO_UNICODE_TRANSLATION to tell you that there are characters in the path that can't be represented in the active codepage, that's not a crash but expected behavior. If you believe the library is incorrect we need some more information to reproduce the problem. What is int(__std_fs_code_page())? What is the active code page in the console when the program runs?
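For gathering that information, a rough sketch along these lines might help. It uses only the documented Win32 calls `GetACP`, `GetOEMCP`, and `AreFileApisANSI`; the function name `describe_code_pages` is made up for this example, and which of these values `path::string()` actually consults is an implementation detail, so `int(__std_fs_code_page())` remains the more direct answer.

```cpp
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical diagnostic helper: summarizes the process-wide code page
// settings that are plausibly relevant to narrow path conversion.
std::string describe_code_pages() {
#ifdef _WIN32
    std::string out;
    out += "ACP=" + std::to_string(GetACP());      // active ("ANSI") code page
    out += " OEMCP=" + std::to_string(GetOEMCP()); // OEM code page
    out += AreFileApisANSI() ? " file-APIs=ANSI" : " file-APIs=OEM";
    return out;
#else
    // Assumption: on non-Windows platforms there is nothing comparable to report.
    return "not running on Windows";
#endif
}
```

Printing the result of `describe_code_pages()` at startup, next to the failing path, would show whether the process is running with code page 65001 (UTF-8) or a legacy code page.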

CaseyCarter avatar Nov 18 '24 05:11 CaseyCarter

We talked about this at the weekly maintainer meeting - I agree with Casey that this sounds by design but we need more info.

The filesystem codepage, the source character set (which is a non-issue if you use universal-character-names for your repro), and the execution character set (the last two are controlled by /utf-8 which we strongly recommend), are relevant here. Casey and I now believe that the console code page is not relevant (the repro doesn't attempt to write Unicode to the console, and if it had to for diagnostic purposes, compiling with /utf-8 and using <print> would write Unicode without introducing wacky questions about the console code page).

StephanTLavavej avatar Nov 20 '24 22:11 StephanTLavavej

Having the same issue trying to iterate over a directory containing files with UTF8 names.

#include <filesystem>
#include <iostream>

int main(int argc, const char** argv) {
    const auto* dir = argc > 1 ? argv[1] : ".";
    auto it = std::filesystem::directory_iterator{ dir };
    for (const auto& entry : it) {
        std::cout << entry << "\n";
    }
}

> mkdir norepro
> '' > norepro/yellow.txt
> main.exe norepro
norepro\yellow.txt
> mkdir repro
> '' > repro/żółć.txt
> main.exe repro
# crash

Works fine on Linux, and also works on Windows when using MinGW (save for corrupted output).

Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' -Name 'ACP' -Value '65001'
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' -Name 'OEMCP' -Value '65001'
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' -Name 'MACCP' -Value '65001'

This fixes the crash but still leaves the output corrupted; fixing that requires calling SetConsoleOutputCP(CP_UTF8) or adding -utf-8 to the compile flags.
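A possible way to avoid the throwing conversion without changing system settings is to skip the narrow code page entirely and ask for UTF-8 via `path::u8string()`. The sketch below is an assumption-laden illustration, not a confirmed fix from the maintainers: `utf8_bytes` and `list_names_utf8` are hypothetical helper names, and whether the console then renders the bytes correctly still depends on the console code page.

```cpp
#include <filesystem>
#include <string>
#include <vector>

// Hypothetical helper: returns the path as UTF-8 bytes in a std::string.
// u8string() yields std::string in C++17 and std::u8string in C++20, so the
// bytes are copied through an iterator range to work with both.
std::string utf8_bytes(const std::filesystem::path& p) {
    const auto u8 = p.u8string();
    return std::string(u8.begin(), u8.end());
}

// Collects the file names in a directory as UTF-8, without going through
// the code-page-dependent conversion used by path::string().
std::vector<std::string> list_names_utf8(const std::filesystem::path& dir) {
    std::vector<std::string> names;
    for (const auto& entry : std::filesystem::directory_iterator{dir}) {
        names.push_back(utf8_bytes(entry.path().filename()));
    }
    return names;
}
```

In the repro above, the loop body would then write each element of `list_names_utf8(dir)` to `std::cout` instead of streaming `entry` directly.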

dkaszews avatar Nov 12 '25 10:11 dkaszews