STL icon indicating copy to clipboard operation
STL copied to clipboard

<filesystem>: prevent filesystem::path dangerous conversions to/from default code page encoding

Open bubnikv opened this issue 4 years ago • 16 comments

Hello.

I would like to ask for a way to disable implicit conversions of const char* to std::filesystem::path using the default 8bit code page encoding and vice versa. If we had to switch from boost::filesystem to std::filesystem, these innocently looking conversions would be a major mine field to us. Indeed, we have had the same problem for years in our product using the C++ multi platform library wxWidgets, where the wxString class behaves the same way on Windows as now the std::filesystem::path in regard to default conversion to/from const char*, and the team members have to be mindful to not trigger these harmful conversions.

I understand the decision to use 16bit characters on Windows for performance reasons. What I do not understand is the decision to convert to / from a local code page by default in the year 2020. It would be great if one had a way to switch the default conversion to UTF-8, either in runtime or by a preprocessor macro. AFAIK the only way to convince std::filesystem to convert to/from UTF-8 by default is to switch the whole Windows 10 to an experimental UTF-8 code page, which is not an option for a production code really, which shall run on Windows 7 and older Windows 10.

I also understand that it is difficult to change the std::filesystem API now as it has already been standardized. I would therefore like to ask for some way to trigger compiler warnings or errors when these default from / to local code page conversions are triggered. Without that we would most likely not switch from boost::filesystem to std::filesystem, as such move would be too risky.

Thank you for consideration, Vojtech Bubnik, Prusa Research

bubnikv avatar Jun 18 '20 15:06 bubnikv

If you want UTF-8, simply use function std::filesystem::u8path. If you also need C++20, use char8_t overloads.

If you want UTF-8 with char and C++20, disable char8_t by /Zc:char8_t- after /std:c++latest, still use u8path, and disable some deprecated warnings.

Berrysoft avatar Jun 19 '20 07:06 Berrysoft

@Berrysoft I understand that we may use the u8path conversion. However developers will forget and often we will find that out after the release, when someone will report that some file operation somewhere does not work with localized paths.

bubnikv avatar Jun 19 '20 12:06 bubnikv

However developers will forget and often we will find that out after the release...

C++ is not a language that will avoid developers' mistake, and also it should not limit the developers' behavior.

Yes, if the standard library uses UTF-8 everywhere, the developers may face less problems, but they may also come across many other problems. Maybe there's a program which should read from a text file in the same code page with the system, and thus the program could not use UTF-8 all the time, because the user won't care about the encoding of input.

In addition, STL is based on CRT of Windows, in which all functions uses the current code page, and thus all the functions of CRT and STL assume that const char* is a string in current code page. If only the constructor of std::filesystem::path is changed, it will cause chaos.

Berrysoft avatar Jun 19 '20 13:06 Berrysoft

How exactly does Boost.Filesystem prevent such conversions? According to my understanding, it behaves the same as std::filesystem with respect to char (although it predated char8_t).

For an error, this is a request for a non-Standard extension, which is explicitly a non-goal of our project. For a warning, this is a request to warn about Standard-conforming usage, which we don't warn about as a general rule. (We emit deprecation warnings for ISO-deprecated features, and non-Standard features that we're trying to get rid of, but otherwise we don't emit warnings from the library for Standard usage. One reason is that we don't have a mechanism for emitting such warnings other than deprecation; another is that the library has a limited local view of the program, instead of the more global view that the compiler and/or code analysis enjoy.) If you want to ban usage of char in your codebase through code review, you can do that. From the STL's perspective, constructing a path from const char* is an entirely reasonable operation (also lossless, as we can convert that to UTF-16). Converting to std::string is lossy, but if the user has explicitly requested that, we assume that they know that the path can be converted to narrow characters.

I'm inclined to close this as wontfix but we'll see what the other maintainers think.

StephanTLavavej avatar Jun 21 '20 02:06 StephanTLavavej

How exactly does Boost.Filesystem prevent such conversions?

As far as I know, there's a method imbue to change the locale of path class, which could be used for casting from multi-byte string to wide string.

Berrysoft avatar Jun 21 '20 02:06 Berrysoft

std::filesystem::path permits construction from a source string and a locale. (This doesn't affect conversions from the native wchar_t to char, if requested.)

StephanTLavavej avatar Jun 21 '20 02:06 StephanTLavavej

Stephan,

I really appreciate you guys are listening.

How exactly does Boost.Filesystem prevent such conversions? According to my understanding, it behaves the same as std::filesystem with respect to char (although it predated char8_t).

As @Berrysoft noted, we call imbue to get utf8 behavior on boost::filesystem::path. I understand that it leads to costly conversions between utf8 and wide characters on Windows, but for our purpose the costly conversion is much cheaper than bug squashing of localization issues. https://github.com/prusa3d/PrusaSlicer/blob/d6e040c282a9992d4cfebe85b3a931a43b2773d0/src/boost/nowide/integration/filesystem.hpp#L20

For an error, this is a request for a non-Standard extension, which is explicitly a non-goal of our project. For a warning, this is a request to warn about Standard-conforming usage, which we don't warn about as a general rule. (We emit deprecation warnings for ISO-deprecated features, and non-Standard features that we're trying to get rid of, but otherwise we don't emit warnings from the library for Standard usage.

I am an application programmer and a computational geometer by a trade. I am more a hands on person than a purist. Being a Linux militant evangelist during my studies, then developing Windows applications for 15 years, then multi-platform applications for another 4 years, I was burned by internationalization issues many times. Frankly I cannot imagine using a local code page API for anything else than a quick prototype. For anything else using a local code page API on Windows is IMHO almost always an error even for an in-house product in this connected world: Somebody somewhere will have her/his Windows set to one language and type file names and paths in another language.

One reason is that we don't have a mechanism for emitting such warnings other than deprecation; another is that the library has a limited local view of the program, instead of the more global view that the compiler and/or code analysis enjoy.) If you want to ban usage of char in your codebase through code review, you can do that.

Complete code review is something that you guys may afford at Microsoft, but a small software house cannot. Also it is too easy to overlook these kind of bugs. We need an automatic solution.

From the STL's perspective, constructing a path from const char* is an entirely reasonable operation (also lossless, as we can convert that to UTF-16). Converting to std::string is lossy, but if the user has explicitly requested that, we assume that they know that the path can be converted to narrow characters.

Again, your statement is the purist point of view. In the real world, if the default conversions do what you almost always don't want to do and these default conversions are hidden behind innocently looking methods, it is an invitation to errors. At the current state of affairs, I can only see myself using std::filesystem on Linux only: Mine field on Windows, and the default OSX compiler links against stdc++ library, which is shipped with the system, thus starting to support std::filesystem with the latest OSX Catalina.

If I had to apply std::filesystem to our multi-platform Windows project which uses UTF-8 internally, I would most likely have to hack your stdlib source code to emit errors on local code page conversions, at least on the build server.

Vojtech

bubnikv avatar Jun 22 '20 07:06 bubnikv

Well, if you don't need other code pages, don't want to change or review codes, and need filesystem on all platforms, I suggest you still using boost::filesystem.

Or do another hack: write your own filesystem based on std::filesystem. Just write a path class inherits std::filesystem::path, and rewrite methods related to std::string and std::string_view. Then write other classes and functions accepting the new path, and forward it to the original ones. It's somehow dirty, but works.

Berrysoft avatar Jun 22 '20 13:06 Berrysoft

write setlocale(LC_ALL, ".UTF-8") then the char string for constructing path will be treated as utf8 encoding. it seem only usable in vs2019 and such behavior have not been documented see #469

DailyShana avatar Jun 23 '20 14:06 DailyShana

@DailyShana

write setlocale(LC_ALL, ".UTF-8")

Thanks for the hint. Does it work on Windows 7 as well? I am not sure how the other libraries would handle it though.

bubnikv avatar Jun 23 '20 15:06 bubnikv

I tested. It works on Windows 7

#include <stdio.h>
#include <locale>
#include <filesystem>

using namespace std::filesystem;

int main()
{
    setlocale(LC_ALL, ".UTF-8");
    path p{ "中文" };
    printf(p.u8string().c_str());
    wprintf(p.c_str());
    return 0;
}

output

中文中文

DailyShana avatar Jun 24 '20 13:06 DailyShana

We talked about this, and we'll review a PR to add such a warning, off-by-default, through the deprecated attribute mechanism (properly phrased as this is neither an ISO nor a Microsoft deprecation).

StephanTLavavej avatar Jun 24 '20 21:06 StephanTLavavej

Agree with OP. A warning (and optionally error) would be nice.

Can confirm setlocale(LC_ALL, ".UTF-8") works.

Have also found Use UTF-8 code pages in Windows apps which also makes std::filesystem::path() accept UTF-8 strings and std::filesystem::path::string() output UTF-8 strings, but also makes argv, envp and all A- Win32 API calls use UTF-8. This requires Windows 10, Version 1903. It's also the default on Windows 11 through the Beta: Use Unicode UTF-8 for worldwide language support setting (the setting is off by default on Windows 10).

SephiRok avatar Jan 18 '22 20:01 SephiRok

We talked about this, and we'll review a PR to add such a warning, off-by-default, through the deprecated attribute mechanism (properly phrased as this is neither an ISO nor a Microsoft deprecation).

I am wondering how I can turn this one on because I don't see anything related to this inside the code.

Besides that, it is still weird, that a valid std::filesystem::path can throw when calling string(), which should be one of the most innocent functions.

Trafo avatar Mar 16 '23 06:03 Trafo

Besides that, it is still weird, that a valid std::filesystem::path can throw when calling string(), which should be one of the most innocent functions.

Generally, allocation is needed for copying the internally stored string into the return value, so we can't avoid potential throwing.


I am wondering how I can turn this one on because I don't see anything related to this inside the code.

Perhaps it's suitable to conditionally deprecate these constructors: https://github.com/microsoft/STL/blob/8eb3a067912f1362417c725ff679dda463f44087/stl/inc/filesystem#L648-L674

The mechanism for conditional deprecation can be found in #634.

frederick-vs-ja avatar Mar 16 '23 09:03 frederick-vs-ja

What is the state on the planned compiler warning that was mentioned before?

We are also seeing this problem on a large multi-platform code base and it is a showstopper for the migration from Boost.Filesystem to the Standard version.

It is impossible to manually track down all character literals that may end up as std::filesystem::path instances. This implicit conversion is guaranteed to introduce bugs that are hard to track down.

A compiler warning would really be a powerful mitigation of the problem.

pkl97 avatar Oct 28 '23 06:10 pkl97