STL
STL copied to clipboard
<filesystem>: prevent filesystem::path dangerous conversions to/from default code page encoding
Hello.
I would like to ask for a way to disable implicit conversions of const char* to std::filesystem::path using the default 8bit code page encoding and vice versa. If we had to switch from boost::filesystem to std::filesystem, these innocently looking conversions would be a major mine field to us. Indeed, we have had the same problem for years in our product using the C++ multi platform library wxWidgets, where the wxString class behaves the same way on Windows as now the std::filesystem::path in regard to default conversion to/from const char*, and the team members have to be mindful to not trigger these harmful conversions.
I understand the decision to use 16bit characters on Windows for performance reasons. What I do not understand is the decision to convert to / from a local code page by default in the year 2020. It would be great if one had a way to switch the default conversion to UTF-8, either in runtime or by a preprocessor macro. AFAIK the only way to convince std::filesystem to convert to/from UTF-8 by default is to switch the whole Windows 10 to an experimental UTF-8 code page, which is not an option for a production code really, which shall run on Windows 7 and older Windows 10.
I also understand that it is difficult to change the std::filesystem API now as it has already been standardized. I would therefore like to ask for some way to trigger compiler warnings or errors when these default from / to local code page conversions are triggered. Without that we would most likely not switch from boost::filesystem to std::filesystem, as such move would be too risky.
Thank you for consideration, Vojtech Bubnik, Prusa Research
If you want UTF-8, simply use function std::filesystem::u8path
. If you also need C++20, use char8_t
overloads.
If you want UTF-8 with char
and C++20, disable char8_t
by /Zc:char8_t-
after /std:c++latest
, still use u8path
, and disable some deprecated warnings.
@Berrysoft I understand that we may use the u8path conversion. However developers will forget and often we will find that out after the release, when someone will report that some file operation somewhere does not work with localized paths.
However developers will forget and often we will find that out after the release...
C++ is not a language that will avoid developers' mistake, and also it should not limit the developers' behavior.
Yes, if the standard library uses UTF-8 everywhere, the developers may face less problems, but they may also come across many other problems. Maybe there's a program which should read from a text file in the same code page with the system, and thus the program could not use UTF-8 all the time, because the user won't care about the encoding of input.
In addition, STL is based on CRT of Windows, in which all functions uses the current code page, and thus all the functions of CRT and STL assume that const char*
is a string in current code page. If only the constructor of std::filesystem::path
is changed, it will cause chaos.
How exactly does Boost.Filesystem prevent such conversions? According to my understanding, it behaves the same as std::filesystem
with respect to char
(although it predated char8_t
).
For an error, this is a request for a non-Standard extension, which is explicitly a non-goal of our project. For a warning, this is a request to warn about Standard-conforming usage, which we don't warn about as a general rule. (We emit deprecation warnings for ISO-deprecated features, and non-Standard features that we're trying to get rid of, but otherwise we don't emit warnings from the library for Standard usage. One reason is that we don't have a mechanism for emitting such warnings other than deprecation; another is that the library has a limited local view of the program, instead of the more global view that the compiler and/or code analysis enjoy.) If you want to ban usage of char
in your codebase through code review, you can do that. From the STL's perspective, constructing a path
from const char*
is an entirely reasonable operation (also lossless, as we can convert that to UTF-16). Converting to std::string
is lossy, but if the user has explicitly requested that, we assume that they know that the path can be converted to narrow characters.
I'm inclined to close this as wontfix
but we'll see what the other maintainers think.
How exactly does Boost.Filesystem prevent such conversions?
As far as I know, there's a method imbue
to change the locale of path class, which could be used for casting from multi-byte string to wide string.
std::filesystem::path
permits construction from a source string and a locale. (This doesn't affect conversions from the native wchar_t
to char
, if requested.)
Stephan,
I really appreciate you guys are listening.
How exactly does Boost.Filesystem prevent such conversions? According to my understanding, it behaves the same as std::filesystem with respect to char (although it predated char8_t).
As @Berrysoft noted, we call imbue to get utf8 behavior on boost::filesystem::path. I understand that it leads to costly conversions between utf8 and wide characters on Windows, but for our purpose the costly conversion is much cheaper than bug squashing of localization issues. https://github.com/prusa3d/PrusaSlicer/blob/d6e040c282a9992d4cfebe85b3a931a43b2773d0/src/boost/nowide/integration/filesystem.hpp#L20
For an error, this is a request for a non-Standard extension, which is explicitly a non-goal of our project. For a warning, this is a request to warn about Standard-conforming usage, which we don't warn about as a general rule. (We emit deprecation warnings for ISO-deprecated features, and non-Standard features that we're trying to get rid of, but otherwise we don't emit warnings from the library for Standard usage.
I am an application programmer and a computational geometer by a trade. I am more a hands on person than a purist. Being a Linux militant evangelist during my studies, then developing Windows applications for 15 years, then multi-platform applications for another 4 years, I was burned by internationalization issues many times. Frankly I cannot imagine using a local code page API for anything else than a quick prototype. For anything else using a local code page API on Windows is IMHO almost always an error even for an in-house product in this connected world: Somebody somewhere will have her/his Windows set to one language and type file names and paths in another language.
One reason is that we don't have a mechanism for emitting such warnings other than deprecation; another is that the library has a limited local view of the program, instead of the more global view that the compiler and/or code analysis enjoy.) If you want to ban usage of char in your codebase through code review, you can do that.
Complete code review is something that you guys may afford at Microsoft, but a small software house cannot. Also it is too easy to overlook these kind of bugs. We need an automatic solution.
From the STL's perspective, constructing a path from const char* is an entirely reasonable operation (also lossless, as we can convert that to UTF-16). Converting to std::string is lossy, but if the user has explicitly requested that, we assume that they know that the path can be converted to narrow characters.
Again, your statement is the purist point of view. In the real world, if the default conversions do what you almost always don't want to do and these default conversions are hidden behind innocently looking methods, it is an invitation to errors. At the current state of affairs, I can only see myself using std::filesystem on Linux only: Mine field on Windows, and the default OSX compiler links against stdc++ library, which is shipped with the system, thus starting to support std::filesystem with the latest OSX Catalina.
If I had to apply std::filesystem to our multi-platform Windows project which uses UTF-8 internally, I would most likely have to hack your stdlib source code to emit errors on local code page conversions, at least on the build server.
Vojtech
Well, if you don't need other code pages, don't want to change or review codes, and need filesystem
on all platforms, I suggest you still using boost::filesystem
.
Or do another hack: write your own filesystem
based on std::filesystem
. Just write a path
class inherits std::filesystem::path
, and rewrite methods related to std::string
and std::string_view
. Then write other classes and functions accepting the new path
, and forward it to the original ones. It's somehow dirty, but works.
write setlocale(LC_ALL, ".UTF-8")
then the char string for constructing path
will be treated as utf8 encoding.
it seem only usable in vs2019
and such behavior have not been documented
see #469
@DailyShana
write setlocale(LC_ALL, ".UTF-8")
Thanks for the hint. Does it work on Windows 7 as well? I am not sure how the other libraries would handle it though.
I tested. It works on Windows 7
#include <stdio.h>
#include <locale>
#include <filesystem>
using namespace std::filesystem;
int main()
{
setlocale(LC_ALL, ".UTF-8");
path p{ "中文" };
printf(p.u8string().c_str());
wprintf(p.c_str());
return 0;
}
output
中文中文
We talked about this, and we'll review a PR to add such a warning, off-by-default, through the deprecated
attribute mechanism (properly phrased as this is neither an ISO nor a Microsoft deprecation).
Agree with OP. A warning (and optionally error) would be nice.
Can confirm setlocale(LC_ALL, ".UTF-8")
works.
Have also found Use UTF-8 code pages in Windows apps which also makes std::filesystem::path()
accept UTF-8 strings and std::filesystem::path::string()
output UTF-8 strings, but also makes argv, envp and all A- Win32 API calls use UTF-8. This requires Windows 10, Version 1903. It's also the default on Windows 11 through the Beta: Use Unicode UTF-8 for worldwide language support
setting (the setting is off by default on Windows 10).
We talked about this, and we'll review a PR to add such a warning, off-by-default, through the
deprecated
attribute mechanism (properly phrased as this is neither an ISO nor a Microsoft deprecation).
I am wondering how I can turn this one on because I don't see anything related to this inside the code.
Besides that, it is still weird, that a valid std::filesystem::path can throw when calling string(), which should be one of the most innocent functions.
Besides that, it is still weird, that a valid std::filesystem::path can throw when calling string(), which should be one of the most innocent functions.
Generally, allocation is needed for copying the internally stored string into the return value, so we can't avoid potential throwing.
I am wondering how I can turn this one on because I don't see anything related to this inside the code.
Perhaps it's suitable to conditionally deprecate these constructors: https://github.com/microsoft/STL/blob/8eb3a067912f1362417c725ff679dda463f44087/stl/inc/filesystem#L648-L674
The mechanism for conditional deprecation can be found in #634.
What is the state on the planned compiler warning that was mentioned before?
We are also seeing this problem on a large multi-platform code base and it is a showstopper for the migration from Boost.Filesystem to the Standard version.
It is impossible to manually track down all character literals that may end up as std::filesystem::path instances. This implicit conversion is guaranteed to introduce bugs that are hard to track down.
A compiler warning would really be a powerful mitigation of the problem.