exiv2
exiv2 copied to clipboard
What happened to wstring support in open() APIs?
Hi
I upgraded exiv2 to 0.28.0 just to find that std::wstring
support has been nuked out in 7933ff401ddb9768502c36b1febea06b854e176a but I could not find a matching PR, issue or at least a discussion that would explain why and what to do now.
std::string
just does not work in Windows for non-latin1 paths for Exiv2::ImageFactory::open()
(Visual Studio compiler).
What is the intended way to use it now?
These are all the old issues that reference the problem:
- https://dev.exiv2.org/issues/851
- https://github.com/Exiv2/exiv2/issues/679
I can confirm that digiKam 8.1.0 pre-release based on Exiv2 0.28 is broken under windows to read metadata from images.
https://bugs.kde.org/show_bug.cgi?id=469372 https://bugs.kde.org/show_bug.cgi?id=470566
The intended way to use this is to set UTF-8 as the code page. See: https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
edit: oh I see. digiKam is compiled with MSVCRT instead of UCRT. I don't think that can work.
ping @piponazo
No, digiKam is cross compiled under Windows using MXE (MinGW env - https://github.com/mxe/mxe) since 10 years now. The MSVC build is just another way to build the application in the KDE infra, it's not the official production release at all, because MSVC is unsafe,slow, and under-productive platform to compile huge code as digiKam, but it's another story...
Exiv2 team, please revert this commit to drop the UNICODE support under Windows. It's not at all a "dead" code... or explain well the work around. Note: I'm surprised to not see an unit-tests crying for the UTF16 long file path under Windows...
No, digiKam is cross compiled under Windows using MXE (MinGW env - https://github.com/mxe/mxe) since 10 years now. The MSVC build is just another way to build the application in the KDE infra, it's not the official production release at all, because MSVC is unsafe,slow, and under-productive platform to compile huge code as digiKam, but it's another story...
Exiv2 team, please revert this commit to drop the UNICODE support under Windows. It's not at all a "dead" code... or explain well the work around. Note: I'm surprised to not see an unit-tests crying for the UTF16 long file path under Windows...
Read what I said again. MSVCRT and UCRT
@neheb @piponazo What was the reason for deleting perfectly working code in https://github.com/Exiv2/exiv2/commit/7933ff401ddb9768502c36b1febea06b854e176a? I'd like to advocate for reverting this deletion. The removed code was functional and served a purpose, so it doesn't seem like dead code. It's been quite useful for me, and I believe it could continue to benefit other users as well.
The intended way to use this is to set UTF-8 as the code page. See: https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
I see that the current recommendation is to "set UTF-8 as the code page". It is a bit reckless to ask users of Exiv2 to do that. There are some problems with this approach. First of all, this can perfectly work in tests but there is a reality when you have 150+ external libraries in your project and neither of them mentions that it works flawlessly with relatively new Microsoft invention. There are also loads of your code that relies on assumption "it is not UTF-8 there but something else" you have to check to make sure nothing is broken by forcing UTF-8 code page.
The "UTF-8 as the code page" feature is available only since Windows 1803. For Windows 10 operating systems prior to 1803 (10.0.17134.0), only static linking of UCRT is supported.. Windows 10 version prior 1803 are still supported in LTSC channel and static linking of UCRT is discouraged by Microsoft for security reasons. I don't see a nice way to use 0.28 across all Windows 10 versions.
I'm sharing these insights in the spirit of maintaining the functionality and usability of Exiv2 across a wide range of scenarios. I hope we can discuss this further and consider the potential benefits of restoring the code.
Thank you for your attention and consideration.
@maksim-petukhov more reading here: https://github.com/Exiv2/exiv2/pull/2090
@neheb I read it. It is still doesn't answer why @piponazo deleted code for EXIV2_ENABLE_WIN_UNICODE
. exiv2.exe binary works with "set UTF-8 as the code page" - great. But as I'm trying to point out, there are big projects out there relying on exiv2 library that can't just "set UTF-8 as the code page" because this could break a lot of things.
And digiKam also depend on Exiv2 in-deep, and we don't have this kind of problem with UTF8 under Windows when it's cross compiled under Linux MinGW. Please revert this code deletion, as the library is not suitable under Windows when it's cross compiled. I second also maksim-petukhov about the topic : "Why to drop working code tested since a vrey long time ?"
NOTE: recommend me to compile whole digiKam with MSVC under Windows is not a solution for digiKam at all...
Gilles Caulier
NOTE: recommend me to compile whole digiKam with MSVC under Windows is not a solution for digiKam at all...
You have the MSYS2 UCRT64 option (the default now for MSYS2). I think Octave recently switched to UCRT builds also using MXE - @mmuetzel can you please advise the digiKam devs?
AFAICT, it seems Fedora's MinGW cross toolchain can also target UCRT since F37: https://fedoraproject.org/wiki/Changes/F37MingwUCRT (e.g. https://src.fedoraproject.org/rpms/mingw-zlib/pull-request/3)
Please get in touch with any other distro not supporting and ask for UCRT targets (cf. Arch, Debian, don't know about Ubuntu...)
Just my 2 cents without getting a complete picture yet: Switching Windows to a UTF-8 locale only affects binaries that use the new Universal C Runtime (UCRT) instead of the older MS Visual C Runtime (MSVCRT). But even then, there are still quite a few applications that don't work when switching Windows to a UTF-8 locale. (IIRC, Notepad++ had some issues last time I checked.)
So, the only way to reliably support non-ASCII characters in file names on Windows is to use their wide character API. I guess that is what is meant when people talk about wstring
here?
I guess that is what is meant when people talk about
wstring
here?
Correct. However, we're discussing here about transitioning exiv2 clients to UCRT builds and UTF-8 locale only, as exiv2 in the latest version dropped the W API*. digiKam is a good example, as it uses MXE builds like Octave.
* Why, and why now is a separate discussion, but let's be pragmatic for a moment and focus on the UCRT build transition, as it'll have to be done by people sooner or later anyway...
I understand that this issue took a turn to become about linking to the UCRT. But to come back to the original issue: Last time I checked, setting Windows to a UTF-8 locale was still experimental. And I'd really urge you guys too not recommend to your users to activate a setting that will leave them with a half borked Windows installation.
And I'd really urge you guys too not recommend to your users to activate a setting that will leave them with a half borked Windows installation.
How so? The UTF-8 setlocale
setting (or the app manifest) only affects the running process, not the system.
Anyhow, as I said, I'm not too keen on getting into why (feel free to join the prior issues and PRs linked above), but trying to offer workarounds.
The "UTF-8 as the code page" feature is available only since Windows 1803. For Windows 10 operating systems prior to 1803 (10.0.17134.0), only static linking of UCRT is supported.. Windows 10 version prior 1803 are still supported in LTSC channel and static linking of UCRT is discouraged by Microsoft for security reasons. I don't see a nice way to use 0.28 across all Windows 10 versions.
Btw, even more than one year ago this was already less than 0.2% of all Windows 10/11 machines (since 2018 >90% out of those 2.4% were already on 1803). (Windows 7 & 8.1 are not counted here as they are no longer supported by MS.) If you say a lot of enterprise machines are probably also not counted here, what could it be realistically then given the 3-5 year enterprise cycles? Still less than 1%, no? What percentage of those will ever use exiv2? The UCRT toolchains are there (both native and cross). Let's move on.
If you are sure that you won't restore using the wide character Windows API, that's your decision.
(Cross-)compiling MinGW and GCC that default to linking to the UCRT should only require adding --with-default-msvcrt=ucrt
to their respective configure
flags.
wstring API removal was premature looks like. Can probably be added back in terms of std::filesystem.
Debian and hence Ubuntu do not have a UCRT target: https://groups.google.com/g/linux.debian.bugs.dist/c/vU9KxW1RTLg
Fedora has one but it's very barebones.
Yeah messing around with std::filesystem leads me to believe the wstring API should come back. fs::path is an std::(w)string based on the platform. For Windows it requires calling .string(), which does not work with references as it's a temporary.
I encountered this regression as well after upgrading to v0.28.0 and would like to note that Microsoft added UTF-8 code page on top of core Windows functionality, which remains to be UTF-16 (well, UCS-2, actually), so it will end up calling the same W
functions in Win32 that _wfopen
did, just slower, because of the extra character conversion.
Also, I imagine processes using a specific code page would need to check the code page inherited from the console and report an error if it doesn't match - wouldn't this cause more problems, compared to just using wide characters as 0.26.x did and not worrying about code pages?
It would be great if this breaking change was reverted until a usable alternative is found, whatever it is. You can even make it smaller if you use std::filesystem::path
instead of std::string
in FileIo::Impl
because it already provides functionality to obtain std::string
, std::wstring
and std::u8string
for any given path, so it should be possible to reduce code duplication by just conditionally compiling code where there's difference, like calling _wfopen
vs. fopen
(or using file I/O traits of sorts).
I will also note that since UTF-8 being the target here, keep in mind that in C++20 char8_t*
and char*
, as well as std::u8string
and std::string
are no longer interchangeable, so if you end up calling path
functions returning std::u8string
, it won't work with the rest of the code expecting std::string
.
Requires small API change. See https://github.com/Exiv2/exiv2/pull/2800
While it's a good change, I don't mean it as just an improvement, but rather as a way to reduce the amount of conditionally compiled code, which in 0.26.x included entire methods. Having std::filesystem::path
would allow compile conditionally only key function calls, such as _wfopen
vs. fopen
and call the string method appropriate for the API call.
As a side note, you may want to revisit by-value path
parameters - most paths are over the short string optimization limit and will unnecessarily allocate memory. Given that std::filesystem::directory_entry::path()
returns std::const filesystem::path&
, it would reduce memory allocations for apps that scan a lot of images.
@gh-andre public API is kept at C++11 and private is C++17. The ref to value change was because it does not work under Windows. The alternative is to reintroduce const std::string& and const std::wstring&.
Not sure what wasn't working on Windows. Maybe it is a C++11 issue, I cannot say. Definitely has no problem in C++17, so I was able to create a patch to handle opening wide-character file paths for Exiv2 v0.28.0. It will only work for files because the implementation isn't cleanly separated for various derived classes of BasicIo
and surrounding code.
That is, some things seem like they got worse over time, such as calling fopen
and stat
outside of the implementation class, and some were like that from the beginning, like calling ::remove
and fs::remove
within the same block of code, or extracting a part of a URL as if it's a file (e.g. std::make_unique<FileIo>(pathOfFileUrl(path))
), which will fail for any URL-encoded sequences, like %20
.
Initially, I took a stab at the whole thing, similar to how previous EXIV2_ENABLE_WIN_UNICODE
did it, but it was futile because file paths and protocols were all mixed in. The WIN_UNIQUDE patch hacked it in places doing things like assigning wchar_t
to char
characters, which only works for ASCII, duplicating thumbnail extensions methods, etc. So, I opted for a localized change, just to be able to work with image files. If anyone is interested, here's the patch:
https://github.com/StoneStepsInc/exiv2-nuget/blob/master/patches/01-wide-char-paths.patch
It will apply against the released Exiv2 0.28.0 source in exiv2-0.28.0-Source.tar.gz
. Narrow character functions will still work as before, but will not be able to open file paths outside of the current code page. Functions that take const std::filesystem::path&
will open Windows paths comprised from valid UTF-16 characters.
If I am not wrong, this could be fixed with #2800? Is there anything blocking it from being merged? Thanks!
Note : digiKam is now also compiled with VCPKG and MSVC compiler under Windows and the problem is also present in native build of Exiv2.
https://bugs.kde.org/show_bug.cgi?id=483743
Gilles Caulier
@cgilles As noted in the darktable pull request and elsewhere, please do not use the .manifest directly (MSVC only) - if you wrap it in a .rc, it works for both MinGW and MSVC (you'll need one extra option for MSVC though...)
Are you going to fix this bug?