exiv2 icon indicating copy to clipboard operation
exiv2 copied to clipboard

What happened to wstring support in open() APIs?

Open ribtoks opened this issue 1 year ago • 27 comments

Hi

I upgraded exiv2 to 0.28.0 just to find that std::wstring support has been nuked out in 7933ff401ddb9768502c36b1febea06b854e176a but I could not find a matching PR, issue or at least a discussion that would explain why and what to do now.

std::string just does not work in Windows for non-latin1 paths for Exiv2::ImageFactory::open() (Visual Studio compiler).

What is the intended way to use it now?

These are all the old issues that reference the problem:

  • https://dev.exiv2.org/issues/851
  • https://github.com/Exiv2/exiv2/issues/679

ribtoks avatar May 28 '23 08:05 ribtoks

I can confirm that digiKam 8.1.0 pre-release based on Exiv2 0.28 is broken under windows to read metadata from images.

https://bugs.kde.org/show_bug.cgi?id=469372 https://bugs.kde.org/show_bug.cgi?id=470566

cgilles avatar Jun 02 '23 15:06 cgilles

The intended way to use this is to set UTF-8 as the code page. See: https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

edit: oh I see. digiKam is compiled with MSVCRT instead of UCRT. I don't think that can work.

ping @piponazo

neheb avatar Jun 06 '23 04:06 neheb

No, digiKam is cross compiled under Windows using MXE (MinGW env - https://github.com/mxe/mxe) since 10 years now. The MSVC build is just another way to build the application in the KDE infra, it's not the official production release at all, because MSVC is unsafe,slow, and under-productive platform to compile huge code as digiKam, but it's another story...

Exiv2 team, please revert this commit to drop the UNICODE support under Windows. It's not at all a "dead" code... or explain well the work around. Note: I'm surprised to not see an unit-tests crying for the UTF16 long file path under Windows...

cgilles avatar Jun 06 '23 06:06 cgilles

No, digiKam is cross compiled under Windows using MXE (MinGW env - https://github.com/mxe/mxe) since 10 years now. The MSVC build is just another way to build the application in the KDE infra, it's not the official production release at all, because MSVC is unsafe,slow, and under-productive platform to compile huge code as digiKam, but it's another story...

Exiv2 team, please revert this commit to drop the UNICODE support under Windows. It's not at all a "dead" code... or explain well the work around. Note: I'm surprised to not see an unit-tests crying for the UTF16 long file path under Windows...

Read what I said again. MSVCRT and UCRT

neheb avatar Jun 06 '23 06:06 neheb

@neheb @piponazo What was the reason for deleting perfectly working code in https://github.com/Exiv2/exiv2/commit/7933ff401ddb9768502c36b1febea06b854e176a? I'd like to advocate for reverting this deletion. The removed code was functional and served a purpose, so it doesn't seem like dead code. It's been quite useful for me, and I believe it could continue to benefit other users as well.

The intended way to use this is to set UTF-8 as the code page. See: https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

I see that the current recommendation is to "set UTF-8 as the code page". It is a bit reckless to ask users of Exiv2 to do that. There are some problems with this approach. First of all, this can perfectly work in tests but there is a reality when you have 150+ external libraries in your project and neither of them mentions that it works flawlessly with relatively new Microsoft invention. There are also loads of your code that relies on assumption "it is not UTF-8 there but something else" you have to check to make sure nothing is broken by forcing UTF-8 code page.

The "UTF-8 as the code page" feature is available only since Windows 1803. For Windows 10 operating systems prior to 1803 (10.0.17134.0), only static linking of UCRT is supported.. Windows 10 version prior 1803 are still supported in LTSC channel and static linking of UCRT is discouraged by Microsoft for security reasons. I don't see a nice way to use 0.28 across all Windows 10 versions.

I'm sharing these insights in the spirit of maintaining the functionality and usability of Exiv2 across a wide range of scenarios. I hope we can discuss this further and consider the potential benefits of restoring the code.

Thank you for your attention and consideration.

maksim-petukhov avatar Aug 15 '23 18:08 maksim-petukhov

@maksim-petukhov more reading here: https://github.com/Exiv2/exiv2/pull/2090

neheb avatar Aug 15 '23 18:08 neheb

@neheb I read it. It is still doesn't answer why @piponazo deleted code for EXIV2_ENABLE_WIN_UNICODE. exiv2.exe binary works with "set UTF-8 as the code page" - great. But as I'm trying to point out, there are big projects out there relying on exiv2 library that can't just "set UTF-8 as the code page" because this could break a lot of things.

maksim-petukhov avatar Aug 15 '23 18:08 maksim-petukhov

And digiKam also depend on Exiv2 in-deep, and we don't have this kind of problem with UTF8 under Windows when it's cross compiled under Linux MinGW. Please revert this code deletion, as the library is not suitable under Windows when it's cross compiled. I second also maksim-petukhov about the topic : "Why to drop working code tested since a vrey long time ?"

NOTE: recommend me to compile whole digiKam with MSVC under Windows is not a solution for digiKam at all...

Gilles Caulier

cgilles avatar Aug 17 '23 18:08 cgilles

NOTE: recommend me to compile whole digiKam with MSVC under Windows is not a solution for digiKam at all...

You have the MSYS2 UCRT64 option (the default now for MSYS2). I think Octave recently switched to UCRT builds also using MXE - @mmuetzel can you please advise the digiKam devs?

kmilos avatar Aug 18 '23 07:08 kmilos

AFAICT, it seems Fedora's MinGW cross toolchain can also target UCRT since F37: https://fedoraproject.org/wiki/Changes/F37MingwUCRT (e.g. https://src.fedoraproject.org/rpms/mingw-zlib/pull-request/3)

Please get in touch with any other distro not supporting and ask for UCRT targets (cf. Arch, Debian, don't know about Ubuntu...)

kmilos avatar Aug 18 '23 07:08 kmilos

Just my 2 cents without getting a complete picture yet: Switching Windows to a UTF-8 locale only affects binaries that use the new Universal C Runtime (UCRT) instead of the older MS Visual C Runtime (MSVCRT). But even then, there are still quite a few applications that don't work when switching Windows to a UTF-8 locale. (IIRC, Notepad++ had some issues last time I checked.)

So, the only way to reliably support non-ASCII characters in file names on Windows is to use their wide character API. I guess that is what is meant when people talk about wstring here?

mmuetzel avatar Aug 18 '23 08:08 mmuetzel

I guess that is what is meant when people talk about wstring here?

Correct. However, we're discussing here about transitioning exiv2 clients to UCRT builds and UTF-8 locale only, as exiv2 in the latest version dropped the W API*. digiKam is a good example, as it uses MXE builds like Octave.

* Why, and why now is a separate discussion, but let's be pragmatic for a moment and focus on the UCRT build transition, as it'll have to be done by people sooner or later anyway...

kmilos avatar Aug 18 '23 08:08 kmilos

I understand that this issue took a turn to become about linking to the UCRT. But to come back to the original issue: Last time I checked, setting Windows to a UTF-8 locale was still experimental. And I'd really urge you guys too not recommend to your users to activate a setting that will leave them with a half borked Windows installation.

mmuetzel avatar Aug 18 '23 08:08 mmuetzel

And I'd really urge you guys too not recommend to your users to activate a setting that will leave them with a half borked Windows installation.

How so? The UTF-8 setlocale setting (or the app manifest) only affects the running process, not the system.

Anyhow, as I said, I'm not too keen on getting into why (feel free to join the prior issues and PRs linked above), but trying to offer workarounds.

kmilos avatar Aug 18 '23 08:08 kmilos

The "UTF-8 as the code page" feature is available only since Windows 1803. For Windows 10 operating systems prior to 1803 (10.0.17134.0), only static linking of UCRT is supported.. Windows 10 version prior 1803 are still supported in LTSC channel and static linking of UCRT is discouraged by Microsoft for security reasons. I don't see a nice way to use 0.28 across all Windows 10 versions.

Btw, even more than one year ago this was already less than 0.2% of all Windows 10/11 machines (since 2018 >90% out of those 2.4% were already on 1803). (Windows 7 & 8.1 are not counted here as they are no longer supported by MS.) If you say a lot of enterprise machines are probably also not counted here, what could it be realistically then given the 3-5 year enterprise cycles? Still less than 1%, no? What percentage of those will ever use exiv2? The UCRT toolchains are there (both native and cross). Let's move on.

kmilos avatar Aug 18 '23 09:08 kmilos

If you are sure that you won't restore using the wide character Windows API, that's your decision.

(Cross-)compiling MinGW and GCC that default to linking to the UCRT should only require adding --with-default-msvcrt=ucrt to their respective configure flags.

mmuetzel avatar Aug 21 '23 16:08 mmuetzel

wstring API removal was premature looks like. Can probably be added back in terms of std::filesystem.

Debian and hence Ubuntu do not have a UCRT target: https://groups.google.com/g/linux.debian.bugs.dist/c/vU9KxW1RTLg

Fedora has one but it's very barebones.

neheb avatar Aug 21 '23 18:08 neheb

Yeah messing around with std::filesystem leads me to believe the wstring API should come back. fs::path is an std::(w)string based on the platform. For Windows it requires calling .string(), which does not work with references as it's a temporary.

neheb avatar Aug 21 '23 19:08 neheb

I encountered this regression as well after upgrading to v0.28.0 and would like to note that Microsoft added UTF-8 code page on top of core Windows functionality, which remains to be UTF-16 (well, UCS-2, actually), so it will end up calling the same W functions in Win32 that _wfopen did, just slower, because of the extra character conversion.

Also, I imagine processes using a specific code page would need to check the code page inherited from the console and report an error if it doesn't match - wouldn't this cause more problems, compared to just using wide characters as 0.26.x did and not worrying about code pages?

It would be great if this breaking change was reverted until a usable alternative is found, whatever it is. You can even make it smaller if you use std::filesystem::path instead of std::string in FileIo::Impl because it already provides functionality to obtain std::string, std::wstring and std::u8string for any given path, so it should be possible to reduce code duplication by just conditionally compiling code where there's difference, like calling _wfopen vs. fopen (or using file I/O traits of sorts).

I will also note that since UTF-8 being the target here, keep in mind that in C++20 char8_t* and char*, as well as std::u8string and std::string are no longer interchangeable, so if you end up calling path functions returning std::u8string, it won't work with the rest of the code expecting std::string.

gh-andre avatar Oct 19 '23 18:10 gh-andre

Requires small API change. See https://github.com/Exiv2/exiv2/pull/2800

neheb avatar Oct 19 '23 18:10 neheb

While it's a good change, I don't mean it as just an improvement, but rather as a way to reduce the amount of conditionally compiled code, which in 0.26.x included entire methods. Having std::filesystem::path would allow compile conditionally only key function calls, such as _wfopen vs. fopen and call the string method appropriate for the API call.

As a side note, you may want to revisit by-value path parameters - most paths are over the short string optimization limit and will unnecessarily allocate memory. Given that std::filesystem::directory_entry::path() returns std::const filesystem::path&, it would reduce memory allocations for apps that scan a lot of images.

gh-andre avatar Oct 20 '23 02:10 gh-andre

@gh-andre public API is kept at C++11 and private is C++17. The ref to value change was because it does not work under Windows. The alternative is to reintroduce const std::string& and const std::wstring&.

neheb avatar Oct 21 '23 19:10 neheb

Not sure what wasn't working on Windows. Maybe it is a C++11 issue, I cannot say. Definitely has no problem in C++17, so I was able to create a patch to handle opening wide-character file paths for Exiv2 v0.28.0. It will only work for files because the implementation isn't cleanly separated for various derived classes of BasicIo and surrounding code.

That is, some things seem like they got worse over time, such as calling fopen and stat outside of the implementation class, and some were like that from the beginning, like calling ::remove and fs::remove within the same block of code, or extracting a part of a URL as if it's a file (e.g. std::make_unique<FileIo>(pathOfFileUrl(path))), which will fail for any URL-encoded sequences, like %20.

Initially, I took a stab at the whole thing, similar to how previous EXIV2_ENABLE_WIN_UNICODE did it, but it was futile because file paths and protocols were all mixed in. The WIN_UNIQUDE patch hacked it in places doing things like assigning wchar_t to char characters, which only works for ASCII, duplicating thumbnail extensions methods, etc. So, I opted for a localized change, just to be able to work with image files. If anyone is interested, here's the patch:

https://github.com/StoneStepsInc/exiv2-nuget/blob/master/patches/01-wide-char-paths.patch

It will apply against the released Exiv2 0.28.0 source in exiv2-0.28.0-Source.tar.gz. Narrow character functions will still work as before, but will not be able to open file paths outside of the current code page. Functions that take const std::filesystem::path& will open Windows paths comprised from valid UTF-16 characters.

gh-andre avatar Oct 22 '23 20:10 gh-andre

If I am not wrong, this could be fixed with #2800? Is there anything blocking it from being merged? Thanks!

AlynxZhou avatar Dec 01 '23 18:12 AlynxZhou

Note : digiKam is now also compiled with VCPKG and MSVC compiler under Windows and the problem is also present in native build of Exiv2.

https://bugs.kde.org/show_bug.cgi?id=483743

Gilles Caulier

cgilles avatar Mar 16 '24 12:03 cgilles

Are you going to fix this bug?

w17 avatar Jun 01 '24 18:06 w17