bit7z [Bug]: error when dealing with UTF-16 (with surrogate pairs) file names

bit7z version

4.0.8

Compilation options

No response

7-zip version

v24.09

7-zip shared library used

lib7zip.dylib

Compilers

Clang

Compiler versions

No response

Architecture

x86_64

Operating system

macOS

Operating system versions

No response

Bug description

item.name() throws an exception on macOS:

    for (const auto &item : arc)
    {
        std::cout << std::endl;
        std::cout << "  Item index: " << item.index() << std::endl;
        std::cout << "    Name: " << item.name() << std::endl;
        std::cout << "    Extension: " << item.extension() << std::endl;
        std::cout << "    Path: " << item.path() << std::endl;
        std::cout << "    IsDir: " << item.isDir() << std::endl;
        std::cout << "    Size: " << item.size() << std::endl;
        std::cout << "    Packed size: " << item.packSize() << std::endl;
        std::cout << "    CRC: " << std::hex << item.crc() << std::dec << std::endl;
    }


        // 7-Zip returns a UTF-16 string from CHandler::GetProperty (CPP/7zip/Archive/Zip/ZipHandler.cpp).
        // The returned UString uses two wchar_t values to describe a character that is a UTF-16 surrogate pair.
        // On Windows this is correct for a UTF-16 string.
        // On other OSes it differs from a regular wchar_t* string, where one 32-bit value describes one Unicode character.
        // 7zz internally converts such a UString (two wchar_t per character) back with CStdOutStream::Convert_UString_to_AString when printing to cout,
        // so 7zz has no error with this.
        // bit7z, however, uses BitPropVariant::getNativeString() (src/bitpropvariant.cpp) to get the UString from 7-Zip, still with two wchar_t per such character,
        // and then bit7z::narrow converts that UString to std::string, which causes the error.
        // So item.name() throws: Exception: wstring_convert: to_bytes error

Steps to reproduce

On macOS:

    git clone https://github.com/debugee/bit7zip-test.git
    cd bit7zip-test
    cmake -B build .
    cmake --build build --target install
    ./build/test ./😊123你好.zip

Output:

    ./😊123你好.zip
    Processing archive: ./😊123你好.zip
    Archive properties
      Items count: 2
      Folders count: 0
      Files count: 2
      Size: 220
      Packed size: 92

    Archived items
      Item index: 0
        Name: Exception: wstring_convert: to_bytes error

Expected behavior

No response

Relevant compilation output

Feb 19 '25 09:02 debugee

This issue is similar to https://github.com/rikyoz/bit7z/issues/267

Keep it up! Best regards!

Feb 19 '25 11:02 debugee

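Pseudocode for a UString u coming from 7-Zip:

    if (sizeof(wchar_t) == 4) // macOS or Linux
    {
        unsigned short o[/* length of u */];
        size_t i = 0;
        for (wchar_t w : u) {
            o[i++] = (unsigned short) w; // keep only the low 16 bits; o is now a standard UTF-16LE string
        }
        iconv(o, "utf-16LE", "utf-32LE"); // now we can use it
    }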

bit7z would have to do the reverse of this when passing strings to 7-Zip,

and apply this conversion to every UString it gets from 7-Zip.

Could this work?
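
A rough sketch of the idea above without iconv, assuming each wchar_t of the UString holds exactly one UTF-16 code unit. The function name utf16_units_to_utf32 is made up for illustration; it is not part of bit7z or 7-Zip.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a bit7z or 7-Zip API).
// Input: a wide string whose elements each hold one UTF-16 code unit.
// Output: a UTF-32 string with one element per Unicode code point.
std::u32string utf16_units_to_utf32(const std::wstring& units) {
    std::u32string out;
    out.reserve(units.size());
    for (std::size_t i = 0; i < units.size(); ++i) {
        char32_t cp = static_cast<char32_t>(units[i]);
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool is_low = cp >= 0xDC00 && cp <= 0xDFFF;
        if (is_high && i + 1 < units.size()) {
            const char32_t next = static_cast<char32_t>(units[i + 1]);
            if (next >= 0xDC00 && next <= 0xDFFF) {
                // Combine the high and low surrogate into one code point.
                out.push_back(0x10000 + ((cp - 0xD800) << 10) + (next - 0xDC00));
                ++i; // the low surrogate has been consumed
                continue;
            }
        }
        if (is_high || is_low) {
            cp = 0xFFFD; // unpaired surrogate: emit the replacement character
        }
        out.push_back(cp);
    }
    return out;
}
```

For the emoji in the test archive name, the two elements 0xD83D and 0xDE0A would become the single code point U+1F60A.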

Feb 19 '25 15:02 debugee

Hi!

Yeah, as I mentioned in that other issue, this is not so much a bit7z bug as a consequence of 7-Zip's incorrect handling of string encodings.

Bit7z expects UTF-32 wide strings on Linux and macOS, as UTF-32 is basically the standard encoding for this kind of string on those platforms. Unfortunately, 7-Zip doesn't return valid UTF-32 strings when the original string contains UTF-16 surrogate pairs (and also in other cases).
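
To illustrate what "not valid UTF-32" means here, the following is a minimal sketch of a validity check, assuming a platform where wchar_t is 32 bits wide. The name is_valid_utf32 is hypothetical and does not exist in bit7z.

```cpp
#include <string>

// Hypothetical check (not a bit7z API): returns true only if every element
// is a Unicode scalar value, i.e. not a surrogate (U+D800..U+DFFF) and not
// above U+10FFFF.
bool is_valid_utf32(const std::wstring& s) {
    for (wchar_t wc : s) {
        const char32_t cp = static_cast<char32_t>(wc);
        if (cp > 0x10FFFF) {
            return false; // outside the Unicode code space
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return false; // surrogate code units are not valid in UTF-32
        }
    }
    return true;
}
```

A string holding the two elements 0xD83D and 0xDE0A (the surrogate pair of U+1F60A) fails this check, which is the kind of input a strict UTF-32 to UTF-8 converter is entitled to reject.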

This is actually the major issue blocking the release of v4.1-beta, as finding a good solution is not easy.

bit7z would have to do the reverse of this when passing strings to 7-Zip,

and apply this conversion to every UString it gets from 7-Zip.

Could this work?

Initially, this was my idea, but unfortunately it is not so simple.

7-Zip behavior changes according to the archive format: 7z archives use UTF-16 strings, so your fix should work, but zip and other archives use other encodings, and 7-Zip seems to decode such strings incorrectly as well.

For example, if the zip archive stores UTF-8 strings, 7-Zip will provide the UTF-8 code units as 32-bit wide characters, which is not a UTF-32 encoded string.
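
As an illustration of that layout, here is a small sketch that packs such a wide string back into a narrow one, assuming every element really is a single UTF-8 code unit. The helper name pack_utf8_code_units is invented for this example.

```cpp
#include <string>

// Hypothetical helper (not a bit7z API). If every wide character fits in one
// byte, copy each one into a narrow string and return true; otherwise leave
// `out` untouched and return false.
bool pack_utf8_code_units(const std::wstring& units, std::string& out) {
    std::string packed;
    packed.reserve(units.size());
    for (wchar_t wc : units) {
        const char32_t value = static_cast<char32_t>(wc);
        if (value > 0xFF) {
            return false; // cannot be a single UTF-8 code unit
        }
        packed.push_back(static_cast<char>(static_cast<unsigned char>(value)));
    }
    out.swap(packed);
    return true;
}
```

With this layout the character 你 (U+4F60) would arrive as the three elements 0xE4, 0xBD, 0xA0 rather than as the single element 0x4F60. Note that the check is only a heuristic: a genuine UTF-32 string made of Latin-1 characters also has all elements below 0x100, so the input encoding still has to be known from elsewhere.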

The main problem, then, is that bit7z's string conversion code doesn't have access to the archive format from which the string was read, so it doesn't know which input encoding to convert from. Also, some formats might use different encodings (e.g., the zip format can use encodings other than UTF-8).

I'm working on a solution, which will come in the next v4.1-beta, but it will require some time.

Feb 19 '25 22:02 rikyoz

I looked at the 7-Zip source code in ZipItem.cpp, UTFConvert.cpp, and StringConvert.cpp.

The ConvertUTF8ToUnicode and MultiByteToUnicodeString2 functions always return a UString in which one wchar_t describes one surrogate,

no matter whether UTF-8 is used by default (the isUtf8 function returns true by default in the new 7-Zip version 22.02),

or the user manually forces UTF-8,

or the user manually sets UTF-8 through _specifiedCodePage and _forceCodePage.

In all these cases UTF-8 goes through ConvertUTF8ToUnicode, which returns a UString with one wchar_t per surrogate.

If _specifiedCodePage is not set to UTF-8, or is not set at all (mbstowcs is used with the C locale),

MultiByteToUnicodeString2 is called and also returns a UString with one wchar_t per surrogate: when WCHAR_MAX > 0xffff, the result is converted to that form.

A comment in the source code says so.

So I think the solution above can work?

Feb 20 '25 07:02 debugee

I looked at the 7-Zip source code in ZipItem.cpp, UTFConvert.cpp, and StringConvert.cpp.

The ConvertUTF8ToUnicode and MultiByteToUnicodeString2 functions always return a UString in which one wchar_t describes one surrogate,
no matter whether UTF-8 is used by default (the isUtf8 function returns true by default in the new 7-Zip version 22.02),
or the user manually forces UTF-8,
or the user manually sets UTF-8 through _specifiedCodePage and _forceCodePage.
In all these cases UTF-8 goes through ConvertUTF8ToUnicode, which returns a UString with one wchar_t per surrogate.

If _specifiedCodePage is not set to UTF-8, or is not set at all (mbstowcs is used with the C locale),

Yeah, the problem is that options like _specifiedCodePage and _forceCodePage are not available at the API level for bit7z to use; they're internal to 7-Zip. Even the "cu" option is only available when creating zip archives, not when reading them, as far as I know.

A comment in the source code says so.

So I think the solution above can work?

As I said, your solution should work for 7z archives.

A similar solution could be used for zip archives, if we just make UTF-8 the default encoding, but then we have to allow the user to specify a different encoding (basically implementing the mcp parameter, see issue #267).

Also, I wouldn't use the iconv library to convert to UTF-32, but rather to convert directly to UTF-8 strings, since the bit7z API uses UTF-8 for strings on Linux and macOS.

Finally, I still need to evaluate the performance impact of such string conversions, since you are essentially calling the same iconv conversion function for each character of the string, rather than calling it once over a string. I'll have to do some analysis in that sense to figure out the best approach.

As I said, on bit7z's side there's some restructuring to do, as the information about the archive format doesn't reach the string conversion functions. I wouldn't even rule out implementing a custom string conversion function instead of relying on iconv, but that approach might introduce other problems as well.
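
Purely as an illustration of what such a custom conversion might look like (this is not bit7z's implementation), here is a single-pass sketch that goes straight from a wide string of UTF-16 code units to UTF-8. It assumes the 7z-style layout of one UTF-16 code unit per wchar_t; the names append_utf8 and utf16_units_to_utf8 are made up.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: append one code point to `out`, encoded as UTF-8.
static void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Hypothetical converter (not a bit7z API): one pass over the wide string,
// pairing surrogates and writing UTF-8 bytes as it goes.
std::string utf16_units_to_utf8(const std::wstring& units) {
    std::string out;
    out.reserve(units.size());
    for (std::size_t i = 0; i < units.size(); ++i) {
        char32_t cp = static_cast<char32_t>(units[i]);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < units.size()) {
            const char32_t low = static_cast<char32_t>(units[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i; // skip the low surrogate that was just consumed
            }
        }
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            cp = 0xFFFD; // unpaired surrogate or out-of-range value
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Doing the work in one loop avoids a per-character library call and an intermediate UTF-32 buffer, but it still only covers one input layout; strings that arrive as UTF-8 code units or in another code page would need a different path.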

Feb 20 '25 21:02 rikyoz

I understand now: you want to create an interface that lets the user set the charset. Without such an interface everything is simple: just deal with UStrings in and out.

Feb 21 '25 00:02 debugee

If you provide this interface to set the charset, it is only used by 7-Zip internally and does not affect how you read and write UStrings, because you always read and write UStrings, and the UString format is well defined. Am I getting it right?

Feb 21 '25 06:02 debugee

Sorry for the late reply.

Am I getting it right?

I'm not sure, as I don't quite understand what you're trying to say in your last two comments, sorry.

Bit7z cannot change the behavior of 7-Zip's internal code, so it has to deal with the wrongly encoded wide strings (UString).

If the wide characters in such strings are actually UTF-8 code units, or they're simply ASCII characters, then converting them to narrow UTF-8 strings (i.e., the strings used in the bit7z API) is easy.

All other cases will require some thought about how to do the conversion without unnecessary overhead. Also, the lack of support for the mcp option will require some internal changes to how bit7z handles string encoding.

Bit7z could provide the raw UStrings directly to the user, but wide strings are unusual and difficult to handle on Linux and macOS, so I will not consider this solution, as it would make the library harder to use.

Mar 02 '25 22:03 rikyoz