[Bug]: Error with file names containing UTF-16 surrogate pairs
bit7z version
4.0.8
Compilation options
No response
7-zip version
v24.09
7-zip shared library used
lib7zip.dylib
Compilers
Clang
Compiler versions
No response
Architecture
x86_64
Operating system
macOS
Operating system versions
No response
Bug description
item.name() throws an exception on macOS:
for (const auto &item : arc)
{
std::cout << std::endl;
std::cout << " Item index: " << item.index() << std::endl;
std::cout << " Name: " << item.name() << std::endl;
std::cout << " Extension: " << item.extension() << std::endl;
std::cout << " Path: " << item.path() << std::endl;
std::cout << " IsDir: " << item.isDir() << std::endl;
std::cout << " Size: " << item.size() << std::endl;
std::cout << " Packed size: " << item.packSize() << std::endl;
std::cout << " CRC: " << std::hex << item.crc() << std::dec << std::endl;
}
// 7-Zip returns the name as a UTF-16 string from CHandler::GetProperty (CPP/7zip/Archive/Zip/ZipHandler.cpp).
// The returned UString uses two wchar_t values to describe one character that is a UTF-16 surrogate pair.
// On Windows this is correct, because wide strings are UTF-16 there.
// On other operating systems it differs from a normal wchar_t* string, where one 32-bit value describes one Unicode character.
// 7zz internally uses the same UString (two wchar_t values per such character) and converts it back with CStdOutStream::Convert_UString_to_AString when printing,
// so 7zz has no problem with this.
// bit7z, however, uses BitPropVariant::getNativeString() (src/bitpropvariant.cpp) to get the UString from 7-Zip (still two wchar_t values per such character),
// and then bit7z::narrow converts the UString to std::string, which fails.
// So item.name() throws Exception: wstring_convert: to_bytes error
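To illustrate the representation described in the comments above, here is a minimal sketch of how a character such as 😊 (U+1F60A) splits into two UTF-16 code units. `toSurrogatePair` is a hypothetical helper written for this illustration; it is not part of bit7z or 7-Zip. A valid UTF-32 wchar_t string on macOS or Linux would hold the single value 0x1F60A, whereas the UString described above holds the two surrogate values, each in its own wchar_t.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper (illustration only, not bit7z or 7-Zip code).
// Splits a code point above U+FFFF into its UTF-16 surrogate pair.
// Precondition: 0x10000 <= cp <= 0x10FFFF.
std::pair<std::uint16_t, std::uint16_t> toSurrogatePair(char32_t cp) {
    const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000u;
    const auto high = static_cast<std::uint16_t>(0xD800u + (v >> 10));    // top 10 bits
    const auto low  = static_cast<std::uint16_t>(0xDC00u + (v & 0x3FFu)); // bottom 10 bits
    return {high, low};
}
```

For U+1F60A this yields the pair 0xD83D, 0xDE0A. Neither value is a valid code point on its own, which is consistent with a strict wide-to-UTF-8 conversion rejecting the string.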
Steps to reproduce
On macOS:
git clone https://github.com/debugee/bit7zip-test.git
cd bit7zip-test
cmake -B build .
cmake --build build --target install
./build/test ./😊123你好.zip
./😊123你好.zip
Processing archive: ./😊123你好.zip
Archive properties
 Items count: 2
 Folders count: 0
 Files count: 2
 Size: 220
 Packed size: 92
Archived items
 Item index: 0
 Name: Exception: wstring_convert: to_bytes error
Expected behavior
No response
Relevant compilation output
Code of Conduct
- [x] By submitting this issue, I agree to follow bit7z's Code of Conduct
This issue is similar to https://github.com/rikyoz/bit7z/issues/267
Keep up the good work! Best regards!
// Pseudocode: UString u comes from 7-Zip
if (sizeof(wchar_t) == 4) // macOS or Linux (32-bit wchar_t)
{
    std::vector<unsigned short> o;
    for (wchar_t w : u) {
        o.push_back((unsigned short)w); // stored at half the size; o is now a standard UTF-16LE string
    }
    iconv(o, "utf-16LE", "utf-32LE"); // now we can use it
}
bit7z would have to do the reverse when passing strings to 7-Zip,
and apply this conversion to every UString it gets from 7-Zip.
Can this work?
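The idea above can also be sketched without iconv. This is only a rough illustration, assuming every wchar_t in the string received from 7-Zip holds exactly one UTF-16 code unit; `utf16UnitsToUtf32` is a hypothetical name, and std::wstring stands in for UString.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (illustration only): treats every wchar_t as one
// UTF-16 code unit and joins surrogate pairs into single code points.
std::u32string utf16UnitsToUtf32(const std::wstring& units) {
    std::u32string out;
    out.reserve(units.size());
    for (std::size_t i = 0; i < units.size(); ++i) {
        const char32_t u = static_cast<char32_t>(units[i]);
        const bool isHigh = u >= 0xD800 && u <= 0xDBFF;
        if (isHigh && i + 1 < units.size()) {
            const char32_t next = static_cast<char32_t>(units[i + 1]);
            if (next >= 0xDC00 && next <= 0xDFFF) {
                // Combine the high and low surrogates into one code point.
                out.push_back(0x10000 + ((u - 0xD800) << 10) + (next - 0xDC00));
                ++i; // the low surrogate has been consumed
                continue;
            }
        }
        out.push_back(u); // BMP character, or an unpaired surrogate kept as-is
    }
    return out;
}
```

With the input values 0xD83D, 0xDE0A, the result is the single code point 0x1F60A. Going the other way (splitting code points back into surrogates before handing a string to 7-Zip) would be the reverse step mentioned above.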
Hi!
Yeah, as I mentioned in that other issue, this is not so much a bit7z bug as incorrect string encoding handling in 7-Zip.
Bit7z expects UTF-32 wide strings on Linux and macOS, as it is basically the standard encoding used in 99% of cases for this kind of string. Unfortunately, 7-Zip doesn't return valid UTF-32 strings when the original string contains UTF-16 surrogate pairs (and also in other cases).
This is actually the major issue blocking the release of the v4.1-beta, as finding a good solution is not easy.
> bit7z would have to do the reverse when passing strings to 7-Zip,
> and apply this conversion to every UString it gets from 7-Zip.
> Can this work?
Initially, this was my idea, but unfortunately it is not so simple.
7-Zip behavior changes according to the archive format: 7z archives use UTF-16 strings, so your fix should work, but zip and other archives use other encodings, and 7-Zip seems to decode such strings incorrectly as well.
For example, if the zip archive stores UTF-8 strings, 7-Zip will provide the UTF-8 code units as 32-bit wide characters, which is not a UTF-32 encoded string.
The main problem is then that bit7z's string conversion code doesn't have access to the format of the archive from which the string was read, so it doesn't know which input encoding it should convert from. Also, some formats might use different encodings (e.g., the zip format can use encodings other than UTF-8).
I'm working on a solution, which will come in the next v4.1-beta, but it will require some time.
I looked at the 7-Zip source code in ZipItem.cpp, UTFConvert.cpp, and StringConvert.cpp.
The ConvertUTF8ToUnicode and MultiByteToUnicodeString2 functions always return a UString in which one wchar_t describes one surrogate,
no matter whether UTF-8 is used by default (the isUtf8 function returns true by default in the new 7-Zip version, 22.02),
or the user manually forces UTF-8,
or the user manually sets _specifiedCodePage and _forceCodePage to UTF-8.
In all the UTF-8 cases, ConvertUTF8ToUnicode is used and returns a UString (one wchar_t describes one surrogate).
If _specifiedCodePage is set to something other than UTF-8, or is not set (mbstowcs is used with the C locale),
MultiByteToUnicodeString2 is called and also returns a UString (one wchar_t describes one surrogate): when WCHAR_MAX > 0xffff, the result is converted to that form.
A comment in the source code says this.
So I think the solution above can work?
> I looked at the 7-Zip source code in ZipItem.cpp, UTFConvert.cpp, and StringConvert.cpp.
> The ConvertUTF8ToUnicode and MultiByteToUnicodeString2 functions always return a UString in which one wchar_t describes one surrogate,
> no matter whether UTF-8 is used by default (the isUtf8 function returns true by default in the new 7-Zip version, 22.02),
> or the user manually forces UTF-8,
> or the user manually sets _specifiedCodePage and _forceCodePage to UTF-8.
> In all the UTF-8 cases, ConvertUTF8ToUnicode is used and returns a UString (one wchar_t describes one surrogate).
> If _specifiedCodePage is set to something other than UTF-8, or is not set (mbstowcs is used with the C locale),
Yeah, the problem is that options like _specifiedCodePage and _forceCodePage are not available at the API level for bit7z to use; they're internal to 7-Zip. Even the "cu" option is only available when creating zip archives, not when reading them, as far as I know.
> A comment in the source code says this.
> So I think the solution above can work?
As I said, your solution should work for 7z archives.
A similar solution could be used for zip archives, if we just make UTF-8 the default encoding, but then we have to allow the user to specify a different encoding (basically implementing the mcp parameter, see issue #267).
Also, I wouldn't use the iconv library to convert to UTF-32, but rather to convert directly to UTF-8 strings, since the bit7z API uses UTF-8 for strings on Linux and macOS.
Finally, I still need to evaluate the performance impact of such string conversions, since you are essentially calling the same iconv conversion function for each character of the string, rather than calling it once over a string. I'll have to do some analysis in that sense to figure out the best approach.
As I said, on bit7z's side there's some restructuring to do, as the information on the archive format doesn't reach the string conversion functions. I would not even rule out implementing a custom string conversion function instead of relying on iconv, but that approach might introduce other problems as well.
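As a rough sketch of what such a custom conversion could look like, assuming each wchar_t holds one UTF-16 code unit (which, per the discussion above, is only expected for 7z archives): the function below walks the string once and writes UTF-8 directly. The names `appendUtf8` and `wideUtf16ToUtf8` are hypothetical and not part of the bit7z API, and replacing unpaired surrogates with U+FFFD is just one possible policy.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: appends one code point to a UTF-8 string.
void appendUtf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Hypothetical helper: converts a wide string whose elements are UTF-16
// code units into UTF-8 in a single pass over the input.
std::string wideUtf16ToUtf8(const std::wstring& units) {
    std::string out;
    out.reserve(units.size());
    for (std::size_t i = 0; i < units.size(); ++i) {
        char32_t cp = static_cast<char32_t>(units[i]);
        const bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        const bool isLow = cp >= 0xDC00 && cp <= 0xDFFF;
        if (isHigh && i + 1 < units.size()) {
            const char32_t next = static_cast<char32_t>(units[i + 1]);
            if (next >= 0xDC00 && next <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (next - 0xDC00);
                ++i; // the low surrogate has been consumed
                appendUtf8(out, cp);
                continue;
            }
        }
        // Unpaired surrogates and out-of-range values become U+FFFD.
        if (isHigh || isLow || cp > 0x10FFFF) {
            cp = 0xFFFD;
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

This sketch does not address the zip case, where the wide characters may hold something other than UTF-16 code units, so the performance and correctness questions raised above remain open.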
I understand now: you want to give the user an interface for setting the charset. If no such interface is provided, everything is simple: just deal with the UString going in and out.
If you provide this interface for setting the charset, it is only used by 7-Zip internally and does not affect how you read and write UStrings, because you always read and write UStrings, and the UString format is clear. Am I getting it right?
Sorry for the late reply.
> Am I getting it right?
I'm not sure, as I'm not quite sure what you're trying to say in your last two comments, sorry.
Bit7z cannot change the behavior of 7-Zip's internal code, so it has to deal with the wrongly encoded wide strings (UString).
If the wide characters in such strings are actually UTF-8 code units, or they're simply ASCII characters, then converting them to narrow UTF-8 strings (i.e., the strings used in the bit7z API) is easy.
All other cases will require some thought about how to do the conversion without unnecessary overhead. Also, the lack of support for the mcp option will require some internal changes to how bit7z handles string encoding.
Bit7z could provide the raw UStrings directly to the user, but wide strings are unusual and difficult to handle on Linux and macOS, so I will not consider this solution, as it would make the library harder to use.
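To make the easy case mentioned above concrete (wide characters that are really UTF-8 code units or plain ASCII), here is a minimal sketch. `narrowByteUnits` is a hypothetical name, not a bit7z function. It only checks that every value fits in one byte and does not verify that the resulting bytes are well-formed UTF-8, so values in the 0x80 to 0xFF range could still be misread if the input was actually something else.

```cpp
#include <string>
#include <utility>

// Hypothetical helper (illustration only): if every wide character fits in
// one byte, treat the values as UTF-8 code units (or ASCII) and copy them
// into a narrow string. Returns false, leaving `out` untouched, when a
// value above 0xFF shows that the input uses some other encoding.
bool narrowByteUnits(const std::wstring& wide, std::string& out) {
    std::string result;
    result.reserve(wide.size());
    for (const wchar_t wc : wide) {
        const auto value = static_cast<unsigned long>(wc);
        if (value > 0xFF) {
            return false;
        }
        result.push_back(static_cast<char>(static_cast<unsigned char>(value)));
    }
    out = std::move(result);
    return true;
}
```

A caller could try this fast path first and fall back to a slower conversion when it returns false.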
