7-Zip-JBinding-4Android

Chinese display with messy code

Open ZHJ-30 opened this issue 3 years ago • 15 comments

When I extract files, String path = (String) inArchive.getProperty(index, PropID.PATH) returns garbled text if the path contains Chinese characters. How can I fix it? Looking forward to your reply. Thanks.

ZHJ-30 avatar Sep 15 '21 07:09 ZHJ-30

Tested this as follows:

I created a 7z-archive using 7-Zip for Windows, that included both folder name and file name with kanji characters. I used Japanese for this particular test, but 7z uses Unicode UTF-8 for such file names, so this should work similarly for Chinese characters too.

Next I extracted this using 7-Zip-JBinding-4Android and was able to confirm the correct folder name (not garbled) using IInArchive.getStringProperty(index, PropID.PATH).

I suspect the archive with Chinese folder names that you are testing with is compressed using a different code page / locale. See similar discussion here: https://sourceforge.net/p/p7zip/discussion/383044/thread/3d213124/

Can you extract the archive correctly using the 7-Zip command line tool?

omicronapps avatar Sep 20 '21 02:09 omicronapps

Thanks for your answer. My archive extraction code is in the attached images. I'm not sure if there's a problem in it.

ZHJ-30 avatar Sep 30 '21 07:09 ZHJ-30

I don't think there's an issue with the source code or library.

Rather I think this is caused by the file archive that's being extracted.

Can you provide an example archive file that results in garbled characters? For example, by adding this file to a GitHub project, etc.

omicronapps avatar Oct 02 '21 16:10 omicronapps

hello

I have the same problem now. With Chinese files compressed by some versions of 7Z, the file names I obtain are displayed as garbled characters. The problem file is attached.

I tried to get the codePage and other information to deal with the garbled names myself, but the values I got were all null. Is there any way to get information about the character set of the compressed file names? And how can I determine whether the library has successfully read a file's character set information?

I hope you can help analyze the reason. Looking forward to your reply. Thank you very much.

zip压缩包c7z.zip

EvilThunder avatar Oct 23 '21 10:10 EvilThunder

If the file names are not encoded with UTF-16, then you will need to manually convert the file names to the correct character set.

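For example, the following converts to the "GBK" code page:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

// inArchive is an opened IInArchive instance; i is the item index.
String path = inArchive.getStringProperty(i, PropID.PATH);
byte[] ba = path.getBytes(); // uses the platform default charset
ByteBuffer bb = ByteBuffer.wrap(ba);
Charset cs = Charset.forName("GBK");
CharsetDecoder cd = cs.newDecoder();
CharBuffer cb = cd.decode(bb); // throws CharacterCodingException on malformed input
String gbk_path = cb.toString();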

If there is no information about the code page in the archive, then this was not included when the archive was created. In this case, you will need to manually provide this information and ensure that the file names are decoded correctly.
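As a self-contained illustration of the conversion above, here is a minimal sketch that re-decodes a garbled name. Unlike the snippet above, it names the intermediate charset explicitly. It assumes the garbled string was produced by mapping each raw byte to one character (equivalent to ISO-8859-1). That is an assumption and not confirmed behavior of 7-Zip-JBinding, so the intermediate charset may need to be changed after testing on a device. FileNameRecoder and redecode are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of 7-Zip-JBinding.
class FileNameRecoder {
    // Turns the garbled string back into bytes using the charset it was
    // (assumed to be) decoded with, then decodes those bytes correctly.
    static String redecode(String garbled, Charset assumedWrong, Charset actual) {
        byte[] raw = garbled.getBytes(assumedWrong);
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String original = "\u4e2d\u6587.txt"; // two Chinese characters plus ".txt"
        // Simulate the problem: GBK bytes read as if they were ISO-8859-1.
        String garbled = new String(original.getBytes(gbk), StandardCharsets.ISO_8859_1);
        String fixed = redecode(garbled, StandardCharsets.ISO_8859_1, gbk);
        System.out.println(fixed.equals(original));
    }
}
```

The round trip only works if the intermediate charset is lossless for every byte value, which is why ISO-8859-1 is used in the simulation. If the intermediate step already replaced bytes with substitution characters, the original name cannot be recovered this way.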

omicronapps avatar Oct 25 '21 01:10 omicronapps

Thanks for your answer.

Is there any way to determine whether the library has obtained the code page information? I want to convert to the "GBK" code page if the library has not.

EvilThunder avatar Oct 25 '21 01:10 EvilThunder

The 7-Zip-JBinding library will not make any code page conversions. The library will provide all strings from the archive unmodified. That is, the strings will be in the same format (code page) as when the archive was originally created.

If the archive includes information in the PropID.CODE_PAGE property, then you can use this information. But if this property does not exist in the archive, then you must know which code page was used during compression.

I would recommend using UTF-16 when creating new archive files to avoid issues like this. If this is not possible then the application using 7-Zip-JBinding must have information (for example from the user) of the code page of the archive.
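The decision described above (use the archive's code page when present, otherwise rely on what the user supplies) could be sketched as follows. This assumes the code page value, if any, is a Windows code page number; the thread does not confirm the type or format returned for PropID.CODE_PAGE, so the value is passed in as a plain Integer here. CodePageResolver and resolve are hypothetical names, and the mapping table is intentionally tiny.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of 7-Zip-JBinding.
class CodePageResolver {
    // Maps a few Windows code page numbers to Java charsets and falls back
    // to the user's choice when the archive gives no usable information.
    static Charset resolve(Integer codePage, Charset userChoice) {
        if (codePage == null) {
            return userChoice; // archive carries no code page information
        }
        switch (codePage) {
            case 936:   return Charset.forName("GBK");    // Simplified Chinese
            case 1200:  return StandardCharsets.UTF_16LE; // UTF-16 little endian
            case 65001: return StandardCharsets.UTF_8;    // UTF-8
            default:    return userChoice;                // unknown number
        }
    }
}
```

For example, resolve(null, Charset.forName("GBK")) returns GBK, which matches the case in this thread where the property comes back as null and the application has to supply the charset itself.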

omicronapps avatar Oct 27 '21 03:10 omicronapps

If the file names are not encoded with UTF-16, then I will need to manually convert the file names to the correct character set. But how do I determine whether file names are encoded with UTF-16?

Does the library have access to other attributes related to the character set, such as those mentioned in the "APPENDIX D-Language Encoding (EFS)" section in the link?

https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

EvilThunder avatar Oct 27 '21 12:10 EvilThunder

If the file names are not encoded with UTF-16, then I will need to manually convert the file names to the correct character set.

Yes, correct.

But how do I determine whether file names are encoded with UTF-16?

The only way I'm aware of is if the CODE_PAGE property is set, but it looks like this is not used for Zip archives.

Does the library have access to other attributes related to the character set, such as those mentioned in the "APPENDIX D-Language Encoding (EFS)" section in the link?

No, probably not. It looks to me like 7-Zip only handles the following extra IDs for Zip archives.

ZipHeader.h:

  namespace NExtraID
  {
    enum
    {
      kZip64 = 0x01,
      kNTFS = 0x0A,
      kStrongEncrypt = 0x17,
      kUnixTime = 0x5455,
      kIzUnicodeComment = 0x6375,
      kIzUnicodeName = 0x7075,
      kWzAES = 0x9901
    };
  }

7-Zip-JBinding uses 7-Zip version 16.02.

I checked the latest version, 7-Zip 21.02, but it looks like 0x0008 (PFS) is still not supported for Zip archives: https://sourceforge.net/projects/sevenzip/files/7-Zip/21.02/

You may want to check here about 7-Zip support for Zip archives: https://sourceforge.net/p/sevenzip/support-requests/
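Since the library does not expose the Language Encoding flag, an application could read it directly from the Zip file. The sketch below follows the local file header layout in APPNOTE.TXT (4-byte signature, 2-byte version needed, 2-byte general purpose flags, where bit 11 is the language encoding flag). ZipEfsCheck and firstEntryDeclaresUtf8 are hypothetical names. The sketch only inspects the first entry and assumes the file starts with a local file header.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of 7-Zip or 7-Zip-JBinding.
class ZipEfsCheck {
    static final int LOCAL_HEADER_SIGNATURE = 0x04034b50;
    static final int EFS_FLAG = 0x0800; // general purpose bit 11

    // Returns true if the first local file header declares UTF-8 names.
    // 'head' must hold at least the first 8 bytes of the Zip file.
    static boolean firstEntryDeclaresUtf8(byte[] head) {
        if (head.length < 8) {
            throw new IllegalArgumentException("need at least 8 bytes");
        }
        ByteBuffer bb = ByteBuffer.wrap(head).order(ByteOrder.LITTLE_ENDIAN);
        if (bb.getInt(0) != LOCAL_HEADER_SIGNATURE) {
            throw new IllegalArgumentException("not a local file header");
        }
        // Flags follow the signature (4 bytes) and version needed (2 bytes).
        int flags = bb.getShort(6) & 0xFFFF;
        return (flags & EFS_FLAG) != 0;
    }
}
```

A cleared flag does not prove that the names are in a legacy code page, because some tools write UTF-8 names without setting it. Treat the result as a hint and keep a manual override.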

omicronapps avatar Oct 29 '21 03:10 omicronapps

I'm also facing this issue with Chinese file name, I compressed that file using macOS default compression. On extracting this file on Android using this library, I see garbled file names. Any workaround? @omicronapps

asthagarg2428 avatar May 18 '22 08:05 asthagarg2428

Were you able to solve this issue?

asthagarg2428 avatar May 18 '22 15:05 asthagarg2428

How did you handle the case then? I'm also stuck

asthagarg2428 avatar May 18 '22 15:05 asthagarg2428

The 7-Zip-JBinding library will not make any code page conversions. The library will provide all strings from the archive unmodified. That is, the strings will be in the same format (code page) as when the archive was originally created.

If the file names are not encoded with UTF-16, then you will need to manually convert the file names to the correct character set.

7-Zip does not handle the Language Encoding flag (EFS). This means that it's not possible for 7-Zip to determine the code page.

I would recommend using UTF-16 when creating new archive files to avoid issues like this. If this is not possible then the application using 7-Zip-JBinding must have information (for example from the user) of the code page of the archive.

Additional information in previous replies above.

omicronapps avatar Jun 02 '22 04:06 omicronapps

But it seems that different languages are supported by this library.

  • I renamed a text file to a Chinese name and compressed it using a third-party app, Keka: I was ABLE to extract it using 7-Zip-JBinding-4Android
  • I compressed the same text file using the macOS default compression: I was UNABLE to extract it using 7-Zip-JBinding-4Android

I'm unable to understand the difference between the two and how to solve it.

asthagarg2428 avatar Jun 07 '22 12:06 asthagarg2428

7-Zip supports UTF-8 and UTF-16-LE character encoding. Mac/OSX, on the other hand, uses GBK for Chinese characters. It appears that Keka uses UTF-8, which is why this works with 7-Zip.

I would recommend adding a user dialog to manually select between GBK and UTF character encodings when selecting a file for extraction. I'm afraid 7-Zip does not include support for detecting the EFS.
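To pre-select a sensible default in such a dialog, one option is to test whether the raw name bytes are well-formed UTF-8 and suggest GBK otherwise. This is a heuristic sketch, not a reliable detector: it assumes the application has access to the raw name bytes, and that GBK is the only legacy alternative of interest. CharsetGuess, isValidUtf8 and suggest are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of 7-Zip-JBinding.
class CharsetGuess {
    // Strict check: true only if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Suggests a default for the dialog; the user can still override it.
    static Charset suggest(byte[] raw) {
        return isValidUtf8(raw) ? StandardCharsets.UTF_8 : Charset.forName("GBK");
    }
}
```

Pure ASCII names pass the UTF-8 check, which is harmless because they decode identically under both charsets. Short GBK names can occasionally also be valid UTF-8, so the user's manual choice should always take precedence.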

omicronapps avatar Jun 13 '22 00:06 omicronapps