How to extract files with non-English filenames

Open hgdagon opened this issue 6 years ago • 20 comments

I'm extracting this old Korean game. It has only one file with a Korean filename, and that file is either extracted with a gibberish filename or the extraction errors out.

No options or -R results in:

extracting: ./App_Executables/custom/A_AúÄ¿½ºÅO½ºÅ²,,µå'A1y.txt

Extracting with -e results in:

Could not encode text to 'EUC-KR' error Illegal byte sequence
Failed to extract file 'A_AúÄ¿½ºÅO½ºÅ²,,µå'A1y.txt'.Run unshield again with -D 3 for more information.

I've tried EUC-KR, ISO-2022-KR, and KSC_5601. KSC_5601 didn't extract any of the files.

The correct filename is 유저커스텀스킨만드는법.txt

P.S. On a slightly unrelated note, is there a generic command to extract only the program files, skipping the IS internal files? As far as I understand the group(s) containing the program files vary between IS versions (or every single program).

hgdagon avatar Oct 21 '18 20:10 hgdagon
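
The gibberish in the first listing looks like EUC-KR bytes shown through a single-byte Western code page. Below is a minimal Python sketch of that reading. Python's codecs stand in for iconv here, and both the stored encoding (EUC-KR) and the mis-decoding (Latin-1) are assumptions, not something the thread confirms.

```python
# Sketch: reproduce the garbled name by decoding EUC-KR bytes as Latin-1.
# Assumption: the cabinet stores the name as EUC-KR (or its superset CP949).
correct_name = "유저커스텀스킨만드는법.txt"

# Bytes as they are presumably stored inside the cabinet.
raw_bytes = correct_name.encode("euc-kr")

# What a Western single-byte code page makes of those bytes.
garbled = raw_bytes.decode("latin-1")
print(garbled)

# Undoing the wrong decoding recovers the intended name.
recovered = garbled.encode("latin-1").decode("euc-kr")
print(recovered == correct_name)
```

The printed string should resemble the name in the report above, apart from characters the console replaced with a closest ASCII match.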

@hgdagon I think the charsets you are using are too modern. It must be some old Microsoft Windows charset. Please try -e CP949 and report your results here!

twogood avatar Oct 22 '18 06:10 twogood

As for your P.S., there is -g GROUP ("Only list/extract this file group"), if that helps. The help text also lists -c COMPONENT ("Only list/extract this component"), but I have never implemented that :(

twogood avatar Oct 22 '18 06:10 twogood
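
For scripted use, the options mentioned so far can be assembled into one command line. This is a small Python sketch, not part of unshield: `build_unshield_command` is a hypothetical helper, and the group name is the one reported in this thread. It uses only the -g and -e options and the x / l actions that appear above, with options placed before the action as in the examples in this thread.

```python
from typing import List, Optional


def build_unshield_command(cabinet: str,
                           action: str = "x",
                           group: Optional[str] = None,
                           encoding: Optional[str] = None) -> List[str]:
    """Assemble an unshield argument list (hypothetical helper).

    Options are placed before the action ("x" to extract, "l" to list).
    """
    command = ["unshield"]
    if group is not None:
        command += ["-g", group]      # only list/extract this file group
    if encoding is not None:
        command += ["-e", encoding]   # filename encoding, e.g. EUC-KR
    command += [action, cabinet]
    return command


example = build_unshield_command("data1.cab",
                                 group="App_Executables",
                                 encoding="EUC-KR")
print(" ".join(example))
```

The resulting list can be passed to `subprocess.run(example, check=True)` on a machine where unshield is installed. The group still has to be looked up per installer, so this does not solve the generic case.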

CP949 and ISO-2022-KR don't work either; same error. I thought I had searched for all Korean encodings... Maybe there's some indication of the codepage in the setup itself, or some documentation on which encoding IS uses for Korean by default? I couldn't find any myself.

As for extracting program files only: this setup contains all non-IS files in a group called App_Executables, but I remember seeing installers that did not contain such a group at all, so the group name is arbitrary, varying either with each version of IS or with every single program. Your solution still requires manually looking at the groups/components, while I'm looking for a more generic approach (one that would work with any installer) to just skip IS's own internal files when extracting.

Anyway, I was hoping unshield would be this magical AIO solution for all the headaches with old IS installers; I thought I could just slap it into Inno Setup and have a generic installer for any old CD I come across. So, it's not as trivial as the encoding issue.

Thanks for the response!

hgdagon avatar Oct 23 '18 23:10 hgdagon

Can you share data*.cab and data*.hdr with me somehow? For example Dropbox, Google drive, or a regular web link?

twogood avatar Oct 24 '18 04:10 twogood

https://www.dropbox.com/s/3qq6v83c27alfz9/Disk1.7z?dl=1

hgdagon avatar Oct 26 '18 03:10 hgdagon

Hi again, sorry for the delayed response!

For me it extracts fine with -e euc-kr but when listing the file the name is not converted!

unshield -e euc-kr x data1.cab  
Cabinet: data1.cab
...
  extracting: ./App_Executables/custom/유저커스텀스킨만드는법.txt
...
 --------  -------
          67 files
unshield -e EUC-KR l data1.cab
Cabinet: data1.cab
...
      978  App Executables\custom\����Ŀ���ҽ�Ų�����¹�.txt
 --------  -------
          67 files

So maybe I should actually use the -e parameter when listing files too :)

Which versions of unshield and libiconv are you using, given that this command didn't work for you?

twogood avatar Oct 30 '18 17:10 twogood

Now I see, you are missing this bug fix: https://github.com/twogood/unshield/pull/76/commits/592f1d62d97b48064509137b0dbc79241800d1d8

twogood avatar Oct 30 '18 18:10 twogood

So this is a duplicate of #77 but I haven't made any new unshield release since then, sorry about that!

twogood avatar Oct 30 '18 18:10 twogood

@hgdagon Please try unshield 1.4.3: https://github.com/twogood/unshield/releases/tag/1.4.3

twogood avatar Oct 30 '18 19:10 twogood

Well, according to the timestamps, I cloned the repo on Sep 14, and:

local/mingw-w64-x86_64-libiconv 1.15-3

So, I pulled the update, recompiled it, and, I regret to say, it's still the same on my side, Illegal byte sequence... Tried both static and shared builds.

I probably should've mentioned before that I'm building on Windows in msys2 (MinGW-w64), although I don't see why that should be an issue.

hgdagon avatar Oct 31 '18 02:10 hgdagon

Thank you @hgdagon for your prompt reply. I guess I'll have to try this in a Windows environment then. I'm not sure if the -e parameter has ever been tested on Windows.

If you run the iconv command line tool in your Windows environment with the -l command line parameter, is EUC-KR included in the list?

iconv -l |grep -i EUC-KR

twogood avatar Oct 31 '18 07:10 twogood

In the meantime, could you try to change the first parameter of the iconv_open call on line 762 in unshield.c from an empty string to UTF-8?

Current: if ((encoding_descriptor = iconv_open("", encoding)) == (iconv_t)-1)

Modified: if ((encoding_descriptor = iconv_open("UTF-8", encoding)) == (iconv_t)-1)

twogood avatar Oct 31 '18 07:10 twogood
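
In GNU libiconv, an empty target name passed to iconv_open selects the charset of the current locale. If that charset is a Western single-byte code page, it has no way to represent Hangul, which would plausibly explain the "Illegal byte sequence" error. Here is a rough Python analogue of the two variants: `convert_name` is a hypothetical helper, and cp1252 is only an assumed stand-in for an English Windows locale.

```python
import locale


def convert_name(raw: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Rough analogue of an iconv conversion: decode from the source
    charset, then re-encode to the target charset."""
    return raw.decode(source_encoding).encode(target_encoding)


# Assumption: the cabinet stores the filename as EUC-KR bytes.
raw_name = "유저커스텀스킨만드는법.txt".encode("euc-kr")

# A UTF-8 target can represent every Hangul syllable, so this succeeds.
utf8_name = convert_name(raw_name, "euc-kr", "utf-8")
print(utf8_name.decode("utf-8"))

# A Western single-byte target has no Hangul, so the conversion fails.
try:
    convert_name(raw_name, "euc-kr", "cp1252")
    western_ok = True
except UnicodeEncodeError as error:
    western_ok = False
    print("conversion failed:", error.reason)

# The charset an empty target name would roughly correspond to on this machine.
print("locale charset:", locale.getpreferredencoding(False))
```

A successful conversion to UTF-8 does not by itself guarantee a correct filename on disk: the program still has to hand the result to the operating system in an encoding the file API expects.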

If you run the iconv command line tool in your Windows environment with the -l command line parameter, is EUC-KR included in the list?

iconv -l |grep -i EUC-KR

EUC-KR EUCKR CSEUCKR

In the meantime, could you try to change the first parameter of the iconv_open call on line 762 in unshield.c from an empty string to UTF-8?

That resulted in this:

extracting: ./App_Executables/custom/ìo ì ?ì»ìSí.?ìSí,"ëOë"oëS"ë².txt

That's what the output says, and the filename is slightly different:

유저커스텀스킨만드는법.txt

hgdagon avatar Oct 31 '18 09:10 hgdagon
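
The new output is consistent with correct UTF-8 bytes being interpreted one byte at a time by a Western code page, which is what byte-oriented Windows console and file APIs typically do with the ANSI code page. The Python sketch below illustrates that assumption. Latin-1 is used in place of the actual Windows code page because it maps every byte to a character, which keeps the round trip lossless.

```python
# Sketch: UTF-8 bytes of the Korean name viewed through a single-byte code page.
correct_name = "유저커스텀스킨만드는법.txt"

# After the iconv_open("UTF-8", ...) change, the name is presumably UTF-8.
utf8_bytes = correct_name.encode("utf-8")

# Each Hangul syllable is three UTF-8 bytes, so a single-byte code page
# shows three unrelated characters per syllable.
seen_as_ansi = utf8_bytes.decode("latin-1")
print(seen_as_ansi)
print(len(correct_name), "characters became", len(seen_as_ansi))

# Reversing the wrong interpretation recovers the intended name.
recovered = seen_as_ansi.encode("latin-1").decode("utf-8")
print(recovered == correct_name)
```

If this reading is right, the conversion itself works after the change, and the remaining problem is how the converted name reaches the console and the file system on Windows.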

@hgdagon: What is your character set in the terminal? Like the contents of the LANG environment variable? For example:

$ echo $LANG
en_US.UTF-8

twogood avatar Oct 31 '18 10:10 twogood

The character set in Msys2 shell is actually en_US.UTF-8. But I'm running the executable in command prompt. I tried running in the shell now and I did see the correct filename.

extracting: ./App_Executables/custom/유저커스텀스킨만드는법.txt

But only in the shell output, the actual filename is still the same:

유저커스텀스킨만드는법.txt

hgdagon avatar Oct 31 '18 10:10 hgdagon

What is your character set in Windows?

twogood avatar Oct 31 '18 11:10 twogood

Um... standard English, I would assume... Whatever comes with Win10, I don't have any weird MUIs installed or anything.

hgdagon avatar Oct 31 '18 12:10 hgdagon

I tried extracting on my Linux machine to see if the same errors occur.

Unshield 1.4.2 (current release in Manjaro repo) weirdly enough resulted in this error for every file:

Could not encode text to 'EUC-KR' error Argument list too long

So instead I installed unshield-git from AUR (here's the PKGBUILD), and then it worked like a charm: no error and correct filename.

Considering this, I'm gonna try building in MSVC and comment back the results. I'm not entirely sure how to get libraries for MSVC, so it's gonna take some time. Will comment back sometime today or tomorrow.

hgdagon avatar Oct 31 '18 14:10 hgdagon

Well, it's been a week, and, sadly, Visual Studio and I still don't speak the same language. Since the last time I tackled VS, apparently, there's this new thing called vcpkg, which I thought would bring some sense into the whole mess, but it just doesn't work. At least, I couldn't get it to work. Anyway, I can confirm that when built with msys2(MinGW-w64), encoding conversion doesn't work.

hgdagon avatar Nov 08 '18 02:11 hgdagon

Thanks for your update. I still haven't tried it on MS Windows myself.

twogood avatar Nov 11 '18 11:11 twogood