Microsoft.PowerShell.Archive
Add -EntryEncoding as a parameter on Expand-Archive cmdlet
Summary of the new feature/enhancement
The Expand-Archive cmdlet does not currently allow specifying the expected encoding of file names in the archive to be expanded. This means the cmdlet cannot predictably be used to expand ZIP files created with tools other than PowerShell itself (meaning Compress-Archive).
For example, the Expand-Archive cmdlet cannot predictably unpack an archive created with Windows File Explorer (the Compressed Folders feature) unless the archive uses only 7-bit ASCII characters for its entry names. This is Windows not being compatible with Windows and should be fixed.
Proposed technical implementation details
Unfortunately, the ZIP spec does not define enough detail for a consumer of an archive to reliably tell which encoding was used for the names of the entries in the archive. Therefore, the only possible solution is to ask the consumer what it should be.
The proposed solution is to add an encoding parameter to the Expand-Archive cmdlet and to document what happens if the parameter is not specified, which is the current behavior. I suggest naming the parameter EntryEncoding to make it clear that it concerns how the ZIP entry names are encoded, not the encoding of the file content or of the archive name itself.
Note: Overall I like what Compress-Archive is doing, consistently using UTF-8 for the file names, but the truth of the matter is that most PowerShell users will expect to be able to use the Expand-Archive cmdlet to also expand archives that were not created by PowerShell itself.
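To make the proposal concrete, here is a minimal sketch of what such a parameter could look like, written as a stand-alone function. The function name Expand-ArchiveWithEntryEncoding and the UTF-8 default are assumptions made for illustration only; nothing like this exists in Microsoft.PowerShell.Archive today. The sketch simply forwards to the .NET ZipFile.ExtractToDirectory overload that accepts an entry-name encoding.

```powershell
# Hypothetical helper for illustration; NOT part of Microsoft.PowerShell.Archive.
# Needed in Windows PowerShell 5.1, where the ZipFile type is not loaded by default.
Add-Type -AssemblyName System.IO.Compression.FileSystem

function Expand-ArchiveWithEntryEncoding {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [string] $DestinationPath,
        # Encoding of the entry NAMES inside the archive (not of the file contents).
        # Assumption: UTF-8 when the caller does not say otherwise.
        [System.Text.Encoding] $EntryEncoding = [System.Text.Encoding]::UTF8
    )
    # Resolve against PowerShell's current location, which can differ from
    # the process working directory that .NET would otherwise use.
    $zip  = $PSCmdlet.GetUnresolvedProviderPathFromPSPath($Path)
    $dest = $PSCmdlet.GetUnresolvedProviderPathFromPSPath($DestinationPath)
    [System.IO.Compression.ZipFile]::ExtractToDirectory($zip, $dest, $EntryEncoding)
}
```

A caller who knows the archive came from a Western European Windows machine might then run something like Expand-ArchiveWithEntryEncoding -Path .\test.zip -DestinationPath .\out -EntryEncoding ([System.Text.Encoding]::GetEncoding(850)), where code page 850 is just an example value.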
Test case
1. Create some empty files with names such as Père-Noël.txt, Plankalkül.txt and Ærø-Å.txt (simply using some examples from the Western European charset here).
2. Create a ZIP archive of these files using the Windows Compressed Folders feature (or 7-Zip, or any other ZIP tool for Windows, anything except PowerShell itself).
3. Attempt to unpack the archive from step 2 using the Expand-Archive cmdlet. The result should be that the file names from step 1 are preserved.
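For anyone who wants to script this test case without File Explorer or 7-Zip, the sketch below builds a comparable fixture. It is an approximation, not the real thing: it uses .NET's ZipFile.CreateFromDirectory with an explicit entry-name encoding to stand in for a legacy tool, and code page 850 is an assumed stand-in for the OEM code page of a Western European Windows system. The function name New-LegacyZipFixture is made up for this example.

```powershell
# Needed in Windows PowerShell 5.1, where the ZipFile type is not loaded by default.
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical helper: creates empty files and zips them with non-UTF-8 entry names.
function New-LegacyZipFixture {
    param(
        [string[]] $Names = @('Père-Noël.txt', 'Plankalkül.txt', 'Ærø-Å.txt'),
        # Assumption: 850 approximates a Western European OEM code page.
        [int] $CodePage = 850
    )
    $root = Join-Path ([System.IO.Path]::GetTempPath()) ([guid]::NewGuid().ToString())
    $src  = Join-Path $root 'src'
    New-Item -ItemType Directory -Path $src | Out-Null

    # Step 1: empty files with non-ASCII names.
    foreach ($name in $Names) {
        New-Item -ItemType File -Path (Join-Path $src $name) | Out-Null
    }

    # Step 2: archive them, writing the entry names in the chosen code page.
    $zip = Join-Path $root 'legacy.zip'
    [System.IO.Compression.ZipFile]::CreateFromDirectory(
        $src, $zip,
        [System.IO.Compression.CompressionLevel]::Optimal,
        $false,   # do not include the base directory in entry names
        [System.Text.Encoding]::GetEncoding($CodePage))

    # Return the path of the archive for use in step 3.
    $zip
}
```

Step 3 is then a matter of running Expand-Archive -LiteralPath (New-LegacyZipFixture) -DestinationPath on a fresh folder and comparing the resulting file names with the originals, for instance with Compare-Object.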
Thanks for opening this issue - a fix is long overdue.
As for:
Therefore, the only possible solution is to ask the consumer what it should be.
While an -Encoding parameter should definitely be supported, there is a way to provide more meaningful default behavior, which should make things just work in the majority of cases:
- First, always try to interpret an entry name as UTF-8.
- Only if that fails, interpret it as OEM-encoded (the legacy encoding still used by default by File Explorer and tools such as 7-Zip).
In fact, this logic seems to already be built into the [System.IO.Compression.ZipFile]::ExtractToDirectory() method when you pass the active OEM code page as the encoding argument: UTF-8 entries are still properly recognized.
(Strictly speaking, any UTF-8-encoded string is also a technically valid OEM-encoded string, so there is hypothetical ambiguity; in practice, however, a valid UTF-8 byte sequence resulting in a human-readable, intentional file name when interpreted as OEM-encoded is unlikely, and the .NET designers were apparently comfortable quietly resolving this ambiguity in favor of UTF-8.)
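The two-step rule above can be sketched on raw name bytes as follows. This is only an illustration of the idea, not a description of how .NET implements it internally; the function name ConvertFrom-EntryNameBytes and its parameters are hypothetical.

```powershell
# Hypothetical helper: decode entry-name bytes as UTF-8 first, OEM second.
function ConvertFrom-EntryNameBytes {
    param(
        [Parameter(Mandatory)] [byte[]] $Bytes,
        # The legacy (OEM) encoding to fall back to, supplied by the caller.
        [Parameter(Mandatory)] [System.Text.Encoding] $FallbackEncoding
    )
    # Strict UTF-8 decoder: no BOM, and throw on invalid byte sequences
    # instead of silently substituting replacement characters.
    $strictUtf8 = [System.Text.UTF8Encoding]::new($false, $true)
    try {
        # Step 1: accept the bytes as UTF-8 if they form a valid sequence.
        return $strictUtf8.GetString($Bytes)
    }
    catch {
        # Step 2: not valid UTF-8, so treat the bytes as OEM-encoded.
        return $FallbackEncoding.GetString($Bytes)
    }
}
```

A name such as Père-Noël.txt encoded in an OEM code page contains bytes that are not valid UTF-8, so it takes the fallback path; the same name encoded as UTF-8 is accepted in the first step. The ambiguity mentioned above is the case where OEM-encoded bytes happen to form valid UTF-8, which this sketch would also resolve in favor of UTF-8.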
Note: I'm assuming that it is the active OEM code page that is to be used, not the fixed code page 437, as used on US-English systems, for instance - some sources seem to suggest the latter, but that doesn't seem plausible.
For instance, thanks to the try-UTF-8-first approach, the following sample command is capable of properly processing an archive test.zip that contains any of the following: (a) all-UTF-8 entries, (b) all-OEM entries, (c) mixed UTF-8 and OEM entries (which 7-Zip creates if individual entries comprise Unicode characters that cannot be represented in the OEM character set).
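    # Needed in Windows PowerShell 5.1, where the ZipFile type is not loaded by default.
    Add-Type -AssemblyName System.IO.Compression.FileSystem

    # Extract into the current directory, using the active OEM code page for entry names.
    [System.IO.Compression.ZipFile]::ExtractToDirectory(
        "$pwd/test.zip",
        "$pwd",
        [System.Text.Encoding]::GetEncoding((Get-Culture).TextInfo.OEMCodePage)
    )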
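To check how the entry names would be decoded before extracting anything, a small sketch along the following lines can list them. It relies on the ZipFile.Open overload that takes an entry-name encoding; the function name Get-ZipEntryName is made up for this example, and the path should be a full path because .NET resolves relative paths against the process working directory rather than PowerShell's current location.

```powershell
# Needed in Windows PowerShell 5.1, where the ZipFile type is not loaded by default.
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical helper: list entry names as decoded with the given encoding.
function Get-ZipEntryName {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [System.Text.Encoding] $EntryEncoding
    )
    # 'Read' is converted by PowerShell to the ZipArchiveMode enum value.
    $archive = [System.IO.Compression.ZipFile]::Open($Path, 'Read', $EntryEncoding)
    try {
        # Emit each entry's path inside the archive.
        $archive.Entries | ForEach-Object { $_.FullName }
    }
    finally {
        # Release the file handle even if enumeration fails.
        $archive.Dispose()
    }
}
```

For example, Get-ZipEntryName -Path "$pwd/test.zip" -EntryEncoding ([System.Text.Encoding]::GetEncoding((Get-Culture).TextInfo.OEMCodePage)) uses the same encoding argument as the command above, so garbled names show up here without any files being written.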
I agree that if it is at all possible to improve on the default behavior - i.e. when the proposed -Encoding parameter is not specified - then that is of course a very good idea. Most people will not have considered encoding in relation to ZIP archives (they probably assume that they just "work") and most people will therefore not think to specify the encoding explicitly.
So, judging from the inactivity of this issue, my take as a user affected by this defect is that the current recommendation from Microsoft is:
This is de facto accepted behavior of Expand-Archive, and you should use another tool for uncompressing files. Per the recommendation from Microsoft here, the recommendation is to use tar instead. ...Yeah, I'm legit a bit salty right now... The in-depth analysis by lbruun and mklement0 is much appreciated though :)