pharo
pharo copied to clipboard
ZipArchive extract functionality does not have support for filenames containing `utf8` characters
Bug description
When trying to use ZipArchive to extract a zip containing files with non-latin characters in their names (for example German characters such as ä, ö or ü), they are not decoded appropriately. For example, a file Prüfung.txt would be extracted as Prüfung.txt.
To Reproduce Steps to reproduce the behavior:
- Run the following Pharo snippet to create a zip archive:
outDir := FileLocator imageDirectory.
fileName := 'Prüfung.txt'.
fileContent := 'Prüfung'.
archivePath := outDir / 'test.zip'.
archive := ZipArchive new.
member := archive
addString: fileContent utf8Encoded
as: fileName.
member isTextFile: true.
archivePath binaryWriteStreamDo: [:aStream |
archive writeTo: aStream
].
- Expand the archive you created by running:
outDir := FileLocator imageDirectory.
archivePath := outDir / 'test.zip'.
(ZipArchive new readFrom: archivePath) extractAllTo: outDir.
- Check the output directory. You should find that
Prüfung.txtwas extracted asPrüfung.txt.
Expected behavior
I would expect Prüfung.txt to be extracted as Prüfung.txt.
Screenshots
Version information:
- OS: MacOS
- Version: Monterey 12.1
- Pharo Version 9.0
Expected development cost
After some digging, I discovered that the issue comes from two functions: ZipFileMember>>#readCentralDirectoryFileHeaderFrom: and ZipFileMember>>#readLocalDirectoryFileHeaderFrom:, where filenames are decoded using asString, instead of utf8Decoded. Thus, this is pretty easy fix. Taking into account tests, I think that it'd take 1 hour max to solve this issue.
Suggested fix
- Change this line from:
fileName := (aStream next: fileNameLength) asString.
to:
fileName := (aStream next: fileNameLength) utf8Decoded.
- Change this line from:
fileName := (aStream next: fileNameLength) asString.
to:
fileName := (aStream next: fileNameLength) utf8Decoded.
Thanks for opening your first issue! Please check the CONTRIBUTING documents for some tips about which information should be provided. You can find information of how to do a Pull Request here: https://github.com/pharo-project/pharo/wiki/Contribute-a-fix-to-Pharo
GitHub
Pharo is a dynamic reflective pure object-oriented language supporting live programming inspired by Smalltalk. - Contribute a fix to Pharo · pharo-project/pharo Wiki
PR for review: https://github.com/pharo-project/pharo/pull/11470
When I do the change, the CI can not load MC packages:
ZnInvalidUTF8: Illegal continuation byte for utf-8 encoding
ZnUTF8Encoder>>error:
ZnUTF8Encoder>>errorIllegalContinuationByte
ZnUTF8Encoder>>nextCodePointFromStream:
[ :stream |
[ byteStream atEnd ] whileFalse: [ | codePoint |
codePoint := self nextCodePointFromStream: byteStream.
(codePoint > 255 and: [ stream originalContents isWideString not ])
ifTrue: [ | wideString position |
position := stream position.
wideString := WideString from: stream originalContents.
stream on: wideString; setFrom: position + 1 to: position ].
stream nextPut: (Character value: codePoint) ] ] in ZnUTF8Encoder(ZnUTFEncoder)>>decodeBytes: in Block: [ :stream |...
String class(SequenceableCollection class)>>new:streamContents:
String class(SequenceableCollection class)>>streamContents:
ZnUTF8Encoder(ZnUTFEncoder)>>decodeBytes:
ByteArray>>decodeWith:
ByteArray>>utf8Decoded
ZipFileMember>>readCentralDirectoryFileHeaderFrom: