pharo icon indicating copy to clipboard operation
pharo copied to clipboard

ZipArchive extract functionality does not have support for filenames containing `utf8` characters

Open a96tudor opened this issue 3 years ago • 3 comments

Bug description When trying to use ZipArchive to extract a zip containing files with non-latin characters in their names (for example German characters such as ä, ö or ü), they are not decoded appropriately. For example, a file Prüfung.txt would be extracted as Prüfung.txt.

To Reproduce Steps to reproduce the behavior:

  1. Run the following Pharo snippet to create a zip archive:
outDir := FileLocator imageDirectory.
fileName := 'Prüfung.txt'.
fileContent := 'Prüfung'.
archivePath := outDir / 'test.zip'.

archive := ZipArchive new.
member := archive 
	addString: fileContent utf8Encoded
	as: fileName.
member isTextFile: true.

archivePath binaryWriteStreamDo: [:aStream | 
	archive writeTo: aStream
].
  1. Expand the archive you created by running:
outDir := FileLocator imageDirectory.
archivePath := outDir / 'test.zip'.

(ZipArchive new readFrom: archivePath) extractAllTo: outDir.
  1. Check the output directory. You should find that Prüfung.txt was extracted as Prüfung.txt.

Expected behavior I would expect Prüfung.txt to be extracted as Prüfung.txt.

Screenshots

Screenshot 2022-05-30 at 14 37 09

Version information:

  • OS: MacOS
  • Version: Monterey 12.1
  • Pharo Version 9.0

Expected development cost After some digging, I discovered that the issue comes from two functions: ZipFileMember>>#readCentralDirectoryFileHeaderFrom: and ZipFileMember>>#readLocalDirectoryFileHeaderFrom:, where filenames are decoded using asString, instead of utf8Decoded. Thus, this is pretty easy fix. Taking into account tests, I think that it'd take 1 hour max to solve this issue.

Suggested fix

  1. Change this line from:
fileName := (aStream next: fileNameLength) asString.

to:

fileName := (aStream next: fileNameLength) utf8Decoded.
  1. Change this line from:
fileName := (aStream next: fileNameLength) asString.

to:

fileName := (aStream next: fileNameLength) utf8Decoded.

a96tudor avatar May 30 '22 12:05 a96tudor

Thanks for opening your first issue! Please check the CONTRIBUTING documents for some tips about which information should be provided. You can find information of how to do a Pull Request here: https://github.com/pharo-project/pharo/wiki/Contribute-a-fix-to-Pharo

GitHub
Pharo is a dynamic reflective pure object-oriented language supporting live programming inspired by Smalltalk. - Contribute a fix to Pharo · pharo-project/pharo Wiki

welcome[bot] avatar May 30 '22 12:05 welcome[bot]

PR for review: https://github.com/pharo-project/pharo/pull/11470

MarcusDenker avatar Jul 20 '22 07:07 MarcusDenker

When I do the change, the CI can not load MC packages:

ZnInvalidUTF8: Illegal continuation byte for utf-8 encoding
ZnUTF8Encoder>>error:
ZnUTF8Encoder>>errorIllegalContinuationByte
ZnUTF8Encoder>>nextCodePointFromStream:
[ :stream |
		[ byteStream atEnd ] whileFalse: [ | codePoint |
			codePoint := self nextCodePointFromStream: byteStream.
			(codePoint > 255 and: [ stream originalContents isWideString not ])
				ifTrue: [ | wideString position |
					position := stream position.
					wideString := WideString from: stream originalContents.
					stream on: wideString; setFrom: position + 1 to: position ].
			stream nextPut: (Character value: codePoint) ] ] in ZnUTF8Encoder(ZnUTFEncoder)>>decodeBytes: in Block: [ :stream |...
String class(SequenceableCollection class)>>new:streamContents:
String class(SequenceableCollection class)>>streamContents:
ZnUTF8Encoder(ZnUTFEncoder)>>decodeBytes:
ByteArray>>decodeWith:
ByteArray>>utf8Decoded
ZipFileMember>>readCentralDirectoryFileHeaderFrom:

MarcusDenker avatar Jul 20 '22 08:07 MarcusDenker