iceberg ZnInvalidUTF8 when reading a commit message

The error message is ZnInvalidUTF8: Illegal leading byte for utf-8 encoding.

This happens in a P9 with Iceberg v2.0.3 . Reproduce with:

Open Iceberg repositories list
Clone https://github.com/pharo-graphics/Bloc/
Open the bloc repository (cmd+r)

The error is in this commit: https://github.com/pharo-graphics/Bloc/commit/8237f3e5ed9f8100639419e5b677a2b97cc54c55

Feb 24 '21 19:02 tinchodias

It happens during the execution of this method:

LGitCommit >>
commit_message: commit
	
	^ self ffiCallSafely: #(String git_commit_message #(self)) options: #()

Feb 24 '21 19:02 tinchodias

Important

I can reproduce it with latest vm:

curl https://get.pharo.org/64/90+vmHeadlessLatest | bash

but not with:

curl https://get.pharo.org/64/90+vm | bash

Feb 24 '21 20:02 tinchodias

After cloning the Bloc repository, the error can be reproduced with:

repo := IceRepository registry 
	detect: [ :repo | repo name = 'Bloc' ].
	
(repo commitishNamed: '8237f3e5ed9f8100639419e5b677a2b97cc54c55') libgitCommit message.

Feb 24 '21 21:02 tinchodias

Note

The message is printed as [baseline] Bloc loads Bloc-TaskIt package , but the raw byte of the space after [baseline] is not the Character space but the byte 160.

I debugged in both images (or both VMs, in fact), and checked that the raw bytes obtained from ligbit2 are the same.


repo := IceRepository registry detect: [ :repo | repo name = 'Bloc' ].
commit := (repo commitishNamed: '8237f3e5ed9f8100639419e5b677a2b97cc54c55') libgitCommit.
workingFineMessage := commit message.
nonWorkingMessage := (commit commit_message_raw) fromCStringRaw utf8Decoded.

3 methods to execute the snippet: fileout.zip

Feb 24 '21 21:02 tinchodias

But if as I observed, in both cases the raw bytes from libgit2 are the same, then the error can be reproduced with this snippet:

ZnCharacterEncoder utf8 decodeBytes:  #[91 98 97 115 101 108 105 110 101 93 160 66 108 111 99 32 108 111 97 100 115 32 66 108 111 99 45 84 97 115 107 73 116 32 112 97 99 107 97 103 101]

And I'm lost why LGitCommit>>message tolerates the byte 160 and not in the other case.

Feb 24 '21 21:02 tinchodias

@guillep discovered it can be decoded correctly with a different decoder, using detectEncoding: like this:

bytes := #[91 98 97 115 101 108 105 110 101 93 160 66 108 111 99 32 108 111 97 100 115 32 66 108 111 99 45 84 97 115 107 73 116 32 112 97 99 107 97 103 101].

e := ZnCharacterEncoder detectEncoding: bytes. 
">>> a ZnSimplifiedByteEncoder('iso88591' strict)"

e decodeBytes: bytes  
">>> '[baseline] Bloc loads Bloc-TaskIt package'"

Then, I thought the problem was going to be fixed with this "todo":

LGitCommit >>
message
	<todo: 'use encoding to properly read the message'>
	|  encoding |
	encoding := self commit_message_encoding: self.
	^ self commit_message: self

But, to my surprise, the answer of commit_message_encoding:is 'UTF-8'. So, libgit2 tells us to use UTF8...

Mar 01 '21 11:03 tinchodias

My fix includes 3 fileouts including:

LGitCommit >>
message

	| bytes |
	bytes := (self commit_message: self) fromCStringRaw.
	
	[ ^ ZnCharacterEncoder utf8 decodeBytes: bytes ]
		on: ZnCharacterEncodingError
		do:[ (ZnCharacterEncoder detectEncoding: bytes) decodeBytes: bytes ]

But also 2 fromCStringRawin FFI classes, since fromCString already assumes a UTF-8 encoding (I think this is wrong).

Fileouts: Fix.zip

Mar 01 '21 17:03 tinchodias

My previous zip lacks 1 extra change, included here: Fix2.zip

May 04 '21 00:05 tinchodias

we merged https://github.com/pharo-project/pharo/pull/11239

Can you check if this fixes the issue?

May 30 '22 09:05 MarcusDenker

@tinchodias do the PR, come on... now @guillep is very disappointed and has to do it. You own Guille medialunas.

Sep 06 '22 09:09 tesonep

iceberg iceberg copied to clipboard

ZnInvalidUTF8 when reading a commit message

iceberg
iceberg copied to clipboard