ZnInvalidUTF8 when reading a commit message
The error message is ZnInvalidUTF8: Illegal leading byte for utf-8 encoding.
This happens in Pharo 9 with Iceberg v2.0.3. To reproduce:
- Open Iceberg repositories list
- Clone https://github.com/pharo-graphics/Bloc/
- Open the bloc repository (cmd+r)
The error is in this commit: https://github.com/pharo-graphics/Bloc/commit/8237f3e5ed9f8100639419e5b677a2b97cc54c55
It happens during the execution of this method:
LGitCommit >>
commit_message: commit
^ self ffiCallSafely: #(String git_commit_message #(self)) options: #()
Important
I can reproduce it with the latest VM:
curl https://get.pharo.org/64/90+vmHeadlessLatest | bash
but not with:
curl https://get.pharo.org/64/90+vm | bash
After cloning the Bloc repository, the error can be reproduced with:
repo := IceRepository registry
detect: [ :each | each name = 'Bloc' ].
(repo commitishNamed: '8237f3e5ed9f8100639419e5b677a2b97cc54c55') libgitCommit message.
Note
The message is printed as "[baseline] Bloc loads Bloc-TaskIt package", but the byte after [baseline] is not an ASCII space (byte 32). It is byte 160, which is a non-breaking space in ISO-8859-1.
I debugged in both images (or rather, with both VMs) and checked that the raw bytes obtained from libgit2 are the same.
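To make the note above concrete, here is a small Playground sketch of why a lone byte 160 cannot start a UTF-8 sequence. It only assumes the Zinc encoders that ship with the Pharo image; the variable name nbsp is made up for illustration.

```smalltalk
"Byte 160 is 2r10100000. Its top two bits are 10, the pattern UTF-8
reserves for continuation bytes, so it can never lead a sequence."
(160 bitAnd: 2r11000000) = 2r10000000. "true"

"Code point 160 is the non-breaking space in ISO-8859-1 and in Unicode."
nbsp := Character value: 160.

"Valid UTF-8 encodes that code point as two bytes: 194 followed by 160."
(String with: nbsp) utf8Encoded. "#[194 160]"

"The two-byte form decodes fine with the strict UTF-8 decoder."
(ZnCharacterEncoder utf8 decodeBytes: #[194 160]) first = nbsp. "true"

"A lone 160 is rejected; the handler just answers the error description."
[ ZnCharacterEncoder utf8 decodeBytes: #[160] ]
	on: ZnCharacterEncodingError
	do: [ :error | error messageText ].
```

So the commit message looks like it was written in a single-byte encoding such as ISO-8859-1, even though it is being read as UTF-8.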
repo := IceRepository registry detect: [ :each | each name = 'Bloc' ].
commit := (repo commitishNamed: '8237f3e5ed9f8100639419e5b677a2b97cc54c55') libgitCommit.
workingFineMessage := commit message.
nonWorkingMessage := (commit commit_message_raw) fromCStringRaw utf8Decoded.
3 methods to execute the snippet: fileout.zip
But if, as I observed, the raw bytes from libgit2 are the same in both cases, then the error can be reproduced with this snippet alone:
ZnCharacterEncoder utf8 decodeBytes: #[91 98 97 115 101 108 105 110 101 93 160 66 108 111 99 32 108 111 97 100 115 32 66 108 111 99 45 84 97 115 107 73 116 32 112 97 99 107 97 103 101]
I don't understand why LGitCommit>>message tolerates byte 160 in one case but not in the other.
@guillep discovered it can be decoded correctly with a different decoder, using detectEncoding: like this:
bytes := #[91 98 97 115 101 108 105 110 101 93 160 66 108 111 99 32 108 111 97 100 115 32 66 108 111 99 45 84 97 115 107 73 116 32 112 97 99 107 97 103 101].
e := ZnCharacterEncoder detectEncoding: bytes.
">>> a ZnSimplifiedByteEncoder('iso88591' strict)"
e decodeBytes: bytes
">>> '[baseline] Bloc loads Bloc-TaskIt package'"
Then, I thought the problem was going to be fixed with this "todo":
LGitCommit >>
message
<todo: 'use encoding to properly read the message'>
| encoding |
encoding := self commit_message_encoding: self.
^ self commit_message: self
But, to my surprise, the answer of commit_message_encoding: is 'UTF-8'. So libgit2 tells us to use UTF-8...
My fix consists of 3 fileouts, including:
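LGitCommit >>
message
	| bytes |
	bytes := (self commit_message: self) fromCStringRaw.
	"Try strict UTF-8 first; on failure, fall back to the detected encoding.
	The ^ is outside the block so the fallback result is answered too."
	^ [ ZnCharacterEncoder utf8 decodeBytes: bytes ]
		on: ZnCharacterEncodingError
		do: [ (ZnCharacterEncoder detectEncoding: bytes) decodeBytes: bytes ]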
It also adds 2 fromCStringRaw methods in FFI classes, since fromCString already assumes a UTF-8 encoding (I think this is wrong).
Fileouts: Fix.zip
My previous zip was missing one extra change, included here: Fix2.zip
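For anyone who wants to try the fallback idea without loading the fileouts, here is a Playground sketch applied directly to the bytes from this issue. It is not the code in the zips, only an illustration of the same strategy; the variable names bytes and message are made up.

```smalltalk
"The raw commit message bytes from this issue; byte 160 follows '[baseline]'."
bytes := #[91 98 97 115 101 108 105 110 101 93 160 66 108 111 99 32 108 111 97 100 115 32 66 108 111 99 45 84 97 115 107 73 116 32 112 97 99 107 97 103 101].

"Try strict UTF-8 first. If Zinc signals a decoding error, decode with
whatever encoding detectEncoding: proposes for these bytes instead.
The value of on:do: is the decoded string in both cases."
message := [ ZnCharacterEncoder utf8 decodeBytes: bytes ]
	on: ZnCharacterEncodingError
	do: [ :error | (ZnCharacterEncoder detectEncoding: bytes) decodeBytes: bytes ].

"Inspect the character that caused the trouble (position 11)."
(message at: 11) asInteger.
```

One caveat of this design: encoding detection is a heuristic, so the fallback can produce readable but wrong characters for messages written in some other single-byte encoding. It only guarantees that reading the commit no longer fails.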
We merged https://github.com/pharo-project/pharo/pull/11239
Can you check if this fixes the issue?
@tinchodias do the PR, come on... now @guillep is very disappointed and has to do it. You owe Guille medialunas.