iceberg
iceberg copied to clipboard
Can not pull if a very old commit message contains Non-breaking space
Hi,
In one of the respositories there is a very old commit (from 2018) with a non-breaking space in the commit message that breaks Iceberg: https://github.com/feenkcom/gtoolkit-phlow/tree/3bf1a9728c47a834e105943d67a2e7dd0f01fb10
This leads to inability to pull new commits because non-breaking space is for some reason considered as an illegal utf8 sequence. Why would Iceberg load such old commits when they are 700 commits behind the HEAD?
Here are the bytes that git_commit_message
returns:
#(91 116 97 115 107 105 116 93 160 104 97 110 100 108 101 32 101 120 97 99 116 44 32 115 111 109 101 44 32 97 110 100 32 97 110 121 32 112 114 111 103 114 101 115 115 32 110 111 116 105 102 105 99 97 116 105 111 110) asByteArray utf8Decoded
Update: After a bit of digging, we found that git_commit_message
does not guarantee the returned message to be encoded in UTF-8. In fact:
Commit log messages are typically encoded in UTF-8, but other extended ASCII encodings are also supported. This includes ISO-8859-x, CP125x and many others, but not UTF-16/32, EBCDIC and CJK multi-byte encodings (GBK, Shift-JIS, Big5, EUC-x, CP9xx etc.). (https://git-scm.com/docs/git-commit/2.13.7#_discussion)
And this is what we see, the encoding of the previously mentioned commit is in fact ISO-8859-1
. However, the LGitLibrary
bindings indirectly assume it is always UTF8:
LGitLibrary>>#commit_message: commit
^ self ffiCallSafely: #(String git_commit_message #(self)) options: #()
The return type is declared as String
which means it is converted from the ExternalAddress
via: ExternalData(ByteArray)>>readStringUTF8
.
Any ideas on how to fix it?
Hi. I think it's the same issue reported as https://github.com/pharo-vcs/iceberg/issues/1426
I have a fix ready that essentially does this:
When reading the message, try to use the encoding provided in the Git commit header. If that header does not provide the correct encoding (i.e. we get an encoding error), just return the string as-is, since it’s the best we can do.
I’d be happy to contribute this back into Iceberg. Is this approach what we envisioned? If so, what target branch should I use?
I pushed the fix so it can be inspected for whether it’s desirable in its current state or not.
Thanks a lot we will have a look.