iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

Can not pull if a very old commit message contains Non-breaking space

Open syrel opened this issue 2 years ago • 5 comments

Hi,

In one of the respositories there is a very old commit (from 2018) with a non-breaking space in the commit message that breaks Iceberg: https://github.com/feenkcom/gtoolkit-phlow/tree/3bf1a9728c47a834e105943d67a2e7dd0f01fb10

This leads to inability to pull new commits because non-breaking space is for some reason considered as an illegal utf8 sequence. Why would Iceberg load such old commits when they are 700 commits behind the HEAD? Screen Shot 2021-07-07 at 10 35 45

Here are the bytes that git_commit_message returns:

#(91 116 97 115 107 105 116 93 160 104 97 110 100 108 101 32 101 120 97 99 116 44 32 115 111 109 101 44 32 97 110 100 32 97 110 121 32 112 114 111 103 114 101 115 115 32 110 111 116 105 102 105 99 97 116 105 111 110) asByteArray utf8Decoded

syrel avatar Jul 07 '21 08:07 syrel

Update: After a bit of digging, we found that git_commit_message does not guarantee the returned message to be encoded in UTF-8. In fact:

Commit log messages are typically encoded in UTF-8, but other extended ASCII encodings are also supported. This includes ISO-8859-x, CP125x and many others, but not UTF-16/32, EBCDIC and CJK multi-byte encodings (GBK, Shift-JIS, Big5, EUC-x, CP9xx etc.). (https://git-scm.com/docs/git-commit/2.13.7#_discussion)

And this is what we see, the encoding of the previously mentioned commit is in fact ISO-8859-1. However, the LGitLibrary bindings indirectly assume it is always UTF8:

LGitLibrary>>#commit_message: commit
	^ self ffiCallSafely: #(String git_commit_message #(self)) options: #()

The return type is declared as String which means it is converted from the ExternalAddress via: ExternalData(ByteArray)>>readStringUTF8.

Any ideas on how to fix it?

syrel avatar Jul 16 '21 09:07 syrel

Hi. I think it's the same issue reported as https://github.com/pharo-vcs/iceberg/issues/1426

tinchodias avatar Jul 16 '21 17:07 tinchodias

I have a fix ready that essentially does this:

When reading the message, try to use the encoding provided in the Git commit header. If that header does not provide the correct encoding (i.e. we get an encoding error), just return the string as-is, since it’s the best we can do.

I’d be happy to contribute this back into Iceberg. Is this approach what we envisioned? If so, what target branch should I use?

hellerve avatar Mar 17 '22 16:03 hellerve

I pushed the fix so it can be inspected for whether it’s desirable in its current state or not.

hellerve avatar Mar 17 '22 16:03 hellerve

Thanks a lot we will have a look.

Ducasse avatar Mar 19 '22 09:03 Ducasse