gitlint
gitlint copied to clipboard
Line-length-rules don't work correctly with german umlauts
We have configured the length-rules for title and body and all of them seem to count german umlauts as two characters. I suppose this is an issue of the python len-function which uses the byte-length of a string.
We are currently using python 3.10.4 and gitlint 0.17.0.
Example: 123456789012345678901234567890123456789012345678901234567890123456789012 -> evaluates as 72 characters 12345678901234567890123456789012345678901234567890123456789012345678901ä -> evaluates as 73 characters
AFAIK, that’s only the case in Python 2, not python 3. See below:
$ python
Python 3.10.2 (main, Feb 13 2022, 14:37:03) [GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> len("123456789012345678901234567890123456789012345678901234567890123456789012")
72
>>> len("12345678901234567890123456789012345678901234567890123456789012345678901ä")
72
$ python2
Python 2.7.18 (default, Mar 8 2021, 13:02:45)
[GCC 9.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> len("123456789012345678901234567890123456789012345678901234567890123456789012")
72
>>> len("12345678901234567890123456789012345678901234567890123456789012345678901ä")
73
I double checked this with git and gitlint, just to check nothing else is going on and I can’t reproduce this.
$ git log -1
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
commit d42be31a32c1c60b517d06b11dc70b4af2a07dee (HEAD -> master) ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Author: gïtlint-test-user <[email protected]>
Date: Mon Aug 1 09:18:51 2022 +0200
12345678901234567890123456789012345678901234567890123456789012345678901ä
$ gitlint -c T1.line-length=50
1: T1 Title exceeds max length (72>50): "12345678901234567890123456789012345678901234567890123456789012345678901ä"
3: B6 Body message is missing
Can you provide a reproducible example? Thanks!
We have added gitlint as hook via gitlint install-hook
and only the hook seems to count the characters wrong. If we execute python and gitlint on the command-line everything works fine and we can see the same results as you in your tests.
I have added a python --version
to the hook to see if there is an old Python 2 in use, but the output shows the expected version 3.10.4.
And I forgot to mention, that we are running Gitlint on Windows 10 machines.
And I forgot to mention, that we are running Gitlint on Windows 10 machines.
This might be clue! Gitlint has known issues with unicode on Windows; it's complicated (see #96).
For debugging, can you set your .gitlint
config like something this, just to ensure the line length is always printed as a violation.
[title-max-length]
line-length=5
Please then rerun your test, but set GITLINT_DEBUG=1
so we can get a little more info :)
GITLINT_DEBUG=1
Please paste all the hook output here after committing. Thanks!
If it can't be fixed, I can live with that. Gitlint is a great tool for improving our commit messages with umlaut-support and without.
From my understanding Gitlint uses gits COMMIT_EDITMSG-file to analyse the message. I have opened this file in Notepad++ and it counts the same way as Gitlint. Umlauts are counted as two characters in Notepad++ too.
I get this output:
DEBUG: gitlint.cli To report issues, please visit https://github.com/jorisroovers/gitlint/issues
DEBUG: gitlint.cli Platform: Windows-10-10.0.19044-SP0
DEBUG: gitlint.cli Python version: 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)]
DEBUG: gitlint.git ('--version',)
DEBUG: gitlint.cli Git version: git version 2.36.0.windows.1
DEBUG: gitlint.cli Gitlint version: 0.17.0
DEBUG: gitlint.cli GITLINT_USE_SH_LIB: [NOT SET]
DEBUG: gitlint.cli DEFAULT_ENCODING: UTF-8
DEBUG: gitlint.cli Configuration
config-path: C:\Repositories\smart-vms-client\.gitlint
[GENERAL]
extra-path: C:\Repositories\smart-vms-client\tools\gitlint
contrib: []
ignore: B6
ignore-merge-commits: True
ignore-fixup-commits: True
ignore-squash-commits: True
ignore-revert-commits: True
ignore-stdin: False
staged: True
fail-without-commits: False
verbosity: 3
debug: True
target: C:\Repositories\smart-vms-client
[RULES]
I1: ignore-by-title
ignore=all
regex=None
I2: ignore-by-body
ignore=all
regex=None
I3: ignore-body-lines
regex=None
I4: ignore-by-author-name
ignore=all
regex=None
T1: title-max-length
line-length=5
T2: title-trailing-whitespace
T6: title-leading-whitespace
T3: title-trailing-punctuation
T4: title-hard-tab
T5: title-must-not-contain-word
words=WIP
T7: title-match-regex
regex=None
T8: title-min-length
min-length=5
B1: body-max-line-length
line-length=5
B5: body-min-length
min-length=20
B6: body-is-missing
ignore-merge-commits=True
B2: body-trailing-whitespace
B3: body-hard-tab
B4: body-first-line-empty
B7: body-changed-file-mention
files=
B8: body-match-regex
regex=None
M1: author-valid-email
regex=[^@ ]+@[^@ ]+\.[^@ ]+
gitlint: checking commit message...
DEBUG: gitlint.cli Fetching additional meta-data from staged commit
DEBUG: gitlint.cli Using --msg-filename.
DEBUG: gitlint.git ('config', '--get', 'core.commentchar')
DEBUG: gitlint.cli Linting 1 commit(s)
DEBUG: gitlint.lint Linting commit [SHA UNKNOWN]
DEBUG: gitlint.git ('config', '--get', 'user.name')
DEBUG: gitlint.git ('config', '--get', 'user.email')
DEBUG: gitlint.git ('rev-parse', '--abbrev-ref', 'HEAD')
DEBUG: gitlint.git ('diff', '--staged', '--name-only', '-r')
DEBUG: gitlint.lint Commit Object
--- Commit Message ----
12345ü
--- Meta info ---------
Author:
Date: 2022-08-03 10:48:03 +0200
is-merge-commit: False
is-fixup-commit: False
is-squash-commit: False
is-revert-commit: False
Branches: ['develop']
Changed Files: ['.gitlint']
-----------------------
1: T1 Title exceeds max length (7>5): "12345ü"
DEBUG: gitlint.cli Exit Code = 1
Ok, thanks for the debug output.
The fact that Notepad++ is giving the same result seems to indicate "it's complicated" and not just a gitlint specific issue. I will keep this open but likely won't immediately work on this given I've tried (and failed) to fix Unicode on Windows multiple times. It also involves working in a virtual windows environment (I'm not a daily windows user), which is a bit tedious in terms of workflow for me.
Thanks for the bug report!
I had a chance to look at the sourcecode the last days myself and i think i found the issue - at least for windows.
The click-File option looses the encoding of the incoming datastream. If i change line 230 in cli.py from @click.option('--msg-filename', type=click.File(), help="Path to a file containing a commit-msg.")
into @click.option('--msg-filename', type=click.File(encoding='UTF-8'), help="Path to a file containing a commit-msg.")
the encoding is fine and the umlauts are counted as one character.
But i don't know if this has any sideeffects on other platforms or if it is a good way to fix this issue
Great find! Give me some time to try this out and get back to you, probably next week.
Ok, I just pushed a variation of this suggested fix to https://github.com/jorisroovers/gitlint/tree/issue-310-umlauts-windows.
My fix uses encoding=gitlint.utils.DEFAULT_ENCODING
which falls back to UTF-8
if no other encoding is set. This ensures we respect any environmental overrides.
All tests still pass after that change, so I expect this will work.
You can try this out using:
virtualenv gitlint-issue310
source gitlint-issue310/bin/activate
pip install git+https://github.com/jorisroovers/gitlint.git@issue-310-umlauts-windows#subdirectory=gitlint-core`
gitlint --debug # use gitlint as normal
I'll try this out myself on Windows before merging the branch.
Manually verified on Windows, merged the fix. Will go out with 0.18.0, whenever that is :) Thanks for the bug report!