gitlint icon indicating copy to clipboard operation
gitlint copied to clipboard

Line-length-rules don't work correctly with german umlauts

Open dakuenne opened this issue 2 years ago • 5 comments

We have configured the length-rules for title and body and all of them seem to count german umlauts as two characters. I suppose this is an issue of the python len-function which uses the byte-length of a string.

We are currently using python 3.10.4 and gitlint 0.17.0.

Example: 123456789012345678901234567890123456789012345678901234567890123456789012 -> evaluates as 72 characters 12345678901234567890123456789012345678901234567890123456789012345678901ä -> evaluates as 73 characters

dakuenne avatar Jul 28 '22 07:07 dakuenne

AFAIK, that’s only the case in Python 2, not python 3. See below:

$ python
Python 3.10.2 (main, Feb 13 2022, 14:37:03) [GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> len("123456789012345678901234567890123456789012345678901234567890123456789012")
72
>>> len("12345678901234567890123456789012345678901234567890123456789012345678901ä")
72

$ python2
Python 2.7.18 (default, Mar  8 2021, 13:02:45)
[GCC 9.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> len("123456789012345678901234567890123456789012345678901234567890123456789012")
72
>>> len("12345678901234567890123456789012345678901234567890123456789012345678901ä")
73

I double checked this with git and gitlint, just to check nothing else is going on and I can’t reproduce this.

$ git log -1
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
commit d42be31a32c1c60b517d06b11dc70b4af2a07dee (HEAD -> master) ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Author: gïtlint-test-user <[email protected]>
Date:   Mon Aug 1 09:18:51 2022 +0200

    12345678901234567890123456789012345678901234567890123456789012345678901ä

$ gitlint -c T1.line-length=50
1: T1 Title exceeds max length (72>50): "12345678901234567890123456789012345678901234567890123456789012345678901ä"
3: B6 Body message is missing

Can you provide a reproducible example? Thanks!

jorisroovers avatar Aug 01 '22 07:08 jorisroovers

We have added gitlint as hook via gitlint install-hook and only the hook seems to count the characters wrong. If we execute python and gitlint on the command-line everything works fine and we can see the same results as you in your tests. I have added a python --version to the hook to see if there is an old Python 2 in use, but the output shows the expected version 3.10.4. And I forgot to mention, that we are running Gitlint on Windows 10 machines.

dakuenne avatar Aug 01 '22 12:08 dakuenne

And I forgot to mention, that we are running Gitlint on Windows 10 machines.

This might be clue! Gitlint has known issues with unicode on Windows; it's complicated (see #96).

For debugging, can you set your .gitlint config like something this, just to ensure the line length is always printed as a violation.

[title-max-length]
line-length=5

Please then rerun your test, but set GITLINT_DEBUG=1 so we can get a little more info :)

GITLINT_DEBUG=1

Please paste all the hook output here after committing. Thanks!

jorisroovers avatar Aug 03 '22 08:08 jorisroovers

If it can't be fixed, I can live with that. Gitlint is a great tool for improving our commit messages with umlaut-support and without.

From my understanding Gitlint uses gits COMMIT_EDITMSG-file to analyse the message. I have opened this file in Notepad++ and it counts the same way as Gitlint. Umlauts are counted as two characters in Notepad++ too.

I get this output:

DEBUG: gitlint.cli To report issues, please visit https://github.com/jorisroovers/gitlint/issues
DEBUG: gitlint.cli Platform: Windows-10-10.0.19044-SP0
DEBUG: gitlint.cli Python version: 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)]
DEBUG: gitlint.git ('--version',)
DEBUG: gitlint.cli Git version: git version 2.36.0.windows.1
DEBUG: gitlint.cli Gitlint version: 0.17.0
DEBUG: gitlint.cli GITLINT_USE_SH_LIB: [NOT SET]
DEBUG: gitlint.cli DEFAULT_ENCODING: UTF-8
DEBUG: gitlint.cli Configuration
config-path: C:\Repositories\smart-vms-client\.gitlint
[GENERAL]
extra-path: C:\Repositories\smart-vms-client\tools\gitlint
contrib: []
ignore: B6
ignore-merge-commits: True
ignore-fixup-commits: True
ignore-squash-commits: True
ignore-revert-commits: True
ignore-stdin: False
staged: True
fail-without-commits: False
verbosity: 3
debug: True
target: C:\Repositories\smart-vms-client
[RULES]
  I1: ignore-by-title
     ignore=all
     regex=None
  I2: ignore-by-body
     ignore=all
     regex=None
  I3: ignore-body-lines
     regex=None
  I4: ignore-by-author-name
     ignore=all
     regex=None
  T1: title-max-length
     line-length=5
  T2: title-trailing-whitespace
  T6: title-leading-whitespace
  T3: title-trailing-punctuation
  T4: title-hard-tab
  T5: title-must-not-contain-word
     words=WIP
  T7: title-match-regex
     regex=None
  T8: title-min-length
     min-length=5
  B1: body-max-line-length
     line-length=5
  B5: body-min-length
     min-length=20
  B6: body-is-missing
     ignore-merge-commits=True
  B2: body-trailing-whitespace
  B3: body-hard-tab
  B4: body-first-line-empty
  B7: body-changed-file-mention
     files=
  B8: body-match-regex
     regex=None
  M1: author-valid-email
     regex=[^@ ]+@[^@ ]+\.[^@ ]+

gitlint: checking commit message...
DEBUG: gitlint.cli Fetching additional meta-data from staged commit
DEBUG: gitlint.cli Using --msg-filename.
DEBUG: gitlint.git ('config', '--get', 'core.commentchar')
DEBUG: gitlint.cli Linting 1 commit(s)
DEBUG: gitlint.lint Linting commit [SHA UNKNOWN]
DEBUG: gitlint.git ('config', '--get', 'user.name')
DEBUG: gitlint.git ('config', '--get', 'user.email')
DEBUG: gitlint.git ('rev-parse', '--abbrev-ref', 'HEAD')
DEBUG: gitlint.git ('diff', '--staged', '--name-only', '-r')
DEBUG: gitlint.lint Commit Object
--- Commit Message ----
12345ü
--- Meta info ---------
Author: 
Date:   2022-08-03 10:48:03 +0200
is-merge-commit:  False
is-fixup-commit:  False
is-squash-commit: False
is-revert-commit: False
Branches: ['develop']
Changed Files: ['.gitlint']
-----------------------
1: T1 Title exceeds max length (7>5): "12345ü"
DEBUG: gitlint.cli Exit Code = 1

dakuenne avatar Aug 03 '22 08:08 dakuenne

Ok, thanks for the debug output.

The fact that Notepad++ is giving the same result seems to indicate "it's complicated" and not just a gitlint specific issue. I will keep this open but likely won't immediately work on this given I've tried (and failed) to fix Unicode on Windows multiple times. It also involves working in a virtual windows environment (I'm not a daily windows user), which is a bit tedious in terms of workflow for me.

Thanks for the bug report!

jorisroovers avatar Aug 04 '22 08:08 jorisroovers

I had a chance to look at the sourcecode the last days myself and i think i found the issue - at least for windows. The click-File option looses the encoding of the incoming datastream. If i change line 230 in cli.py from @click.option('--msg-filename', type=click.File(), help="Path to a file containing a commit-msg.") into @click.option('--msg-filename', type=click.File(encoding='UTF-8'), help="Path to a file containing a commit-msg.") the encoding is fine and the umlauts are counted as one character. But i don't know if this has any sideeffects on other platforms or if it is a good way to fix this issue

dakuenne avatar Aug 11 '22 14:08 dakuenne

Great find! Give me some time to try this out and get back to you, probably next week.

jorisroovers avatar Aug 11 '22 16:08 jorisroovers

Ok, I just pushed a variation of this suggested fix to https://github.com/jorisroovers/gitlint/tree/issue-310-umlauts-windows.

My fix uses encoding=gitlint.utils.DEFAULT_ENCODING which falls back to UTF-8 if no other encoding is set. This ensures we respect any environmental overrides. All tests still pass after that change, so I expect this will work.

You can try this out using:

virtualenv gitlint-issue310
source gitlint-issue310/bin/activate
pip install git+https://github.com/jorisroovers/gitlint.git@issue-310-umlauts-windows#subdirectory=gitlint-core`
gitlint --debug # use gitlint as normal

I'll try this out myself on Windows before merging the branch.

jorisroovers avatar Aug 25 '22 08:08 jorisroovers

Manually verified on Windows, merged the fix. Will go out with 0.18.0, whenever that is :) Thanks for the bug report!

jorisroovers avatar Aug 26 '22 08:08 jorisroovers