Black joins lines when it shouldn't - short width characters.
Describe the bug
Black doesn't count line length correctly in case of khmer characters. They fit screen-wise (kind of) but character length goes beyond limit.
To Reproduce
For example, take this code:
@pytest.mark.parametrize(
"text, expected_language",
[
(
(
"សម្រស់ទាវ២០២២ មិនធម្មតា ឥឡូវកំពុងរកតួ នេនទុំ និងពេជ្រ"
" ប្រញាប់ឡើងទាន់គេមានបញ្ហាត្រូវថតឡើងវិញ "
),
"km",
), # Khmer
],
)
def test_detect_language(text, expected_language):
assert detect_lang_of_text(text) == expected_language
And run it with these arguments:
$ black file.py --target-version py312 --line-length 100 --diff
The resulting error is:
@@ -1,13 +1,10 @@
@pytest.mark.parametrize(
"text, expected_language",
[
(
- (
- "សម្រស់ទាវ២០២២ មិនធម្មតា ឥឡូវកំពុងរកតួ នេនទុំ និងពេជ្រ"
- " ប្រញាប់ឡើងទាន់គេមានបញ្ហាត្រូវថតឡើងវិញ "
- ),
+ ("សម្រស់ទាវ២០២២ មិនធម្មតា ឥឡូវកំពុងរកតួ នេនទុំ និងពេជ្រ" " ប្រញាប់ឡើងទាន់គេមានបញ្ហាត្រូវថតឡើងវិញ "),
"km",
), # Khmer
],
)
def test_detect_language(text, expected_language):
would reformat file.py
All done! ✨ 🍰 ✨
1 file would be reformatted.
Expected behavior
No changes because line won't fit.
Environment
- Black's version:
black, 24.2.0 (compiled: yes)
Python (CPython) 3.12.1
- OS and Python version: Linux/Python 3.12.1
Additional context
The line is split correctly before changes but when joined like black does (even when we remove " " and parens) then it doesn't fit inside 100 characters.
This was changed as a result of #3445.
Could you explain why you think this is a bug? The line is displayed as <100 characters in length, even if it contains more than 100 characters.
In our case this is a problem because our Sonar later reports that line goes over 100 characters. I've used # fmt: off to prevent black from changing it.
I also use Pycharm with visual lines to indicate if it goes over 100 and 120 and with black' formatting it looks like this:
It clearly goes over the line which indicates that depending on editor the line may visually go over or not which is inconsistent. In this case if it relied on character count it would always go below visual limit (unless there are wider characters which count as 1 character which I don't know of).
unless there are wider characters which count as 1 character which I don't know of
This is fairly common for Chinese characters, see the issue I linked.
I assume Sonar is https://www.sonarsource.com/products/sonarlint/ ? Possibly linters should also adjust their line length computations to take character width into account. However, that might be more trouble than it's worth. I think I'd be OK with adjusting our logic to treat all characters as having a minimum length of 1. This would go into the preview style now and into the stable style for 2025.