black
Let string splitters respect `East_Asian_Width` property
Description
This patch changes the preview style so that string splitters respect the Unicode East_Asian_Width property. If you are not familiar with CJK languages, it may not be immediately clear why this matters, so let me elaborate with some examples.
Traditionally, East Asian characters (including punctuation) take up twice as much space as European letters and stops when rendered in a monospace typeface. Compare the following characters:
abcdefg.
글、字。
The characters on the first line are half-width, and those on the second line are full-width. (Note that the last character, the East Asian period with a small circle, is also full-width.) Therefore, if we want to prevent those full-width characters from exceeding the maximum columns per line, we need to count their width rather than the number of characters. Take the following characters again:
글、字。
These are just 4 characters, but their total width is 8.
Suppose we want to maintain up to 4 columns per line with the following text:
abcdefg.
글、字。
How should it be wrapped, then? We want it to look like:
abcd
efg.
글、
字。
However, Black currently turns it into this:
abcd
efg.
글、字。
That's because Black currently counts the number of characters in a line instead of measuring their width. So, how could we measure the width? How can we tell whether a character is full- or half-width? What if half-width and full-width characters are mixed in a line? That's why Unicode defines an attribute named East_Asian_Width, which groups every single character according to its width in fixed-width typesetting.
So, this patch changes how Black measures line widths using the East_Asian_Width attribute, and I believe it works well with full-width characters now. However, I am not confident that I implemented things in a proper way, so please let me know if I need to adjust anything!
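To make this concrete, here is a minimal sketch of width measurement based on the standard library's unicodedata.east_asian_width(); the name str_width is just a stand-in, not necessarily what the patch uses:

```python
import unicodedata

def str_width(text: str) -> int:
    """Count wide ("W") and full-width ("F") characters as two columns."""
    return sum(
        2 if unicodedata.east_asian_width(char) in ("W", "F") else 1
        for char in text
    )

assert str_width("abcdefg.") == 8
assert str_width("글、字。") == 8
```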
Also, I believe it partially addresses #1197, but I touched only the string splitters. Other parts probably need to be fixed as well.
Checklist - did you ...
- [x] Add an entry in CHANGES.md if necessary?
- [x] Add / update tests if necessary?
- [ ] Add new / update outdated documentation?
diff-shades results comparing this PR (82e39e847509ba6a650c06f88ff6449216d91adf) to main (dba3c2695c59fdb11825dbdf8f3b0ab6e0b368b2). The full diff is available in the logs under the "Generate HTML diff report" step.
╭─────────────────────── Summary ────────────────────────╮
│ 2 projects & 6 files changed / 130 changes [+102/-28] │
│ │
│ ... out of 2 418 550 lines, 11 508 files & 23 projects │
╰────────────────────────────────────────────────────────╯
Differences found.
How nice!
But I have a question: what about handling tokens other than string literals, such as variable names, function names, class names, and comments? Examples are below:
# comment
# 이 값을 건드리면 당신은 야근을 면치 못 할 것입니다. 이상하게 보여도 절대 수정하지 마세요 모두를 위해 그냥 둡시다. - 전임자 드림
ANSWER = 42
# function name
def 영어주소2한글주소(country: str, province: str, city: str, street_address1: str, street_address2: str, extra_data: dict[str, any]) -> str:
    pass
@item4 Tokens other than string literals are not covered by this patch. I would appreciate it if someone worked on them further.
Here are also a few more points we need to address in the future (a sketch handling both cases follows this list):
- There are zero-width characters like U+200B ZERO WIDTH SPACE. They are designated N (neutral) according to East_Asian_Width, so they would be counted as a single column. We need to count them as zero columns.
- Multiple code points can be combined together and rendered as a single character. These are called combining characters or grapheme clusters (e.g., umlaut letters, emoji with skin tone modifiers). Those code points should be counted together.
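A rough sketch of how both cases could be special-cased in a per-character width function; char_width is a stand-in name, and full grapheme-cluster handling would really need a segmentation step rather than per-character logic:

```python
import unicodedata

def char_width(char: str) -> int:
    # Zero-width space takes no columns despite being classified "N".
    if char == "\u200b":
        return 0
    # Combining marks (nonzero canonical combining class) attach to the
    # preceding character and should not add any width of their own.
    if unicodedata.combining(char):
        return 0
    return 2 if unicodedata.east_asian_width(char) in ("W", "F") else 1
```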
Also, I'm a little worried about performance here. char_width() is obviously going to be slower than len(), but will this noticeably affect Black's overall performance?
Possible optimizations (see the sketch after this list):
- Use a fast path for ASCII-only strings (the str.isascii method), since many programs will contain mostly ASCII-only strings.
- Cache the result of the computation per string; this is only helpful if we often re-compute the width for the same string.
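A minimal sketch of the ASCII fast path, reusing the hypothetical char_width helper from the earlier sketch:

```python
def str_width(text: str) -> int:
    # ASCII characters are always exactly one column wide, so we can
    # skip the per-character width lookup entirely.
    if text.isascii():
        return len(text)
    return sum(char_width(char) for char in text)
```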
I added a fast path for ASCII-only strings as well!
Consider using wcwidth? In my experience in other projects, it is more accurate for control characters, non-Asian characters, etc. (see this). Also, wcwidth uses lru_cache, which should be helpful for performance.
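For reference, a quick look at how the third-party wcwidth package is used; note that wcswidth returns -1 for strings containing non-printable characters:

```python
# pip install wcwidth
from wcwidth import wcswidth

print(wcswidth("abcdefg."))  # 8
print(wcswidth("글、字。"))  # 8
```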
Sounds good to me. @JelleZijlstra I wonder if it's appropriate to add a new dependency for this specific logic.
There are also packages like rich that generate a width table with a tooling script that depends on wcwidth, so that the package itself does not need wcwidth as a dependency. The same strategy could be adopted here as well. I personally prefer using existing packages instead of copying the code, though.
Looks like wcwidth is a pure-Python package with no further dependencies, so it wouldn't be too bad to add.
However, I am a little concerned about making formatting output depend on a third-party package. Right now if you have the same version of Black installed as someone else, you'll get the same formatting. But with wcwidth, perhaps different versions of the library will count line length differently, which would cause different formatting. That could create a confusing experience for users. We could get around it by pinning wcwidth to a specific version, but that's not great either.
If we want to use wcwidth, I'd prefer using a mechanism similar to rich to avoid the issues Jelle brings up. Otherwise pinning is "fine," but that's not great either (may cause dependency conflicts for some users if they use another tool that uses wcwidth). I'm kinda surprised rich seems to use a binary search instead of a dictionary lookup, but maybe there's some complexity I'm overlooking?
Actually, wcwidth itself is already using binary search. To my understanding, the key reason is that the width information is stored as an interval table instead of a full dictionary. A full dictionary covering every Unicode character would have around a million records, I guess, but the interval table consists of only about 500 records, since blocks of adjacent characters tend to have the same width. The rich implementation differs from the original wcwidth in that it merges the binary-search tables for zero-width and double-width characters into a single table, and that it caches the width of strings shorter than 512 characters for reuse.
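To make the interval-table idea concrete, here is a minimal sketch with a hypothetical two-entry table; a real table generated from EastAsianWidth.txt would have a few hundred entries:

```python
from bisect import bisect_right

# Hypothetical excerpt of an interval table: (start, end, width),
# sorted by start codepoint and non-overlapping.
WIDTH_TABLE = [
    (0x1100, 0x115F, 2),  # Hangul Jamo leading consonants
    (0x4E00, 0x9FFF, 2),  # CJK Unified Ideographs
]
_STARTS = [start for start, _end, _width in WIDTH_TABLE]

def table_width(codepoint: int) -> int:
    # Binary-search for the last interval starting at or before codepoint.
    i = bisect_right(_STARTS, codepoint) - 1
    if i >= 0:
        start, end, width = WIDTH_TABLE[i]
        if codepoint <= end:
            return width
    return 1  # Default: narrow.
```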
If we decide to generate Black's own table to avoid a runtime dependency on wcwidth, should the generated table be version-controlled, or should it be generated at build time (through a PEP-517-style custom build backend or a release script in the CI configuration)?
While writing a plugin for Hatchling (our build backend) would be super cool, it's probably not the best idea, given that it still leaves room for install-to-install variability, as Jelle brought up. It would be rare, since almost no one installs from sdist, but the possibility is still there. Extending our release CD would be nice, but we haven't tried adding any automated source changes before, and in general it sounds like it could devolve into a debugging nightmare. Doing it the rich way (feel free to add the new logic in a subpackage of the Black package) would probably be best, assuming that running the table-generation script from time to time is basically all we need to do.
However, I'd like @JelleZijlstra's input.
Generating a file and checking it in seems like the best way. That way, we can make sure it stays the same for the whole year and keep the stable style stable.
Now char_width() depends on the generated wcwidth table instead of unicodedata.east_asian_width(), and it's LRU-cached. I also changed the fast-path condition of str_width() to address what @ZeroRin pointed out. The script that generates the table is placed in scripts/make_width_table.py, and the generated table is in black/_width_table.py.
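For illustration only, a sketch of how the cached lookup might compose with the table_width helper sketched earlier; the cache size and the exact fast-path condition here are guesses, not the PR's actual values:

```python
from functools import lru_cache

@lru_cache(maxsize=4096)  # cache size is a guess
def char_width(char: str) -> int:
    # Consult the generated interval table (see table_width above).
    return table_width(ord(char))

def str_width(line_str: str) -> int:
    if line_str.isascii():  # fast path: one column per ASCII character
        return len(line_str)
    return sum(char_width(char) for char in line_str)
```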
@JelleZijlstra Could you take a look again?
@JelleZijlstra I addressed things you pointed out. Could you take a look again?
diff-shades is failing on pandas, apparently because there are no files to analyze. cc @ichard26
I rebased commits on the latest main and resolved conflicts (mostly from #1879). Please take a look again!
The test suite apparently failed on the CI job. Should I try to fix it?
@yilei kindly fixed it in #3615. I'll merge in main to hopefully get tests to pass.