black icon indicating copy to clipboard operation
black copied to clipboard

Let string splitters respect `East_Asian_Width` property

Open dahlia opened this issue 2 years ago • 17 comments

Description

This patch changes the preview style so that string splitters respect Unicode East Asian Width property. If you are not familiar to CJK languages it is not clear immediately. Let me elaborate with some examples.

Traditionally, East Asian characters (including punctuations) have taken up space twice than European letters and stops when they are rendered in monospace typeset. Compare the following characters:

abcdefg.
글、字。

The characters at the first line are half-width, and the second line are full-width. (Also note that the last character with a small circle, the East Asian period, is also full-width.) Therefore, if we want to prevent those full-width characters to exceed the maximum columns per line, we need to count their width rather than the number of characters. Again, the following characters:

글、字。

These are just 4 characters, but their total width is 8.

Suppose we want to maintain up to 4 columns per line with the following text:

abcdefg.
글、字。

How should it be then? We want it to look like:

abcd
efg.
글、
字。

However, Black currently turns it into like this:

abcd
efg.
글、字。

It's because Black currently counts the number of characters in the line instead of measuring their width. So, how could we measure the width? How can we tell if a character is full- or half-width? What if half-width characters and full-width ones are mixed in a line? That's why Unicode defined an attribute named East_Asian_Width. Unicode grouped every single character according to their width in fixed-width typeset.

So, this patch tried to change how Black measure line widths using East_Asian_Width attribute, and I believe it works well with full-width characters now. However, I am not confident if I implemented things in a proper way, so please let me know if I need to adjust!

Also, I believe it partially addresses #1197, but I touched only string splitters. Other parts probably need to be fixed as well.

Checklist - did you ...

  • [x] Add an entry in CHANGES.md if necessary?
  • [x] Add / update tests if necessary?
  • [ ] Add new / update outdated documentation?

dahlia avatar Dec 17 '22 07:12 dahlia

diff-shades results comparing this PR (82e39e847509ba6a650c06f88ff6449216d91adf) to main (dba3c2695c59fdb11825dbdf8f3b0ab6e0b368b2). The full diff is available in the logs under the "Generate HTML diff report" step.

╭─────────────────────── Summary ────────────────────────╮
│ 2 projects & 6 files changed / 130 changes [+102/-28]  │
│                                                        │
│ ... out of 2 418 550 lines, 11 508 files & 23 projects │
╰────────────────────────────────────────────────────────╯

Differences found.

What is this? | Workflow run | diff-shades documentation

github-actions[bot] avatar Dec 17 '22 16:12 github-actions[bot]

How nice!

But I have a question. How about handling for non string literal such as variable name, function name, class name, comment?

Examples are below

# comment

# 이 값을 건드리면 당신은 야근을 면치 못 할 것입니다. 이상하게 보여도 절대 수정하지 마세요 모두를 위해 그냥 둡시다. - 전임자 드림
ANSWER = 42
# function name

def 영어주소2한글주소(country: str, province: str, city: str, street_address1: str, street_address2: str, extra_data: dict[str, any]) -> str:
    pass

item4 avatar Dec 18 '22 02:12 item4

@item4 Other than string literals are not covered by this patch. I would appreciate if someone works on it further.

dahlia avatar Dec 18 '22 02:12 dahlia

Here's also few more points we need to address in the future:

  • There are zero-width characters like U+200B ZERO WIDTH SPACE. They are designated N (narrow) according to East_Asian_Width, which would be counted as a single column. We need to count them as zero column.
  • Multiple code points can be combined together and look like a single character when they are rendered. They are called combining characters or grapheme clusters. (E.g., umlaut letters, emoji with skin color.) Those characters should be counted together.

dahlia avatar Dec 18 '22 02:12 dahlia

Also, I'm a little worried about performance here. char_width() is obviously going to be slower than len(), but will this noticeably affect Black's overall performance?

Possible optimizations:

  • Use a fast path for ascii-only strings (str.isascii method), since many programs will contain mostly ASCII-only strings
  • Cache the result of the computation per string; this is only helpful if we often re-compute the width for the same string

JelleZijlstra avatar Dec 24 '22 04:12 JelleZijlstra

I added a fast path for ASCII-only strings as well!

dahlia avatar Dec 25 '22 19:12 dahlia

Consider using wcwidth? In my experience in other projects this is more accurate for control characters, non-asian characters, etc. (see this) Also wcwidth uses lru_cache which should be helpful for the performance

ZeroRin avatar Jan 05 '23 06:01 ZeroRin

Consider using wcwidth? In my experience in other projects this is more accurate for control characters, non-asian characters, etc. (see this) Also wcwidth uses lru_cache which should be helpful for the performance

Sounds good to me. @JelleZijlstra I wonder if it's appropriate to add a new dependency for this specific logic.

dahlia avatar Jan 05 '23 11:01 dahlia

There also exist packages like rich who generates a table with a tool script depend on wcwidth so that the package itself does not need it as dependency. Same strategy can be adopted here as well. I personally prefer using existing packages instead of copying the code though

ZeroRin avatar Jan 05 '23 11:01 ZeroRin

Looks like wcwidth is a pure-Python package with no further dependencies, so it wouldn't be too bad to add.

However, I am a little concerned about making formatting output depend on a third-party package. Right now if you have the same version of Black installed as someone else, you'll get the same formatting. But with wcwidth, perhaps different versions of the library will count line length differently, which would cause different formatting. That could create a confusing experience for users. We could get around it by pinning wcwidth to a specific version, but that's not great either.

JelleZijlstra avatar Jan 05 '23 15:01 JelleZijlstra

If we want to use wcwidth, I'd prefer using a mechanism similar to rich to avoid the issues Jelle brings up. Otherwise pinning is "fine," but that's not great either (may cause dependency conflicts for some users if they use another tool that uses wcwidth). I'm kinda surprised rich seems to use a binary search instead of a dictionary lookup, but maybe there's some complexity I'm overlooking?

ichard26 avatar Jan 05 '23 19:01 ichard26

I'm kinda surprised rich seems to use a binary search instead of a dictionary lookup, but maybe there's some complexity I'm overlooking?

Actually wcwidth itself is already using bisearch. To my understanding the key reason is that the width information is stored as a interval table instead of a full dictionary. A full dictionary for every unicode character would have ~a million records i guess, but the interval table consists of only ~500 records, as a block of characters tend to have same width. The rich implementation differs from the original wcwidth in that it merges bisearch table for 0-width and 2-width into a single, and that it caches width for strings with less then 512 characters for reuse.

ZeroRin avatar Jan 05 '23 23:01 ZeroRin

If we decide to generate Black's own table to avoid a runtime dependency on wcwidth, should the generated table be version-controlled, or should it be generated at build time (through a PEP-517-style custom build backend or a release script in the CI configuration)?

dahlia avatar Jan 08 '23 06:01 dahlia

While writing a plugin for Hatchling (our build backend) would be super cool, it's probably not the best idea given it still gives room for install-to-install variability as Jelle brought up. It would be rare given almost no one installs from sdist, but the possibility is still there. Extending our release CD would be nice, but we haven't tried adding any automated source changes before and in general it sounds like it could devolve into a debugging nightmare. Doing it the rich way (feel free to add the new logic in subpackage of the Black package) would probably be best (assuming running the generate table script from time to time is basically we all we need to do).

However, I'd like @JelleZijlstra's input.

ichard26 avatar Jan 08 '23 20:01 ichard26

Generating a file and checking it in seems like the best way. That way, we can make sure it stays the same for the whole year and keep the stable style stable.

JelleZijlstra avatar Jan 08 '23 20:01 JelleZijlstra

Now char_width() depends on the generated wcwidth() table instead of unicodedata.east_asian_width(), and it's LRU-cached. I also changed the fast path condition of str_width() to address what @ZeroRin pointed out. The script to generate the table is placed in scripts/make_width_table.py, and the generated table is in black/_width_table.py.

@JelleZijlstra Could you take a look again?

dahlia avatar Jan 09 '23 16:01 dahlia

@JelleZijlstra I addressed things you pointed out. Could you take a look again?

dahlia avatar Jan 25 '23 10:01 dahlia

diff-shades is failing on pandas apparently because there are no files to analyze, cc @ichard26

JelleZijlstra avatar Feb 22 '23 07:02 JelleZijlstra

I rebased commits on the latest main and resolved conflicts (mostly from #1879). Please take a look again!

dahlia avatar Mar 16 '23 14:03 dahlia

The test suite apparently failed on the CI job. Should I try to fix it?

dahlia avatar Mar 19 '23 04:03 dahlia

@yilei kindly fixed it in #3615. I'll merge in main to hopefully get tests to pass.

JelleZijlstra avatar Mar 19 '23 14:03 JelleZijlstra