unicode-truncate
feat: segment by graphemes
Before this change, zero-width characters were assumed to be kept, but that was only a best-effort approach. unicode-segmentation bundles characters that belong together into grapheme clusters.
Sadly this is slower, but more correct.
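To illustrate why char-by-char handling is only best-effort, here is a minimal std-only sketch with a ZWJ emoji sequence. The grapheme-cluster count mentioned in the comment is what unicode-segmentation's `graphemes(true)` would report; that crate is not used here so the example stays self-contained.

```rust
fn main() {
    // A ZWJ emoji sequence: man + ZWJ + woman + ZWJ + girl.
    // Visually one "character", but many chars and bytes.
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";

    // Iterating by char splits the sequence apart into five pieces:
    assert_eq!(family.chars().count(), 5);
    // ...and it spans 18 bytes in UTF-8:
    assert_eq!(family.len(), 18);

    // unicode-segmentation's graphemes(true) would instead report a
    // single extended grapheme cluster, so truncation can keep or drop
    // the whole sequence at once instead of cutting it in the middle.
    println!("chars: {}, bytes: {}", family.chars().count(), family.len());
}
```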
zhu fu/16384/end       time:   [98.795 µs 98.834 µs 98.883 µs]
                       thrpt:  [158.02 MiB/s 158.09 MiB/s 158.16 MiB/s]
                change:
                       time:   [+420.90% +421.28% +421.82%] (p = 0.00 < 0.05)
                       thrpt:  [-80.836% -80.816% -80.802%]
                       Performance has regressed.
Found 8 outliers among 200 measurements (4.00%)
  1 (0.50%) low mild
  6 (3.00%) high mild
  1 (0.50%) high severe

zhu fu/16384/start     time:   [112.87 µs 112.98 µs 113.10 µs]
                       thrpt:  [138.15 MiB/s 138.30 MiB/s 138.43 MiB/s]
                change:
                       time:   [+461.21% +461.73% +462.28%] (p = 0.00 < 0.05)
                       thrpt:  [-82.215% -82.198% -82.181%]
                       Performance has regressed.
Found 4 outliers among 200 measurements (2.00%)
  4 (2.00%) high mild

zhu fu/16384/centered  time:   [50.122 µs 50.177 µs 50.249 µs]
                       thrpt:  [310.95 MiB/s 311.40 MiB/s 311.74 MiB/s]
                change:
                       time:   [+86.029% +86.268% +86.498%] (p = 0.00 < 0.05)
                       thrpt:  [-46.380% -46.314% -46.245%]
                       Performance has regressed.
Found 9 outliers among 200 measurements (4.50%)
  8 (4.00%) low mild
  1 (0.50%) high severe
Interestingly, centered is now a lot faster than the other two. Analyzing why could lead to performance improvements for the other two as well?
Interesting find: unicode-width 0.1.12 released a few days ago with this: https://github.com/unicode-rs/unicode-width/pull/41
I kinda expected the family emoji to be width 2 then, but it doesn't seem that way. It also doesn't start with that \u{FE0F}, so I am not entirely sure what should happen or what is correct.
I saw that PR and was thinking about the same thing last week. And my concern is exactly the large performance hit. I'm a bit reluctant to merge this. (But maybe the performance isn't that important, as truncating a full 16KB text is pretty rare? Happy to be convinced.)
The input text used in the benchmark is all Chinese characters without any zero-width fun stuff. I'd love to see some benchmarking on full-emoji text.
Regarding the performance difference, truncate and truncate_start count things to keep, while truncate_centered counts things to remove. But I'm not sure how much this matters as all benchmarks are truncating to roughly half width.
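The two counting strategies can be sketched as follows. This is a hypothetical, simplified illustration, not the crate's actual code: it ignores grapheme clusters and uses a toy width function (CJK ideographs count as width 2, everything else as width 1); the function names are made up.

```rust
// Toy width function for illustration only: CJK Unified Ideographs are
// width 2, everything else width 1. The real crate uses unicode-width.
fn char_width(c: char) -> usize {
    if ('\u{4E00}'..='\u{9FFF}').contains(&c) { 2 } else { 1 }
}

// truncate-style: walk from the front, counting the width to *keep*,
// and stop at the last boundary that still fits.
fn keep_prefix(s: &str, max_width: usize) -> &str {
    let mut width = 0;
    for (i, c) in s.char_indices() {
        if width + char_width(c) > max_width {
            return &s[..i];
        }
        width += char_width(c);
    }
    s
}

// truncate_centered-style idea: first compute how much width must be
// *removed*, then walk counting removed width until the rest fits.
// (Here removal happens only at the front, to keep the sketch short.)
fn drop_prefix(s: &str, max_width: usize) -> &str {
    let total: usize = s.chars().map(char_width).sum();
    if total <= max_width {
        return s;
    }
    let mut must_remove = total - max_width;
    for (i, c) in s.char_indices() {
        if must_remove == 0 {
            return &s[i..];
        }
        must_remove = must_remove.saturating_sub(char_width(c));
    }
    ""
}

fn main() {
    // Total width of "abc祝福" is 3 + 2 + 2 = 7.
    assert_eq!(keep_prefix("abc祝福", 4), "abc");   // kept width 3 <= 4
    assert_eq!(drop_prefix("abc祝福", 4), "祝福");  // removed width 3
    println!("ok");
}
```

Which direction is cheaper depends on how much of the input survives: counting what to remove touches fewer graphemes when most of the string is cut away, which may be part of why the centered benchmark fares better.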
> And my concern is exactly the large performance hit. I'm a bit reluctant to merge this. (But maybe the performance isn't that important, as truncating a full 16KB text is pretty rare? Happy to be convinced.)
I assume most use cases truncate to fairly small widths. Coming from the terminal library ratatui, there are maybe 100 characters of width, and either truncation or wrapping is used there. Even outside terminals, not many places need long truncation targets: browser dev tools truncate at the end of the line, which is also rather short.
Usages on GitHub seem to use widths like 25, 50, or 140.
I'll note that unicode-width does not guarantee that the width of a string equals the sum of the widths of its grapheme clusters. In 0.1.13 (the current published version), this property fails to hold only for the Old Lisu script (i.e., unlikely to be a problem in practice); on the latest master, it also fails for several other scripts, including Arabic.
Okay, let's get this merged. @EdJoPaTo could you rebase?