unicode-truncate
feat: segment by graphemes
Before this change, zero-width characters were assumed to be kept, but that was only a best-effort approach. unicode-segmentation bundles characters that belong together into grapheme clusters.
Sadly this is slower, but more correct.
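To illustrate why char-by-char handling is only best-effort, here is a minimal std-only sketch with a ZWJ emoji sequence. The grapheme-cluster count mentioned in the comment is what unicode-segmentation's `graphemes(true)` would report; that crate is not used here so the example stays self-contained.

```rust
fn main() {
    // A ZWJ emoji sequence: man + ZWJ + woman + ZWJ + girl.
    // Visually one "character", but many chars and bytes.
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";

    // Iterating by char splits the sequence apart into five pieces:
    assert_eq!(family.chars().count(), 5);
    // ...and it spans 18 bytes in UTF-8:
    assert_eq!(family.len(), 18);

    // unicode-segmentation's graphemes(true) would instead report a
    // single extended grapheme cluster, so truncation can keep or drop
    // the whole sequence at once instead of cutting it in the middle.
    println!("chars: {}, bytes: {}", family.chars().count(), family.len());
}
```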
zhu fu/16384/end       time:   [98.795 µs 98.834 µs 98.883 µs]
                       thrpt:  [158.02 MiB/s 158.09 MiB/s 158.16 MiB/s]
                change:
                       time:   [+420.90% +421.28% +421.82%] (p = 0.00 < 0.05)
                       thrpt:  [-80.836% -80.816% -80.802%]
                       Performance has regressed.
Found 8 outliers among 200 measurements (4.00%)
  1 (0.50%) low mild
  6 (3.00%) high mild
  1 (0.50%) high severe

zhu fu/16384/start     time:   [112.87 µs 112.98 µs 113.10 µs]
                       thrpt:  [138.15 MiB/s 138.30 MiB/s 138.43 MiB/s]
                change:
                       time:   [+461.21% +461.73% +462.28%] (p = 0.00 < 0.05)
                       thrpt:  [-82.215% -82.198% -82.181%]
                       Performance has regressed.
Found 4 outliers among 200 measurements (2.00%)
  4 (2.00%) high mild

zhu fu/16384/centered  time:   [50.122 µs 50.177 µs 50.249 µs]
                       thrpt:  [310.95 MiB/s 311.40 MiB/s 311.74 MiB/s]
                change:
                       time:   [+86.029% +86.268% +86.498%] (p = 0.00 < 0.05)
                       thrpt:  [-46.380% -46.314% -46.245%]
                       Performance has regressed.
Found 9 outliers among 200 measurements (4.50%)
  8 (4.00%) low mild
  1 (0.50%) high severe
Interestingly, centered is now a lot faster than the other two. Analyzing why could lead to performance improvements for the other two as well?
Interesting find: unicode-width 0.1.12 released a few days ago with this: https://github.com/unicode-rs/unicode-width/pull/41
I kinda expected the family emoji to be width 2 then, but it doesn't seem that way. It also doesn't start with that \u{FE0F}, so I am not entirely sure what should happen or what is correct.
I saw that PR and was thinking about the same thing last week. And my concern is exactly the large performance hit. I'm a bit reluctant to merge this. (But maybe the performance isn't that important, as truncating a full 16KB text is pretty rare? Happy to be convinced.)
The input text used in the benchmark is all Chinese characters without any zero-width fun stuff. I'd love to see some benchmarking on full-emoji text.
Regarding the performance difference, truncate and truncate_start count things to keep, while truncate_centered counts things to remove. But I'm not sure how much this matters as all benchmarks are truncating to roughly half width.
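The two counting strategies can be sketched as follows. This is a hypothetical, simplified illustration, not the crate's actual code: it ignores grapheme clusters and uses a toy width function (CJK ideographs count as width 2, everything else as width 1); the function names are made up.

```rust
// Toy width function for illustration only: CJK Unified Ideographs are
// width 2, everything else width 1. The real crate uses unicode-width.
fn char_width(c: char) -> usize {
    if ('\u{4E00}'..='\u{9FFF}').contains(&c) { 2 } else { 1 }
}

// truncate-style: walk from the front, counting the width to *keep*,
// and stop at the last boundary that still fits.
fn keep_prefix(s: &str, max_width: usize) -> &str {
    let mut width = 0;
    for (i, c) in s.char_indices() {
        if width + char_width(c) > max_width {
            return &s[..i];
        }
        width += char_width(c);
    }
    s
}

// truncate_centered-style idea: first compute how much width must be
// *removed*, then walk counting removed width until the rest fits.
// (Here removal happens only at the front, to keep the sketch short.)
fn drop_prefix(s: &str, max_width: usize) -> &str {
    let total: usize = s.chars().map(char_width).sum();
    if total <= max_width {
        return s;
    }
    let mut must_remove = total - max_width;
    for (i, c) in s.char_indices() {
        if must_remove == 0 {
            return &s[i..];
        }
        must_remove = must_remove.saturating_sub(char_width(c));
    }
    ""
}

fn main() {
    // Total width of "abc祝福" is 3 + 2 + 2 = 7.
    assert_eq!(keep_prefix("abc祝福", 4), "abc");   // kept width 3 <= 4
    assert_eq!(drop_prefix("abc祝福", 4), "祝福");  // removed width 3
    println!("ok");
}
```

Which direction is cheaper depends on how much of the input survives: counting what to remove touches fewer graphemes when most of the string is cut away, which may be part of why the centered benchmark fares better.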
> And my concern is exactly the large performance hit. I'm a bit reluctant to merge this. (But maybe the performance isn't that important, as truncating a full 16KB text is pretty rare? Happy to be convinced.)
I assume most use cases truncate to fairly small widths. Coming from the terminal library ratatui, there are maybe 100 characters of width, and either truncation or wrapping is used there. Even outside terminals, not many places need long truncation targets: browser dev tools truncate at the end of the line, which is also rather short.
Usages on GitHub seem to use widths like 25, 50, or 140.
I'll note that unicode-width does not guarantee that the width of a string equals the sum of the widths of its grapheme clusters. In 0.1.13 (the current published version), this property fails to hold only for the Old Lisu script (i.e., unlikely to be a problem in practice); on the latest master, it also fails for several other scripts, including Arabic.
Okay, let's get this merged. @EdJoPaTo could you rebase?