uniseg icon indicating copy to clipboard operation
uniseg copied to clipboard

optimize data structures for 5.5x speedup

Open charlievieth opened this issue 2 years ago • 2 comments

This commit improves the layout of the "parser transition state" and "grapheme code point" data structures; changes them to use structs instead of arrays; and adds an ASCII and Extended-ASCII fast path for the property() function.

Overall, this results in a ~5.5x speedup for Graphemes.Next().

goos: darwin
goarch: arm64
pkg: github.com/rivo/uniseg

name                        old time/op    new time/op    delta
GraphemesClass-10             2.87µs ± 0%    0.76µs ± 0%  -73.50%  (p=0.000 n=10+8)
GraphemesNext-10              2.55µs ± 0%    0.47µs ± 0%  -81.75%  (p=0.000 n=8+9)
GraphemesFunctionBytes-10     0.64ns ± 0%    0.64ns ± 0%     ~     (p=0.344 n=9+10)
GraphemesFunctionString-10    0.64ns ± 0%    0.64ns ± 0%     ~     (p=0.190 n=10+9)

name                        old alloc/op   new alloc/op   delta
GraphemesClass-10               944B ± 0%      944B ± 0%     ~     (all equal)
GraphemesNext-10               0.00B          0.00B          ~     (all equal)
GraphemesFunctionBytes-10      0.00B          0.00B          ~     (all equal)
GraphemesFunctionString-10     0.00B          0.00B          ~     (all equal)

name                        old allocs/op  new allocs/op  delta
GraphemesClass-10               3.00 ± 0%      3.00 ± 0%     ~     (all equal)
GraphemesNext-10                0.00           0.00          ~     (all equal)
GraphemesFunctionBytes-10       0.00           0.00          ~     (all equal)
GraphemesFunctionString-10      0.00           0.00          ~     (all equal)

Note: The GraphemesNext benchmark was backported to d6773baf for these comparisons.

charlievieth avatar Jul 22 '22 01:07 charlievieth

I'll write this here as there is no issue to go along with the PR's you submitted and I'd like to note a few general things:

  • First of all, thank you for contributing to this project!
  • I'm currently in the process of making major changes to uniseg. (You can see some of the process in the wordbreak branch and there will be other branches, too.)
  • These changes are so extensive, you could almost call them a rewrite of the package.
  • Therefore, by now, many of the PR's you submitted are clashing with those changes. It's not your fault, I didn't tell anyone that I'm working on this right now.
  • I should say I prefer if people open issues first, before submitting PR's. I outline this here. Again, not your fault, because that guide is not in uniseg (mostly because it doesn't get a lot of contributions so there's less guidance from my side).

In any case, once I've finish my current work on uniseg, you're welcome to change your PR's accordingly — although again, I would prefer that we discuss the proposed changes before you spend time writing code that may not end up being used.

rivo avatar Jul 22 '22 09:07 rivo

I should also note that if you want to keep working on this PR after the changes mentioned above, it should be broken into multiple PR's so we can discuss/approve each of these changes separately. For example, there are changes in the generators (not really relevant), the rule map has been reordered, sacrificing readability (very likely to be rejected), some suggestions are not helpful (e.g. turning i / 2 into int((uint) i >> 1)), and others are certainly useful. But with all of them in one PR, I can only merge, close, or we go back and forth on each individual change with review comments. I probably don't have time for the latter.

rivo avatar Jul 23 '22 12:07 rivo

This PR doesn't apply anymore at this point.

rivo avatar Jul 22 '23 10:07 rivo