
Optimize `String#valid_encoding?`

Open · HertzDevil opened this pull request 2 years ago · 6 comments

This PR implements one of the algorithms from #11873.
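For context, the algorithm in question is a table-driven DFA whose transitions are packed into 64-bit words: each table entry holds several 6-bit next-state values at bit offsets equal to the current state, so `table[byte] >> (state & 0x3F)` performs one transition per byte with a single variable shift. The Python sketch below demonstrates only the packing idea on a deliberately tiny toy DFA; the states, alphabet, and error value here are invented for illustration and are not the PR's actual UTF-8 tables:

```python
# Toy shift-based DFA: accepts strings over {a, b} in which every 'a' is
# immediately followed by 'b' and every 'b' is preceded by 'a'.
# States are pre-multiplied bit offsets (OK = 0, AFTER_A = 6, ERR = 12),
# so the current state doubles as the shift amount into a packed table word.
OK, AFTER_A, ERR = 0, 6, 12

def pack(transitions: dict) -> int:
    """Pack {current_state_offset: next_state} into one integer."""
    return sum(nxt << cur for cur, nxt in transitions.items())

ERR_WORD = pack({OK: ERR, AFTER_A: ERR, ERR: ERR})  # any unexpected byte
TABLE = {
    ord("a"): pack({OK: AFTER_A, AFTER_A: ERR, ERR: ERR}),
    ord("b"): pack({OK: ERR, AFTER_A: OK, ERR: ERR}),
}

def valid(data: bytes) -> bool:
    state = OK
    for byte in data:
        # one transition = one table lookup plus one variable shift
        # (the shift is where BMI2's shrx helps on x86)
        state = (TABLE.get(byte, ERR_WORD) >> state) & 0x3F
        if state == ERR:
            return False
    return state == OK  # must not end mid-pair
```

Unlike the PR's loop, this sketch masks the state after every shift for clarity; the PR instead keeps the raw shifted word and masks lazily with `state & 0x3F` wherever the state is consumed.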

The loop can be made even faster by unrolling, which is not done here yet. Here is a benchmark comparing four different implementations of `String#valid_encoding?`:

  • old: The existing Char::Reader-based loop.
  • old inlined: As above, but with this modification applied to Char::Reader#next_char.
  • new: This PR.
  • new unrolled: Adds the following unrolling code, where UNROLL = 64 for the benchmark:
    while s + UNROLL <= e
      {% for i in 0...UNROLL %}
        # one DFA transition per byte; the low 6 bits of `state` are the shift
        state = table[s[{{ i }}]].unsafe_shr(state & 0x3F)
      {% end %}
      # check for the error state once per chunk, not once per byte
      return false if state & 0x3F == 6
      s += UNROLL
    end
    
    The return statement must be outside the inner loop, otherwise the unrolling would become only marginally faster than this PR.
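The shape of that unrolled loop can be mimicked in plain Python to show why the deferred check matters (a sketch only: `UNROLL` and the byte predicate are illustrative stand-ins, since the real inner body runs a UTF-8 DFA transition rather than a byte comparison):

```python
# Plain-Python sketch of the unrolled loop's structure.
UNROLL = 64

def scan(data: bytes, is_bad) -> bool:
    """Return True if any byte satisfies is_bad, mimicking the unrolled loop."""
    i, n = 0, len(data)
    while i + UNROLL <= n:
        hit = False
        for j in range(UNROLL):     # the macro-expanded inner body
            hit |= is_bad(data[i + j])
        if hit:                     # a single check per chunk, outside the
            return True             # inner loop, as in the snippet above
        i += UNROLL
    while i < n:                    # byte-at-a-time loop for the remainder
        if is_bad(data[i]):
            return True
        i += 1
    return False
```

Because the per-chunk check sits outside the inner loop, an input whose very first byte is bad still pays for a full `UNROLL`-byte chunk, which is exactly the pathological case measured further down.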

The results are:

         old 936.05  (  1.07ms) (± 2.37%)  0.0B/op   2.93× slower
 old inlined   1.57k (638.75µs) (± 3.85%)  0.0B/op   1.75× slower
         new   1.88k (532.38µs) (± 7.53%)  0.0B/op   1.46× slower
new unrolled   2.75k (364.02µs) (± 3.42%)  0.0B/op        fastest

With `--mcpu=native` this becomes:

         old 931.93  (  1.07ms) (± 3.02%)  0.0B/op   5.19× slower
 old inlined   1.57k (635.39µs) (± 3.11%)  0.0B/op   3.07× slower
         new   2.18k (458.16µs) (± 6.94%)  0.0B/op   2.22× slower
new unrolled   4.84k (206.70µs) (± 3.54%)  0.0B/op        fastest

As explained in the linked issue, this improvement relies on the `shrx` instruction from the BMI2 extension. The Char::Reader implementations are barely affected.

Note that if the first byte of a sufficiently long string is already invalid, the inner loop must still consume the rest of the UNROLL-byte chunk before the method returns. Here is what happens if the benchmark uses `s = "\x80" * N` instead:

         old  78.94M ( 12.67ns) (± 6.46%)  0.0B/op   3.56× slower
 old inlined 280.90M (  3.56ns) (± 5.88%)  0.0B/op        fastest
         new 275.24M (  3.63ns) (± 8.38%)  0.0B/op   1.02× slower
new unrolled  20.22M ( 49.45ns) (± 4.52%)  0.0B/op  13.89× slower

Because of this I am not sure if we should unroll the loop here.

HertzDevil avatar Jun 21 '22 23:06 HertzDevil

Unrolling looks good to me. It may have a negative impact in some cases, but the magnitude of that negative impact is dwarfed by the positive one (tens of nanoseconds vs. hundreds of microseconds). And the typically expected case is a valid encoding; finding an invalid byte that would allow aborting early is unexpected.

straight-shoota avatar Jun 22 '22 11:06 straight-shoota

Perhaps a way to optimize the early-failing case would be to swap the positions of the two loops and process the individual bytes first, before the unrolled loop. For a valid encoding there should be no difference, because it needs to iterate through everything anyway. If there is an invalid encoding at the start, it would fail early.

However, I'm also wondering whether it would make a significant difference to move the return statement out of the individual-byte loop. That would optimize for the likely case of a valid encoding.

straight-shoota avatar Jun 22 '22 11:06 straight-shoota

Doing the byte loop before the unrolled loop does not matter, because if the first invalid byte occurs within the first unrolled chunk, then the method now visits `bytes.size % UNROLL + UNROLL` bytes instead of `UNROLL` bytes; the average iteration count would ultimately be the same.
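That byte count can be checked with a small arithmetic model (the helper names are hypothetical; `UNROLL = 64` as in the benchmark):

```python
UNROLL = 64

def visited_chunks_first(n: int, bad: int) -> int:
    """Bytes examined by the PR's order (unrolled chunks, then byte tail)
    when the first invalid byte sits at index `bad` of an n-byte string."""
    full = n - n % UNROLL
    if bad < full:
        return (bad // UNROLL + 1) * UNROLL  # must finish the whole chunk
    return bad + 1                           # the byte tail aborts immediately

def visited_tail_first(n: int, bad: int) -> int:
    """Bytes examined if the byte loop ran first on the n % UNROLL remainder,
    followed by the unrolled chunks (the reordering discussed above)."""
    tail = n % UNROLL
    if bad < tail:
        return bad + 1                       # early fail in the byte loop
    return tail + ((bad - tail) // UNROLL + 1) * UNROLL
```

For a first invalid byte inside the first full chunk, the reordered version visits `n % UNROLL + UNROLL` bytes where the PR's order visits `UNROLL`; averaged over uniformly placed invalid bytes, the two orders come out the same.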

Moving the return out of the byte loop did little for valid strings, but then the method must also iterate through all bytes for invalid strings shorter than `UNROLL` bytes. So I think the benchmark's version is good enough.

What I wonder most is whether unrolling still gives a performance boost on AArch64.

HertzDevil avatar Jun 22 '22 20:06 HertzDevil

> the average iteration count would ultimately be the same.

I'm not sure about that once you account for conditional probability: when there's an invalid byte somewhere in the slice (including among the last bytes), there's a higher probability of more invalid bytes elsewhere (including among the first bytes). That's purely theoretical thinking, though, and I don't think it's worth digging deeper into this to maybe squeeze out a little more for some use cases.

> So I think the benchmark's version is good enough.

Definitely! 👍 ☺️ Thank you very much for that.

> What I wonder most is whether unrolling still gives a performance boost on AArch64.

I can try that.

straight-shoota avatar Jun 22 '22 22:06 straight-shoota

Sorry for the delay. I finally got around to running the benchmark on a Raspberry Pi 4 Model B Rev 1.4.

$ uname -a
Linux 2977d8bfd403 5.15.32-v8+ #1538 SMP PREEMPT Thu Mar 31 19:40:39 BST 2022 aarch64 Linux
$ cat /etc/os-release | grep PRETTY_NAME
PRETTY_NAME="Alpine Linux v3.16"

         old 387.73  (  2.58ms) (± 0.87%)  0.0B/op   4.58× slower
 old inlined 713.18  (  1.40ms) (± 0.21%)  0.0B/op   2.49× slower
         new   1.18k (849.85µs) (± 0.55%)  0.0B/op   1.51× slower
new unrolled   1.78k (562.88µs) (± 0.53%)  0.0B/op        fastest

`--mcpu=native` doesn't seem to have any effect on any of the implementations:

         old 398.40  (  2.51ms) (± 0.19%)  0.0B/op   4.49× slower
 old inlined 688.86  (  1.45ms) (± 0.47%)  0.0B/op   2.60× slower
         new   1.18k (848.07µs) (± 0.27%)  0.0B/op   1.52× slower
new unrolled   1.79k (559.40µs) (± 0.09%)  0.0B/op        fastest

straight-shoota avatar Jul 27 '22 15:07 straight-shoota

@HertzDevil The implementation still needs to be integrated into `valid_encoding?`, right?

straight-shoota avatar Aug 01 '22 23:08 straight-shoota

Crystal 1.0.0 cannot handle `UInt64`s inside the `dfa_state` macro until #10713.

🤔

HertzDevil avatar Aug 13 '22 22:08 HertzDevil