crystal
crystal copied to clipboard
Optimize `String#valid_encoding?`
This PR implements one of the algorithms from #11873.
The loop can be made even faster by unrolling, which is not done here yet. Here is a benchmark comparing 4 different implementations of String#valid_encoding?
:
-
old
: The existingChar::Reader
-based loop. -
old inlined
: As above, but with this modification applied toChar::Reader#next_char
. -
new
: This PR. -
new unrolled
: Adds the following unrolling code, whereUNROLL = 64
for the benchmark:
Thewhile s + UNROLL <= e {% for i in 0...UNROLL %} state = table[s[{{ i }}]].unsafe_shr(state & 0x3F) {% end %} return false if state & 0x3F == 6 s += UNROLL end
return
statement must be outside the inner loop, otherwise the unrolling would become only marginally faster than this PR.
The results are:
old 936.05 ( 1.07ms) (± 2.37%) 0.0B/op 2.93× slower
old inlined 1.57k (638.75µs) (± 3.85%) 0.0B/op 1.75× slower
new 1.88k (532.38µs) (± 7.53%) 0.0B/op 1.46× slower
new unrolled 2.75k (364.02µs) (± 3.42%) 0.0B/op fastest
With --mcpu=native
this becomes:
old 931.93 ( 1.07ms) (± 3.02%) 0.0B/op 5.19× slower
old inlined 1.57k (635.39µs) (± 3.11%) 0.0B/op 3.07× slower
new 2.18k (458.16µs) (± 6.94%) 0.0B/op 2.22× slower
new unrolled 4.84k (206.70µs) (± 3.54%) 0.0B/op fastest
As explained in the linked issue, this improvement relies on the shrx
instruction from the BMI2 extension. The Char::Reader
implementations are barely affected.
Note that if the first byte of a sufficiently long string is already invalid, the inner loop must still consume the rest of the UNROLL
-byte chunk before the method returns. This is what happens if the benchmark had s = "\x80" * N
instead:
old 78.94M ( 12.67ns) (± 6.46%) 0.0B/op 3.56× slower
old inlined 280.90M ( 3.56ns) (± 5.88%) 0.0B/op fastest
new 275.24M ( 3.63ns) (± 8.38%) 0.0B/op 1.02× slower
new unrolled 20.22M ( 49.45ns) (± 4.52%) 0.0B/op 13.89× slower
Because of this I am not sure if we should unroll the loop here.
Unrolling looks good to me. It may have a negative impact on some cases. But the magnitude of the negative impact is dwarfed by the positive impact (10s of nanoseconds vs 100s of microseconds). And the typically expected case is a valid encoding. Finding an invalid byte which would allow to abort early is unexpected.
Perhaps a way to optimize for the early failing case would be to switch positions of the two loops and process the individual bytes first, before the unrolled loop. For a valid encoding there should be no difference because it needs to iterate through everything anyways. If there's an invalid encoding at the start, it would cause an early fail.
However, I'm also wondering if it would make a significant difference to move the return statement out of the individual byte loop. This would be optimizing for the likely case of a valid encoding.
Doing the byte loop before the unrolled loop does not matter, because if the first invalid byte occurs within the first unrolled chunk, then the method now visits bytes.size % UNROLL + UNROLL
bytes instead of UNROLL
bytes; the average iteration count would ultimately be the same.
Moving out the return
in the byte loop did little for valid strings, but now the method must also iterate through all bytes for invalid strings shorter than UNROLL
bytes. So I think the benchmark's version is good enough.
What I wonder most is whether unrolling still gives a performance boost on AArch64.
the average iteration count would ultimately be the same.
I'm not sure about that when accounting for the conditional probability that when there's an invalid byte somewhere in the slice (including in some of the last bytes), there's a higher probability of having more invalid bytes somewhere else (including in some of the first bytes). Purely theoretical thinking though and I don't think it's worth it digging deeper into this to maybe squeeze out a little bit more for use cases.
So I think the benchmark's version is good enough.
Definitely! 👍 ☺️ Thank you very much for that.
What I wonder most is whether unrolling still gives a performance boost on AArch64.
I can try that.
Sorry for the delay. I finally got around running the benchmark on a Raspberry Pi 4 Model B Rev 1.4.
$ uname -a
Linux 2977d8bfd403 5.15.32-v8+ #1538 SMP PREEMPT Thu Mar 31 19:40:39 BST 2022 aarch64 Linux
$ cat /etc/os-release | grep PRETTY_NAME
PRETTY_NAME="Alpine Linux v3.16"
old 387.73 ( 2.58ms) (± 0.87%) 0.0B/op 4.58× slower
old inlined 713.18 ( 1.40ms) (± 0.21%) 0.0B/op 2.49× slower
new 1.18k (849.85µs) (± 0.55%) 0.0B/op 1.51× slower
new unrolled 1.78k (562.88µs) (± 0.53%) 0.0B/op fastest
--mcpu=native
doesn't seem to have any effect on any of the implementations:
old 398.40 ( 2.51ms) (± 0.19%) 0.0B/op 4.49× slower
old inlined 688.86 ( 1.45ms) (± 0.47%) 0.0B/op 2.60× slower
new 1.18k (848.07µs) (± 0.27%) 0.0B/op 1.52× slower
new unrolled 1.79k (559.40µs) (± 0.09%) 0.0B/op fastest
@HertzDevil The implementation still needs to be integrated into valid_encoding?
, right?
Crystal 1.0.0 cannot handle UInt64
s inside the dfa_state
macro until #10713.
:thinking: