Convert String processing to DFA for improved performace
This PR changes most of the processing of the base string case over to using a DFA State Machine that can run both forward and reverse. This is similar to the DFA used #47880.
Also introduce the idea that Julia's base Strings use a format we call inclusive/invalid UTF-8 or IUTF-8. IUTF-8 Is a superset of UTF-8 and encodes how Julia will split up the code units of a string.
This PR would be simplified by merging #48568 first as it borrows methods that were used in that. Furthermore with that PR merged, there will be speedups seen in most of the important methods for string.
Performance benchmarks using ndinsmore/StringBenchmarks show performance benefits of around 20% nearly across the board with length showing a benefit of 20x for long ascii strings.
Update to include benchmarks
julia> judge(iutf_m,master_m)
6-element BenchmarkTools.BenchmarkGroup:
tags: []
"length" => 3-element BenchmarkTools.BenchmarkGroup:
tags: []
"single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 2:64" => TrialJudgement(+35.28% => regression)
"length 512:4096" => TrialJudgement(-83.65% => improvement)
"length 64:512" => TrialJudgement(-36.18% => improvement)
"length 4096:32768" => TrialJudgement(-93.53% => improvement)
"julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"files" => TrialJudgement(-86.02% => improvement)
"lines" => TrialJudgement(-33.53% => improvement)
"files SubString" => TrialJudgement(-87.81% => improvement)
"lines SubString" => TrialJudgement(-30.25% => improvement)
"ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 2:64" => TrialJudgement(-28.44% => improvement)
"length 512:4096" => TrialJudgement(-93.43% => improvement)
"length 64:512" => TrialJudgement(-81.89% => improvement)
"length 4096:32768" => TrialJudgement(-95.20% => improvement)
"thisind" => 3-element BenchmarkTools.BenchmarkGroup:
tags: []
"ascii" => TrialJudgement(-15.60% => improvement)
"malformed" => TrialJudgement(-22.11% => improvement)
"unicode" => TrialJudgement(-18.80% => improvement)
"iterate" => 3-element BenchmarkTools.BenchmarkGroup:
tags: []
"ascii" => TrialJudgement(+10.01% => invariant)
"malformed" => TrialJudgement(-1.08% => invariant)
"unicode" => TrialJudgement(-28.38% => improvement)
"isascii" => 3-element BenchmarkTools.BenchmarkGroup:
tags: []
"single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 2:64" => TrialJudgement(-12.35% => invariant)
"length 512:4096" => TrialJudgement(-7.44% => invariant)
"length 64:512" => TrialJudgement(-8.75% => invariant)
"length 4096:32768" => TrialJudgement(-3.00% => invariant)
"julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"files" => TrialJudgement(-14.12% => invariant)
"lines" => TrialJudgement(-19.24% => improvement)
"files SubString" => TrialJudgement(-17.87% => improvement)
"lines SubString" => TrialJudgement(-11.89% => invariant)
"ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 2:64" => TrialJudgement(-8.00% => invariant)
"length 512:4096" => TrialJudgement(-4.36% => invariant)
"length 64:512" => TrialJudgement(-3.78% => invariant)
"length 4096:32768" => TrialJudgement(-4.05% => invariant)
"nextind" => 3-element BenchmarkTools.BenchmarkGroup:
tags: []
"ascii" => TrialJudgement(-73.16% => improvement)
"malformed" => TrialJudgement(-64.81% => improvement)
"unicode" => TrialJudgement(-14.73% => invariant)
"getindex" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"1-byte" => TrialJudgement(+0.37% => invariant)
"3-byte" => TrialJudgement(+5.05% => invariant)
"2-byte" => TrialJudgement(+2.89% => invariant)
"4-byte" => TrialJudgement(-0.21% => invariant)
Very cool!!
https://github.com/JuliaLang/julia/pull/48568 is now merged so hack away!
I think that there is still a conversation to be had that was started with @StefanKarpinski as to IUTF-8 is the right name for this string specification. I would like to have some consensus before putting it in master.
Also should how to interpret the state machine be covered in the comments more?
@jakobnissen & @stevengj you guys seems to do a lot of work in the in broader string ecosystems. Do you have any benchmarks that you could run with this to see if there is a similar down stream benefit for benchmarks?
I don't have any benchmarks for this, sorry. It looks like it's length and iteration that's been improved. I would probably just get a selection of texts and benchmark iteration over it and length. I'd pick English (all ASCII), French (mostly ASCII) and Russian (little ASCII) and maybe Chinese (little ASCII), and truncate a text to various lengths, then benchmark some simple functions.
I have changed the labeling of IUTF8 to GUTF8 this is short for Generalized UTF-8. This comes after the realization that the state machine is executing the Generalized UTF-8 states. IUTF8 arises when you do the processing of invalid bytes, as Julia does.
@oscardssmith & @StefanKarpinski There are alot of performance improvements here, and no real downside is there anything that you can see which is wrong with this?
In principle this seems good to me, but I haven't actually reviewed it in detail.
@oscardssmith I rebased this to because I wanted to include the work you did in #48996, but now it is failing two test that I don't fully understand how to fix.
Error During Test at ./julia/test/strings/basic.jl:1210
Expression evaluated to non-Boolean
Expression: Core.Compiler.is_foldable(e) || (f, Ts)
Value: (length, (String,))
Error in testset strings/basic:
Error During Test at ./julia/test/strings/basic.jl:1211
Expression evaluated to non-Boolean
Expression: Core.Compiler.is_removable_if_unused(e) || (f, Ts)
Value: (length, (String,))
The failing tests are the ones I added in https://github.com/JuliaLang/julia/pull/48996. The problem is that our compiler completely gives up on proving termination in the presence of loops.
@nanosoldier runtests()
Let's run a PkgEval and merge if it comes back green.
The problem is that our compiler completely gives up on proving termination in the presence of loops.
I wonder how hard it would be to add a special case for a loop over an integer range.
Your package evaluation job has completed - possible new issues were detected. A full report can be found here.
The StrBase error should probably be looked at.
Yeah, it is a bug here somewhere. With this PR:
julia> "α"[2]
'\xb1': Malformed UTF-8 (category Ma: Malformed, bad data)
It's pretty amazing there is no test for this.
I removed the erroneous fast path. Now StrBase.jl passes.
@nanosoldier runbenchmarks("string", vs = ":master")
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
In trying to understand and fix the benchmark results a question I have is does the ==(AbstractString ,AbstractString) call end up using dynamic dispatch and have to reload the iterate method for each call? Because I could see the need to load a large constant table affecting that speed.
@KristofferC I don't actually see why those methods should be getting slower all of them end up calling memcmp and none of the methods that were changed.
I don't actually see why those methods should be getting slower all of them end up calling memcmp and none of the methods that were changed.
I don't think that is true, they basically do:
julia> str = String('A':'Z') ^ 100;
julia> @btime cmp($SubString(str), $str) # on master
2.893 μs (1 allocation: 32 bytes)
julia> @btime cmp($SubString(str), $str) # on this branch
3.253 μs (1 allocation: 32 bytes)
Removing merge me, since it seems that @KristofferC added that a long time ago, so it is not clear if that was still valid
I agree that it is slower, but I don't see any reason it should be. I believe that the cmp method just ends up calling memcmp and doesn't call any of the methods that were changed. So in trying to figure out what to fix here, there is nothing that should have caused this slowdown.
I believe that the cmp method just ends up calling memcmp and doesn't call any of the methods that were changed
Why do you believe so?
julia> str = String('A':'Z') ^ 100;
julia> @which cmp(SubString(str), str)
cmp(a::AbstractString, b::AbstractString)
@ Base strings/basic.jl:299
points to
https://github.com/JuliaLang/julia/blob/4cce2a2603eb837e5dbf23429b4577725e3cde46/base/strings/basic.jl#L297-L306
any chance this is revivable? it's moderately conflicted, but it was a pretty nice perf improvement.
After the rebase, the results still show benefits with minor regressions.
I think that a lot of the regressions can be fixed with some further changes.
7-element BenchmarkTools.BenchmarkGroup:
tags: []
"isascii" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"files SubString" => TrialJudgement(-0.49% => invariant)
"files" => TrialJudgement(+16.69% => regression)
"lines" => TrialJudgement(-0.11% => invariant)
"lines SubString" => TrialJudgement(+0.39% => invariant)
"single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-7.08% => invariant)
"length 2:64" => TrialJudgement(-9.20% => invariant)
"length 4096:32768" => TrialJudgement(-7.40% => invariant)
"length 512:4096" => TrialJudgement(-4.67% => invariant)
"ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-6.88% => invariant)
"length 2:64" => TrialJudgement(+10.15% => invariant)
"length 4096:32768" => TrialJudgement(+1.76% => invariant)
"length 512:4096" => TrialJudgement(+2.03% => invariant)
"Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-23.46% => improvement)
"length 2:64" => TrialJudgement(-14.29% => invariant)
"length 4096:32768" => TrialJudgement(-1.69% => invariant)
"length 512:4096" => TrialJudgement(-2.55% => invariant)
"isvalid" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"files SubString" => TrialJudgement(+0.10% => invariant)
"files" => TrialJudgement(-0.81% => invariant)
"lines" => TrialJudgement(+0.40% => invariant)
"lines SubString" => TrialJudgement(+3.14% => invariant)
"single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-0.09% => invariant)
"length 2:64" => TrialJudgement(-0.22% => invariant)
"length 4096:32768" => TrialJudgement(+0.43% => invariant)
"length 512:4096" => TrialJudgement(-0.22% => invariant)
"ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-0.20% => invariant)
"length 2:64" => TrialJudgement(+0.80% => invariant)
"length 4096:32768" => TrialJudgement(-0.45% => invariant)
"length 512:4096" => TrialJudgement(+1.06% => invariant)
"Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-0.36% => invariant)
"length 2:64" => TrialJudgement(-1.38% => invariant)
"length 4096:32768" => TrialJudgement(+0.02% => invariant)
"length 512:4096" => TrialJudgement(+0.00% => invariant)
"length" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"files SubString" => TrialJudgement(-78.61% => improvement)
"files" => TrialJudgement(-85.06% => improvement)
"lines" => TrialJudgement(-35.68% => improvement)
"lines SubString" => TrialJudgement(-12.63% => invariant)
"single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-22.90% => improvement)
"length 2:64" => TrialJudgement(+33.49% => regression)
"length 4096:32768" => TrialJudgement(-95.13% => improvement)
"length 512:4096" => TrialJudgement(-84.73% => improvement)
"ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-93.42% => improvement)
"length 2:64" => TrialJudgement(-56.31% => improvement)
"length 4096:32768" => TrialJudgement(-97.34% => improvement)
"length 512:4096" => TrialJudgement(-96.81% => improvement)
"Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(+135.01% => regression)
"length 2:64" => TrialJudgement(+83.94% => regression)
"length 4096:32768" => TrialJudgement(+124.23% => regression)
"length 512:4096" => TrialJudgement(+117.76% => regression)
"thisind" => 3-element BenchmarkTools.BenchmarkGroup:
tags: []
"ascii" => TrialJudgement(+3.37% => invariant)
"unicode" => TrialJudgement(+30.63% => regression)
"malformed" => TrialJudgement(-25.35% => improvement)
"nextind" => 3-element BenchmarkTools.BenchmarkGroup:
tags: []
"ascii" => TrialJudgement(+0.51% => invariant)
"unicode" => TrialJudgement(-15.73% => improvement)
"malformed" => TrialJudgement(-40.26% => improvement)
"iterate" => 3-element BenchmarkTools.BenchmarkGroup:
tags: []
"ascii" => TrialJudgement(-5.16% => invariant)
"unicode" => TrialJudgement(+47.32% => regression)
"malformed" => TrialJudgement(-35.32% => improvement)
"getindex" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"4-byte" => TrialJudgement(+6.87% => invariant)
"3-byte" => TrialJudgement(+0.00% => invariant)
"1-byte" => TrialJudgement(+0.00% => invariant)
"2-byte" => TrialJudgement(+0.51% => invariant)
`
I say we do it!
There are a few steps I would like to take with this code before merging.
- [ ] Go back and incorporate optimizations from the baseline code that wasn't previously included.
- [ ] Update code to use memory vs arrays
Using Memory for lookup tables is not faster than Vector as the memory indirection seem to be compiled away, but Tuples may be (they were last I benchmarked a different application with a length 256 UInt8 LUT)