julia Convert String processing to DFA for improved performace

This PR changes most of the processing of the base string case over to using a DFA State Machine that can run both forward and reverse. This is similar to the DFA used #47880.

Also introduce the idea that Julia's base Strings use a format we call inclusive/invalid UTF-8 or IUTF-8. IUTF-8 Is a superset of UTF-8 and encodes how Julia will split up the code units of a string.

This PR would be simplified by merging #48568 first as it borrows methods that were used in that. Furthermore with that PR merged, there will be speedups seen in most of the important methods for string.

Performance benchmarks using ndinsmore/StringBenchmarks show performance benefits of around 20% nearly across the board with length showing a benefit of 20x for long ascii strings.

Update to include benchmarks

julia> judge(iutf_m,master_m)
6-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "length" => 3-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "length 2:64" => TrialJudgement(+35.28% => regression)
		  "length 512:4096" => TrialJudgement(-83.65% => improvement)
		  "length 64:512" => TrialJudgement(-36.18% => improvement)
		  "length 4096:32768" => TrialJudgement(-93.53% => improvement)
	  "julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "files" => TrialJudgement(-86.02% => improvement)
		  "lines" => TrialJudgement(-33.53% => improvement)
		  "files SubString" => TrialJudgement(-87.81% => improvement)
		  "lines SubString" => TrialJudgement(-30.25% => improvement)
	  "ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "length 2:64" => TrialJudgement(-28.44% => improvement)
		  "length 512:4096" => TrialJudgement(-93.43% => improvement)
		  "length 64:512" => TrialJudgement(-81.89% => improvement)
		  "length 4096:32768" => TrialJudgement(-95.20% => improvement)
  "thisind" => 3-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "ascii" => TrialJudgement(-15.60% => improvement)
	  "malformed" => TrialJudgement(-22.11% => improvement)
	  "unicode" => TrialJudgement(-18.80% => improvement)
  "iterate" => 3-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "ascii" => TrialJudgement(+10.01% => invariant)
	  "malformed" => TrialJudgement(-1.08% => invariant)
	  "unicode" => TrialJudgement(-28.38% => improvement)
  "isascii" => 3-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "length 2:64" => TrialJudgement(-12.35% => invariant)
		  "length 512:4096" => TrialJudgement(-7.44% => invariant)
		  "length 64:512" => TrialJudgement(-8.75% => invariant)
		  "length 4096:32768" => TrialJudgement(-3.00% => invariant)
	  "julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "files" => TrialJudgement(-14.12% => invariant)
		  "lines" => TrialJudgement(-19.24% => improvement)
		  "files SubString" => TrialJudgement(-17.87% => improvement)
		  "lines SubString" => TrialJudgement(-11.89% => invariant)
	  "ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "length 2:64" => TrialJudgement(-8.00% => invariant)
		  "length 512:4096" => TrialJudgement(-4.36% => invariant)
		  "length 64:512" => TrialJudgement(-3.78% => invariant)
		  "length 4096:32768" => TrialJudgement(-4.05% => invariant)
  "nextind" => 3-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "ascii" => TrialJudgement(-73.16% => improvement)
	  "malformed" => TrialJudgement(-64.81% => improvement)
	  "unicode" => TrialJudgement(-14.73% => invariant)
  "getindex" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "1-byte" => TrialJudgement(+0.37% => invariant)
	  "3-byte" => TrialJudgement(+5.05% => invariant)
	  "2-byte" => TrialJudgement(+2.89% => invariant)
	  "4-byte" => TrialJudgement(-0.21% => invariant)

Mar 03 '23 19:03 ndinsmore

Very cool!!

Mar 03 '23 20:03 JeffBezanson

https://github.com/JuliaLang/julia/pull/48568 is now merged so hack away!

Mar 03 '23 20:03 oscardssmith

I think that there is still a conversation to be had that was started with @StefanKarpinski as to IUTF-8 is the right name for this string specification. I would like to have some consensus before putting it in master.

Also should how to interpret the state machine be covered in the comments more?

Mar 07 '23 21:03 ndinsmore

@jakobnissen & @stevengj you guys seems to do a lot of work in the in broader string ecosystems. Do you have any benchmarks that you could run with this to see if there is a similar down stream benefit for benchmarks?

Mar 09 '23 16:03 ndinsmore

I don't have any benchmarks for this, sorry. It looks like it's length and iteration that's been improved. I would probably just get a selection of texts and benchmark iteration over it and length. I'd pick English (all ASCII), French (mostly ASCII) and Russian (little ASCII) and maybe Chinese (little ASCII), and truncate a text to various lengths, then benchmark some simple functions.

Mar 10 '23 15:03 jakobnissen

I have changed the labeling of IUTF8 to GUTF8 this is short for Generalized UTF-8. This comes after the realization that the state machine is executing the Generalized UTF-8 states. IUTF8 arises when you do the processing of invalid bytes, as Julia does.

Mar 13 '23 21:03 ndinsmore

@oscardssmith & @StefanKarpinski There are alot of performance improvements here, and no real downside is there anything that you can see which is wrong with this?

Mar 20 '23 20:03 ndinsmore

In principle this seems good to me, but I haven't actually reviewed it in detail.

Mar 20 '23 22:03 oscardssmith

@oscardssmith I rebased this to because I wanted to include the work you did in #48996, but now it is failing two test that I don't fully understand how to fix.

Error During Test at ./julia/test/strings/basic.jl:1210
  Expression evaluated to non-Boolean
  Expression: Core.Compiler.is_foldable(e) || (f, Ts)
       Value: (length, (String,))
Error in testset strings/basic:
Error During Test at ./julia/test/strings/basic.jl:1211
  Expression evaluated to non-Boolean
  Expression: Core.Compiler.is_removable_if_unused(e) || (f, Ts)
       Value: (length, (String,))

Mar 28 '23 19:03 ndinsmore

The failing tests are the ones I added in https://github.com/JuliaLang/julia/pull/48996. The problem is that our compiler completely gives up on proving termination in the presence of loops.

Mar 28 '23 20:03 oscardssmith

@nanosoldier runtests()

May 04 '23 10:05 KristofferC

Let's run a PkgEval and merge if it comes back green.

May 04 '23 10:05 KristofferC

The problem is that our compiler completely gives up on proving termination in the presence of loops.

I wonder how hard it would be to add a special case for a loop over an integer range.

May 04 '23 20:05 StefanKarpinski

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

May 05 '23 02:05 nanosoldier

The StrBase error should probably be looked at.

May 05 '23 05:05 KristofferC

Yeah, it is a bug here somewhere. With this PR:

julia> "α"[2]
'\xb1': Malformed UTF-8 (category Ma: Malformed, bad data)

It's pretty amazing there is no test for this.

May 05 '23 06:05 KristofferC

I removed the erroneous fast path. Now StrBase.jl passes.

May 05 '23 07:05 KristofferC

@nanosoldier runbenchmarks("string", vs = ":master")

May 05 '23 07:05 KristofferC

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

May 05 '23 08:05 nanosoldier

In trying to understand and fix the benchmark results a question I have is does the ==(AbstractString ,AbstractString) call end up using dynamic dispatch and have to reload the iterate method for each call? Because I could see the need to load a large constant table affecting that speed.

May 08 '23 12:05 ndinsmore

@KristofferC I don't actually see why those methods should be getting slower all of them end up calling memcmp and none of the methods that were changed.

May 08 '23 20:05 ndinsmore

I don't actually see why those methods should be getting slower all of them end up calling memcmp and none of the methods that were changed.

I don't think that is true, they basically do:

julia> str = String('A':'Z') ^ 100;

julia> @btime cmp($SubString(str), $str) # on master
  2.893 μs (1 allocation: 32 bytes)

julia> @btime cmp($SubString(str), $str) # on this branch
  3.253 μs (1 allocation: 32 bytes)

Jul 05 '23 10:07 KristofferC

Removing merge me, since it seems that @KristofferC added that a long time ago, so it is not clear if that was still valid

Jul 06 '23 20:07 vtjnash

I agree that it is slower, but I don't see any reason it should be. I believe that the cmp method just ends up calling memcmp and doesn't call any of the methods that were changed. So in trying to figure out what to fix here, there is nothing that should have caused this slowdown.

Jul 07 '23 14:07 ndinsmore

I believe that the cmp method just ends up calling memcmp and doesn't call any of the methods that were changed

Why do you believe so?

julia> str = String('A':'Z') ^ 100;

julia> @which cmp(SubString(str), str)
cmp(a::AbstractString, b::AbstractString)
     @ Base strings/basic.jl:299

points to

https://github.com/JuliaLang/julia/blob/4cce2a2603eb837e5dbf23429b4577725e3cde46/base/strings/basic.jl#L297-L306

Jul 07 '23 14:07 KristofferC

any chance this is revivable? it's moderately conflicted, but it was a pretty nice perf improvement.

May 15 '25 03:05 oscardssmith

After the rebase, the results still show benefits with minor regressions.

I think that a lot of the regressions can be fixed with some further changes.

7-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "isascii" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "files SubString" => TrialJudgement(-0.49% => invariant)
		  "files" => TrialJudgement(+16.69% => regression)
		  "lines" => TrialJudgement(-0.11% => invariant)
		  "lines SubString" => TrialJudgement(+0.39% => invariant)
	  "single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "length 64:512" => TrialJudgement(-7.08% => invariant)
		  "length 2:64" => TrialJudgement(-9.20% => invariant)
		  "length 4096:32768" => TrialJudgement(-7.40% => invariant)
		  "length 512:4096" => TrialJudgement(-4.67% => invariant)
	  "ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "length 64:512" => TrialJudgement(-6.88% => invariant)
		  "length 2:64" => TrialJudgement(+10.15% => invariant)
		  "length 4096:32768" => TrialJudgement(+1.76% => invariant)
		  "length 512:4096" => TrialJudgement(+2.03% => invariant)
	  "Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "length 64:512" => TrialJudgement(-23.46% => improvement)
		  "length 2:64" => TrialJudgement(-14.29% => invariant)
		  "length 4096:32768" => TrialJudgement(-1.69% => invariant)
		  "length 512:4096" => TrialJudgement(-2.55% => invariant)
  "isvalid" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "files SubString" => TrialJudgement(+0.10% => invariant)
		  "files" => TrialJudgement(-0.81% => invariant)
		  "lines" => TrialJudgement(+0.40% => invariant)
		  "lines SubString" => TrialJudgement(+3.14% => invariant)
	  "single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "length 64:512" => TrialJudgement(-0.09% => invariant)
		  "length 2:64" => TrialJudgement(-0.22% => invariant)
		  "length 4096:32768" => TrialJudgement(+0.43% => invariant)
		  "length 512:4096" => TrialJudgement(-0.22% => invariant)
	  "ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "length 64:512" => TrialJudgement(-0.20% => invariant)
		  "length 2:64" => TrialJudgement(+0.80% => invariant)
		  "length 4096:32768" => TrialJudgement(-0.45% => invariant)
		  "length 512:4096" => TrialJudgement(+1.06% => invariant)
	  "Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "length 64:512" => TrialJudgement(-0.36% => invariant)
		  "length 2:64" => TrialJudgement(-1.38% => invariant)
		  "length 4096:32768" => TrialJudgement(+0.02% => invariant)
		  "length 512:4096" => TrialJudgement(+0.00% => invariant)
  "length" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "files SubString" => TrialJudgement(-78.61% => improvement)
		  "files" => TrialJudgement(-85.06% => improvement)
		  "lines" => TrialJudgement(-35.68% => improvement)
		  "lines SubString" => TrialJudgement(-12.63% => invariant)
	  "single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "length 64:512" => TrialJudgement(-22.90% => improvement)
		  "length 2:64" => TrialJudgement(+33.49% => regression)
		  "length 4096:32768" => TrialJudgement(-95.13% => improvement)
		  "length 512:4096" => TrialJudgement(-84.73% => improvement)
	  "ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "length 64:512" => TrialJudgement(-93.42% => improvement)
		  "length 2:64" => TrialJudgement(-56.31% => improvement)
		  "length 4096:32768" => TrialJudgement(-97.34% => improvement)
		  "length 512:4096" => TrialJudgement(-96.81% => improvement)
	  "Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "length 64:512" => TrialJudgement(+135.01% => regression)
		  "length 2:64" => TrialJudgement(+83.94% => regression)
		  "length 4096:32768" => TrialJudgement(+124.23% => regression)
		  "length 512:4096" => TrialJudgement(+117.76% => regression)
  "thisind" => 3-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "ascii" => TrialJudgement(+3.37% => invariant)
	  "unicode" => TrialJudgement(+30.63% => regression)
	  "malformed" => TrialJudgement(-25.35% => improvement)
  "nextind" => 3-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "ascii" => TrialJudgement(+0.51% => invariant)
	  "unicode" => TrialJudgement(-15.73% => improvement)
	  "malformed" => TrialJudgement(-40.26% => improvement)
  "iterate" => 3-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "ascii" => TrialJudgement(-5.16% => invariant)
	  "unicode" => TrialJudgement(+47.32% => regression)
	  "malformed" => TrialJudgement(-35.32% => improvement)
  "getindex" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "4-byte" => TrialJudgement(+6.87% => invariant)
	  "3-byte" => TrialJudgement(+0.00% => invariant)
	  "1-byte" => TrialJudgement(+0.00% => invariant)
	  "2-byte" => TrialJudgement(+0.51% => invariant)

`

May 16 '25 22:05 ndinsmore

I say we do it!

May 17 '25 01:05 StefanKarpinski

There are a few steps I would like to take with this code before merging.

[ ] Go back and incorporate optimizations from the baseline code that wasn't previously included.
[ ] Update code to use memory vs arrays

May 17 '25 12:05 ndinsmore

Using Memory for lookup tables is not faster than Vector as the memory indirection seem to be compiled away, but Tuples may be (they were last I benchmarked a different application with a length 256 UInt8 LUT)

May 17 '25 12:05 jakobnissen