jetscii
jetscii copied to clipboard
Improve substring search performance for corner cases
I've subjected jetscii 0.3.1 to a benchmark suite of just a few pathological examples for substring search.
_jetscii benches very well_, there's just evidence of pathlogical cases
Result (may be noisy, i.e. individual benchmarks may be flukes)
-
find
-str::find
with&str
pattern -
rfind
-str::rfind
with&str
pattern -
jetscii_find
str::find
withSubstring
pattern -
pcmp_find
experimental function twsearch::pcmp::find, an incomplete, incorrect two-way search -
regex_find
using regex
test bad_naive::find ... bench: 274 ns/iter (+/- 19) = 270 MB/s
test bad_naive::jetscii_find ... bench: 504 ns/iter (+/- 118) = 146 MB/s
test bad_naive::pcmp_find ... bench: 96 ns/iter (+/- 1) = 770 MB/s
test bad_naive::regex_find ... bench: 1,077 ns/iter (+/- 89) = 68 MB/s
test bad_naive::rfind ... bench: 231 ns/iter (+/- 16) = 320 MB/s
test bad_naive_reversed::find ... bench: 175 ns/iter (+/- 9) = 422 MB/s
test bad_naive_reversed::jetscii_find ... bench: 54 ns/iter (+/- 4) = 1370 MB/s
test bad_naive_reversed::pcmp_find ... bench: 43 ns/iter (+/- 4) = 1720 MB/s
test bad_naive_reversed::regex_find ... bench: 36 ns/iter (+/- 3) = 2055 MB/s
test bad_naive_reversed::rfind ... bench: 367 ns/iter (+/- 54) = 201 MB/s
test pathological_aa_100k::find ... bench: 3,084 ns/iter (+/- 149) = 32425 MB/s
test pathological_aa_100k::jetscii_find ... bench: 24,116 ns/iter (+/- 1,041) = 4146 MB/s
test pathological_aa_100k::pcmp_find ... bench: 27,241 ns/iter (+/- 2,031) = 3670 MB/s
test pathological_aa_100k::regex_find ... bench: 4,453 ns/iter (+/- 262) = 22456 MB/s
test pathological_aa_100k::rfind ... bench: 2,824 ns/iter (+/- 207) = 35410 MB/s
test pathological_aab_100k::find ... bench: 889,518 ns/iter (+/- 55,787) = 337 MB/s
test pathological_aab_100k::jetscii_find ... bench: 265,300 ns/iter (+/- 15,265) = 1130 MB/s
test pathological_aab_100k::pcmp_find ... bench: 183,135 ns/iter (+/- 11,808) = 1638 MB/s
test pathological_aab_100k::regex_find ... bench: 3,750,744 ns/iter (+/- 392,444) = 79 MB/s
test pathological_aab_100k::rfind ... bench: 859,327 ns/iter (+/- 104,678) = 348 MB/s
test pathological_two_way_10k::find ... bench: 107,896 ns/iter (+/- 48,497) = 278 MB/s
test pathological_two_way_10k::jetscii_find ... bench: 7,406 ns/iter (+/- 1,072) = 4050 MB/s
test pathological_two_way_10k::pcmp_find ... bench: 19,444 ns/iter (+/- 22,302) = 1542 MB/s
test pathological_two_way_10k::regex_find ... bench: 1,329 ns/iter (+/- 2,413) = 22573 MB/s
test pathological_two_way_10k::rfind ... bench: 2,463 ns/iter (+/- 4,146) = 12180 MB/s
test pathological_two_way_20k::find ... bench: 305,465 ns/iter (+/- 505,176) = 327 MB/s
test pathological_two_way_20k::jetscii_find ... bench: 25,179 ns/iter (+/- 39,142) = 3971 MB/s
test pathological_two_way_20k::pcmp_find ... bench: 64,983 ns/iter (+/- 113,250) = 1538 MB/s
test pathological_two_way_20k::regex_find ... bench: 6,175 ns/iter (+/- 12,215) = 16194 MB/s
test pathological_two_way_20k::rfind ... bench: 3,673 ns/iter (+/- 6,390) = 27225 MB/s
test pathological_two_way_20k_reversed::find ... bench: 2,492 ns/iter (+/- 4,523) = 24077 MB/s
test pathological_two_way_20k_reversed::jetscii_find ... bench: 45,084 ns/iter (+/- 80,989) = 1330 MB/s
test pathological_two_way_20k_reversed::pcmp_find ... bench: 39,623 ns/iter (+/- 65,178) = 1514 MB/s
test pathological_two_way_20k_reversed::regex_find ... bench: 406,637 ns/iter (+/- 730,079) = 147 MB/s
test pathological_two_way_20k_reversed::rfind ... bench: 155,639 ns/iter (+/- 284,944) = 385 MB/s
test short_long::find ... bench: 3,924 ns/iter (+/- 4,175) = 650 MB/s
test short_long::jetscii_find ... bench: 676 ns/iter (+/- 1,291) = 3773 MB/s
test short_long::pcmp_find ... bench: 810 ns/iter (+/- 1,531) = 3149 MB/s
test short_long::regex_find ... bench: 2,821 ns/iter (+/- 748) = 904 MB/s
test short_long::rfind ... bench: 2,253 ns/iter (+/- 3,445) = 1132 MB/s
test short_short::find ... bench: 68 ns/iter (+/- 119) = 823 MB/s
test short_short::jetscii_find ... bench: 109 ns/iter (+/- 127) = 513 MB/s
test short_short::pcmp_find ... bench: 43 ns/iter (+/- 69) = 1302 MB/s
test short_short::regex_find ... bench: 54 ns/iter (+/- 122) = 1037 MB/s
test short_short::rfind ... bench: 125 ns/iter (+/- 160) = 448 MB/s
Conclusions:
- jetscii 0.3.1 has algorithmic trouble, probably O(n²) problematics, see testcase
bad_naive
- Both jetscii and pcmp are losing to the byteset optimization in
libstd
's StrSearcher, see testcasepathological_aa_100k
I realized the bad_naive
testcase was a bit short -- the needle below 16 bytes. A proper testcase is a worse pathological case for jetscii!
haystack: (0..100_000).map(|_| "a").collect::<String>()
needle: (0..100).map(|_| "a").collect::<String>() + "b"
test bad_naive_long::find ... bench: 299,292 ns/iter (+/- 37,567) = 334 MB/s
test bad_naive_long::jetscii_find ... bench: 10,303,063 ns/iter (+/- 194,054) = 9 MB/s
test bad_naive_long::pcmp_find ... bench: 23,807 ns/iter (+/- 225) = 4200 MB/s
as consolation, regex is even worse..
Awesome, thanks for this! I'm certainly not surprised that there are pathological cases, but having them pointed out in test cases is even nicer.
One thing I assume will help is to not increment by one byte on a false positive. Any highlights of normal things to avoid before I read your tests and the standard lib implementation?
9 MB/s
Ouch. Lots to fix!
To avoid all algorithmic trouble, employ a substring search algorithm, i.e. some system to make sure it advances enough per trial. Here's an nice overview http://www-igm.univ-mlv.fr/~lecroq/string/
I don't know enough to have specific advice. I'm obviously trying to impl the two way algorithm with pcmp_find, but I'm not sure it's the best algorithm to pick for SSE4.2 substring search.
The MB/s is bytes of haystack searched per second, of course. Most of the testcases don't match, so they effectively search it all.
One thought is that the pcmpestrm
instruction should be able to tell which byte fails strcmp, which I believe can be used to stride further.
absolutely, that's what you need to to for the two way algorithm.
Without the factorization that two-way does of the needle, I don't think you can step to the failed match position in general.
Not sure I can recommend diving into string search algorithms. :smile: My revival PR of two-way for libstd took me a week of work, and I didn't even implement it from scratch
That said, I think the pcmp_find is part way there. But I'd consider a different algorithm entirely maybe. Maybe Boyer-Moore-Horspool?
a different algorithm
Yeah, I'll skim through and see if anything jumps out as a key match. One nice thing is that you and I can collaborate on ideas and compete on implementations, and everyone wins!
Note that the byteset optimization is why str::find
is so fast in the substring benchmark in your Readme. If you add an a
anywhere in the needle, that optimization won't work.
Have you seen strstr_sse42 in http://www.strchr.com/strcmp_and_strlen_using_sse_4.2?
@ArtemGr that's a quadratic algorithm too (it's a naive search)
; continue searching from the next byte
I wondered if the next byte meant the one after the 15 already checked. Duh, at least I tried. )
BTW, you ever run those benchmars with native optimizations on (target=native)?
I didn't think about it. The pcmpestri instruction is emitted anyway since it's in an explicit asm!()
block, so I'm not sure what could improve.
The pcmpestri instruction is emitted anyway since it's in an explicit asm!() block, so I'm not sure what could improve.
The SSE version won't improve, but the non-SSE versions might. We could see how LLVM fares against manual assembly. Without "target=native" it can't compete.
That's true. Most of libcore's substring search (str::find) is bounds checked, and that often makes it impossible for the optimizer to do autovectorization. Crucially the things in the tightest loop in str::find
are not very amenable to vectorization.
I think it's harder to vectorize the more advanced string search algorithm (by hand or by compiler). I've been experimenting with that, (pcmp_find) but my question is which string search algorithm is the best for this. Since I posted this my pcmp_find experiment has actually progressed, it's an improvement over str::find in every case now, but of course jetscii can beat it depending on the case.
In the easy cases, the simpler algorithm wins, in the tricky cases the string search algorithm should win.
I'm sure I've tried target-cpu=native with that at some point, and I'll do it again now.
when I run the benchmarks in the "twsearch" repo now, I see no difference in the twoway_find algorithm with native or not, unfortuately. It could probably be improved to be nicer for optimization some way, but the parts of the algorithm that are in the tightest loop (those parts here) are not so easy. The byteset check is random access of a single byte, and the "right part of needle" loop needs to count the number of bytes that matched.
I hadn't much luck with "cargo rustc" so I did "cargo bench --verbose" and repeated a few "rustc" commands with "-C target-cpu=native" added. Normal: http://pastebin.com/4PGJMkSv Native: http://pastebin.com/EbJKjaVT
What I've noticed (normal vs. native):
test aaab_in_aab::twoway_rfind ... bench: 1,503,030 ns/iter (+/- 5,837) = 199 MB/s test aaab_in_aab::twoway_rfind ... bench: 857,109 ns/iter (+/- 1,914) = 349 MB/s
test pathological_two_way_rev::twoway_rfind ... bench: 308,092 ns/iter (+/- 382) = 194 MB/s test pathological_two_way_rev::twoway_rfind ... bench: 150,717 ns/iter (+/- 255) = 398 MB/s
Might be just quirks, I haven't examined the assembly. Just my two cents.
The byteset check is random access of a single byte
Yeah, I can see how this would be hard for LLVM to optimize. I rather hoped that the scratchspace contained some naive searchers that we'd see turned by LLVM into SSE instructions.
cool ok, I have to look at tht!
I'm only using one particular platform (llvm calls it corei7-avx), so of course that is kind of a one eyed view on optimization.
Well, give me a shout if you want to run some Opteron-3365 benchmarks.
Ok that's exciting, with native it seems to emit some avx instructions here, but yes, I think it's in a location that's outside of the innermost search loop. Getting it to emit a vectorized loop for the first counted loop (match first half of needle) would be great.
Alright, I guess I should think about multiple different platforms and microarches at some point
@bluss If you still happen to have this benchmarking code (repo is longer available), it would be great to adapt for https://github.com/shepmaster/jetscii/issues/54