StringZilla
                                
                                 StringZilla copied to clipboard
                                
                                    StringZilla copied to clipboard
                            
                            
                            
                        Better Heuristics for Substring Search
So I recently had to implement a fast text filter for work (in browser JavaScript, so no use for your library (yet) I'm afraid), and used the same-ish kind of insight but inverted: given that nearby characters are strongly correlated, then the first and last characters in a substring should be the least correlated. Or put in other words: suppose two (eventually) different words match their first n characters so far. Of all remaining characters, character n+1 is most likely to be the same, so actually the worst character to test next.
Statistically speaking, comparing the first and last characters has a higher chance to filter out mismatching strings early, avoiding false positives (of course, I'm well aware that linear memory access patterns might have a much bigger impact than this statistic).
This was not my original insight - Wojciech Muła also does this in one of his SIMD string search algorithms: http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd (without justification though, he left that as an exercise for the reader)
Looking through the code-base, I see that you first match on a four-character needle, then finish comparing on the suffixes with this sz_equal function: https://github.com/ashvardanian/StringZilla/blob/14e7a78edcc16b031c06b375aac1f66d8f19d45a/stringzilla/stringzilla.h#L94-L98
My idea could be (relatively) quickly benchmarked by modifying this function. I haven't written C++ in a while, but I think this would be a correct "first test the last character, then test the remaining string characters" variation:
sz_bool_t sz_equal(sz_string_start_t a, sz_string_start_t b, sz_size_t length) {
    sz_string_start_t const a_end = a + length - 1;
    if (a <= a_end && *a_end != b[length - 1]) return false;
    while (a != a_end && *a == *b) a++, b++;
    return a_end == a;
}
Or a "compare back-to-front" variation:
sz_bool_t sz_equal(sz_string_start_t a, sz_string_start_t b, sz_size_t length) {
    sz_string_start_t a_end = a + length - 1;
    sz_string_start_t b_end = b + length - 1;
    if (a <= a_end && *a_end != b[length - 1]) return false;
    while (a != a_end && *a == *b) a++, b++;
    return a_end == a;
}
I am aware that simple forward linear memory access is faster than random or backward memory access, so I would not be surprised if both of these functions are slower in practice, regardless of what statistics says. But we won't know until we try, right?
For completeness sake here is my JS data structure, although I don't think it is particularly relevant as it solves a slightly different (but adjacent) problem than StringZilla
https://observablehq.com/@jobleonard/a-data-structure-for-table-row-filtering
The use-case is real-time filtering a data table based on user-input, leaving only rows that contain a substring match in any data cell with the input string. So it's more optimized for "latency". It pre-processes the indices of all trigrams of the strings in the table, and uses these as needles. For an input string it filters on the first trigram, then filters backwards from the last trigram in the input substring. Because trigrams are looked up from a hasmap the forward or backward access pattern concern does not apply here. It also uses previous results to narrow down the results faster (e.g. if a user typed the, then types a y to produce they, the filter only searches the existing results of the instead of starting over)
Hi @JobLeonard! Sorry for a late response, didn’t see the issue.
That’s a good suggestion for the serial version! I haven’t spent much time optimizing it.
On most modern CPUs the forward and backward passes over memory are equally fast, AFAIK. It might be a good idea to also add a version that uses 64 bit integers, if misaligned reads are allowed. Would you be interested in trying a couple of such approaches and submitting a PR?
In case you do, the CONTRIBUTING.md file references several datasets for benchmarks 🤗
I'll take a shot at it, with the following caveats:
- I only have one laptop to benchmark it with (a six years old Lenovo P51 with an Intel® Core™ i7-7820HQ Processor running KDE Neon (Ubuntu LTS 22.04 based))
- don't have too much spare time so the GCC 12 I already have installed will have to do.
- "read string as 64 bit integers" sounds like it's great when everything aligns neatly, but wouldn't that require a lot of special casing? Checks for whether sz_size_t lengthis less than eight characters, and for whethersz_string_start_t aand/orsz_string_start_t bstart misaligned, or end misaligned. Start or end within the same 8-byte word or not. That's a lot of variations to consider (unless there's an obvious bitmasking + aligned reads trick that I'm too tired to work out in my head right now). So I think I'll skip that for now.
So my benchmark nrs will be limited to a few simple variations of sz_equal on x86-64, with AVX2 thrown in as a point of comparison too, and therefore only useful as a first sanity check for whether this idea is worth investigating further. Is that ok?
Sure @JobLeonard, you can start and I can join a bit later.
For benchmarking I often take cloud instances (c7g and r7iz) to gain access to more hardware. Might be useful to you too 🤗
One thing to watch out for - on very short strings (well under 64 bytes) we are optimizing the for-loop. On longer strings, if we take the first and the last character - we end up fetching 2 cache lines from each string, instead of just one.
Hi @JobLeonard,
I have a big update! I've generalized the substring search methods to be able to match different characters within the needle. The method that infers the best targets is called _sz_locate_needle_anomalies. For long needles and small alphabets, updating it may  have a noticeable impact. Here is the code:
https://github.com/ashvardanian/StringZilla/blob/cbdae26b222a66c746af1938512932f001dcbfcb/include/stringzilla/stringzilla.h#L1586-L1657
Not sure about what would be the best dataset for such benchmarks, seems like this is related to #91.
Thanks for the heads-up, looks like a worthwhile change for the examples given in the doc comment.
I was about to write that I'm still interested in giving this a go but have been very busy at work in the last month. Not entirely sure when I manage to free up some time again but just wanted to re-assure you that I haven't forgotten about it!