algorithm
algorithm copied to clipboard
Musser-Nishanov generic sequence-matching algorithm(s)
Introduction
In 1997, David R. Musser and Gor V. Nishanov wrote a hitherto unpublished paper and accompanying code, A Fast Generic Sequence Matching Algorithm.
It struck me as odd that this algorithm had not been widely adopted, as Musser's introsort had been. So I contacted the authors and with their permission am attempting to bring it to a wider audience.
I provide the abstract from their paper here for completeness:
A string matching—and more generally, sequence matching—algorithm is presented that has a linear worst-case computing time bound, a low worst-case bound on the number of comparisons (2n), and sublinear average-case behavior that is better than that of the fastest versions of the Boyer-Moore algorithm. The algorithm retains its efficiency advantages in a wide variety of sequence matching problems of practical interest, including traditional string matching; large-alphabet problems (as in Unicode strings); and small-alphabet, long-pattern problems (as in DNA searches). Since it is expressed as a generic algorithm for searching in sequences over an arbitrary type T , it is well suited for use in generic software libraries such as the C ++ Standard Template Library. The algorithm was obtained by adding to the Knuth-Morris-Pratt algorithm one of the pattern-shifting techniques from the Boyer-Moore algorithm, with provision for use of hashing in this technique. In situations in which a hash function or random access to the sequences is not available, the algorithm falls back to an optimized version of the Knuth-Morris-Pratt algorithm.
In the benchmarks that I have run, it is faster than Boyer-Moore, but equivalent to Boyer-Moore-Horspool, on 8-bit text.
On '2-bit' text (DNA sequence) however, the custom search traits boosts the performance to 5-10x faster than the nearest rival. I have demonstrated this in a new search test, no. 5.
What is also interesting is that a simpler, fallback algorithm is provided for corpus iterators that are not random access.
Design
I have not changed the fundamental algorithms in a material way, they should be recognizable from the original code apart from some simplifications.
What I have done personally is to restructure the interface to match the existing Boost.Algorithm searchers. The Musser-Nishanov search strategy is complicated by having two algorithms, accelerated linear (AL) and hashed accelerated linear (HAL), that are mostly but not entirely independent. If the corpus iterator is not random-access or the (static) suffix size is zero, then AL is selected at compile time. Otherwise, a HAL search object is created, but this does not guarantee that HAL will be used. If the pattern size (known at run time) is smaller than the suffix size, then the HAL search object must fall back to AL.
The user-facing API is a two-part musser_nishanov
class that does the initial static choice between AL and HAL. Since HAL uses the same data structures as AL plus one more it is easy to implement HAL in terms of AL. This justified making AL a distinct class that is inherited publicly by the non-random-access musser_nishanov
class, and inherited privately by the HAL class so that it could be used as the fallback algorithm. The HAL class currently is the musser_nishanov
class for random access iterators, as opposed to being a distinct class.
The following is a crude UML diagram of the class hierarchy, showing the private inheritance in brown and public inheritance in blue, following Doxygen style.
Since the HAL search object makes its final decision about algorithm at run time, I had to represent this choice somehow. The simplest, though not necessarily most efficient, way is with bind()
and function<>
. It also chooses a null searcher if the pattern is empty. This method of storing which
algorithm to call might be causing a slight performance penalty, but it requires more detailed tests.
Selected Benchmarks
I whittled the benchmark output down for the purpose of displaying here. These values should be taken with a grain of 5-10% salt: Musser-Nishanov is not consistently faster than Boyer-Moore-Horspool for example.
8-bit random characters
This is an edited search_test2
output:
Corpus is 2756252 entries long
---- Middle -----
Pattern is 105 entries long
std::search 0.4848 seconds 100% 484781
boyer_moore_search 0.1171 seconds 24.16% 117117
boyer_moore_horspool_search 0.1076 seconds 22.2% 107602
musser_nishanov_search 0.1033 seconds 21.31% 103296
------ End ------
Pattern is 43 entries long
std::search 0.6206 seconds 100% 620587
boyer_moore_search 0.2075 seconds 33.44% 207515
boyer_moore_horspool_search 0.1742 seconds 28.07% 174218
musser_nishanov_search 0.1686 seconds 27.17% 168633
--- Not found ---
Pattern is 91 entries long
std::search 1.028 seconds 100% 1028228
boyer_moore_search 0.1971 seconds 19.17% 197079
boyer_moore_horspool_search 0.1825 seconds 17.75% 182502
musser_nishanov_search 0.1702 seconds 16.56% 170235
DNA sequences
And this is an edited search_test5
output, which is '2-bit' (DNA) sequence data, for which Musser-Nishanov has a few customized search traits, but I included just one here. The number --- here ---
is the pattern size, search is timed for finding all matches (which may be zero -- this benchmark needs some work).
Just to be clear, there is no musser_nishanov_dna
class, that is just shorthand for specialization.
Corpus is 997532 entries long
--- 10 ---
matches: 2
std::search 2.239 seconds 100% 2239480
boyer_moore object 0.6368 seconds 28.43% 636794
boyer_moore_horspool object 0.664 seconds 29.65% 663994
musser_nishanov object 0.7033 seconds 31.4% 703286
musser_nishanov_dna object 0.3932 seconds 17.56% 393151
--- 20 ---
matches: 0
std::search 1.99 seconds 100% 1989855
boyer_moore object 0.9167 seconds 46.07% 916750
boyer_moore_horspool object 1.006 seconds 50.54% 1005750
musser_nishanov object 0.9499 seconds 47.74% 949912
musser_nishanov_dna object 0.1749 seconds 8.791% 174933
--- 40 ---
matches: 0
std::search 1.881 seconds 100% 1880573
boyer_moore object 0.7201 seconds 38.29% 720095
boyer_moore_horspool object 0.8477 seconds 45.08% 847738
musser_nishanov object 0.8151 seconds 43.34% 815128
musser_nishanov_dna object 0.09294 seconds 4.942% 92943
--- 150 ---
matches: 11
std::search 1.983 seconds 100% 1982931
boyer_moore object 0.6318 seconds 31.86% 631754
boyer_moore_horspool object 0.6806 seconds 34.32% 680592
musser_nishanov object 0.6594 seconds 33.25% 659393
musser_nishanov_dna object 0.04976 seconds 2.509% 49759
Caveats (or, what is unfinished)
I did not want to let perfect be the enemy of good, but more importantly I wanted to get a conversation about this code underway as soon as possible, so I have opened this PR well before everything is complete.
- [ ] Documentation.
- [ ] 'wide char' benchmark.
- [ ] Non-random-access corpus benchmark.
- [ ] Worst-case benchmarks.
- [ ] The benchmark suite for search test 5 needs to be either trimmed or better utilized.
It's very interesting algorithm. Now i compare it with my Aho-Corasick and suffix automata.
I only look through the code, but I found one bug: callings next is ambiguous. Please use boost::next instead of next.
Also will be better, if you add Range-based interface.
Is there any way to find a set of patterns in corpus string with only one calling musser_nishanov?
Thanks for picking up the next()
ambiguity.
I'll add a Range-based interface eventually.
No, it's identical in semantics to Boyer-Moore, etc.
Thanks for the fast fixing :-)
In a few hours i will research your implementation more attentively. And will compare perfomance with my proposals(Ah-Corasick and suffix data structures).
Thanks. I'm glad we're coincidentally working on related algorithms and can probably help each other out.
Try to test your code on this test: corpus_string is the book "War and peace", patterns are the British dictionary. I have a segfault.
Goodness, that's a problem, thanks for discovering it. Where exactly did you get your copy of the corpus and dictionary? I'll try with the Gutenberg project texts when I have a chance.
Also, if you could tell me which line of code the segfault is on, that would be very helpful. Thanks, cheers.
Later I will add tests and traces.
Also I recommend write to Boost mailing list about your algorithm. There are a lot of good programmers in Boost and they can help you.
For downloading from Guttenberg will be useful: 'wget -w 2 -m http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en'
Fortunately, that was an easy bug to squash: I had forgotten to put the single-character-pattern edge case back in. Thanks again for picking it up.
Jeremy, what is current status of your implementation? Is it last version of this algorithm?
Essentially. I'm really just waiting for a sign from Marshall that it's an algorithm he would accept before I do any more work on it.
On 27 Oct 2016 8:30 AM, "Alexander" [email protected] wrote:
Jeremy, what is current status of your implementation? Is it last version of this algorithm?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/boostorg/algorithm/pull/25#issuecomment-256483022, or mute the thread https://github.com/notifications/unsubscribe-auth/AEa7sml5sOuhtjyI8VHfc0QsmfknjBP_ks5q38aBgaJpZM4J6CcR .
Good. I will try to remind Marshall about your work. After a lot of tests i see, that Musser-Nishanov is useful algorithm.
Actually, that does remind me that I started benchmarking for sequences of types larger than char, but every algorithm was worse than std::search, at least on random data. MN was the least-worst and it might still be worth completing those benchmarks.
On 27 Oct 2016 4:15 PM, "Alexander" [email protected] wrote:
Good. I will try to remind Marshall about your work. After a lot of tests i see, that Musser-Nishanov is useful algorithm.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/boostorg/algorithm/pull/25#issuecomment-256548439, or mute the thread https://github.com/notifications/unsubscribe-auth/AEa7sswwqJEKy_0ssFH1T0aKPfthy0TEks5q4DNXgaJpZM4J6CcR .
I found one very useful utility with a lot of algorithms, which currently i am porting to C++. This utility will help you with testing on different data sets (also this utility builds plots). https://github.com/smart-tool
What is current status of Musser-Nishanov algorithm?
@jeremy-murphy Did you test musser-nishanov algortihm with other algorithms? E.g. with Tuned B-M, KMP, Quick-Search, etc. ?
@mclow my tests show that Musser-Nishanov algorithm has very high performance. Should be tested with EBOM (my tests show that EBOM faster on 4-8 length patterns with English text (Bible)) (see this paper: http://www-igm.univ-mlv.fr/~lecroq/articles/acmsurv2013.pdf ).
(Finally) started playing with this. Noticed that there are no NM tests for search3 and search4. I added them, and search3 worked fine, but search4 failed, because the range overloads don't exist. See lines 225-268 of boyer_moore.hpp.
Very impressive performance.
Glad to (finally) hear from you, Marshall.
Ah, search test 3 and 4, yes, with strings as the search elements. I don't remember exactly why I avoided them last year, but I have uploaded a simple but experimental specialization of search_trait
for std::string
now. The benchmark results put it slightly ahead of BMH in performance.
@mclow What is missing in this PR? Can we accept the PR to Boost.Algorithm? If not, what should we fix before accepting?
I do have a list of things at the bottom that need finishing, most essentially of all is the documentation, which is in progress. But more feedback is always appreciated.
On 7 Jul 2017 1:04 AM, "Alexander" [email protected] wrote:
@mclow https://github.com/mclow What is missing in this PR? Can we accept the PR to Boost.Algorithm? If not, what should we fix before accepting?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/boostorg/algorithm/pull/25#issuecomment-313542672, or mute the thread https://github.com/notifications/unsubscribe-auth/AEa7ssgMkZ2uHeNhif4sLZIRA8Uyy56Iks5sLWfrgaJpZM4J6CcR .
@jeremy-murphy Ohh, sorry - i missed it. If you need any help, just tell me. I want to see Musser-Nishanov algorithm in Boost.Algorithm :)
Thanks, I just really need to motivate myself to finish the documentation and trim the fat on the search test 5 data, as I suspect that is the only real hurdle to merging. Extra benchmarks would be nice but the existing ones already demonstrate that it dominates the current algorithms.
@jeremy-murphy I know that current situation is very strange. You did a lot of hard work: implemented very fast algorithm, wrote benchmarks, etc.
I try to sync with @mclow every day :-) But he has a lot work in Boost Release Team, C++ Comitee, libc++, etc.
Don't worry! I am trying to solve your problem. Just wait (yeah, open source is so slow...)
@mclow I remind you about this search algorithm :-)
@mclow can you check this PR please?
@ZaMaZaN4iK it's OK, I'll call for Marshall's attention when it's finished. The list of caveats/unfinished things is real; I'll tick them off or remove them when they're done or remove them completely if I change my mind about them. No need to keep drawing his attention until then, but I'm always ready to engage in a conversation about what is currently here.
Very, very odd. I just got an email about three new commits. But when I come here, I see they were made 16 days ago.
Very, very odd. I just got an email about three new commits. But when I come here, I see they were made 16 days ago.
Some of the commits might be that old, I only just pushed for the first time in ages.