aho_corasick
aho_corasick copied to clipboard
lower performance compared to exhaustive search ?
Hello, I got a performance problem about the ac machine of this project, and the following was my job.
First i constructed a trie tree with 2700+ patterns and did something about string-matching.
#include
#include "aho_corasick.hpp"
std::chrono::steady_clock::time_point time_start_point() { return std::chrono::steady_clock::now(); }
double time_duration(const std::chrono::steady_clock::time_point& start) { auto end = time_start_point(); auto span = std::chrono::duration_caststd::chrono::microseconds(end - start).count(); return ((double)span) / 1000.0; }
int main(int argc, const char * argv[]) { auto time_start_0 = time_start_point(); std::string match_str = "quick 百度 red fox,paddlecloud相关的文档有吗?jumps over lazy brown dog that owns brown house";
// create instance aho_corasick::trie automation; std::ifstream in("events.txt"); char word[256]; while(!in.eof()){ in.getline(word, 256); //std::cout << "hello:" << word << std::endl; automation.insert(word); //high_user.insert(buffer); } in.close(); // add patterns to search for automation.insert("quick"); auto construct_time = time_duration(time_start_0); std::cout << "construct time: " << construct_time << std::endl;
// begin searching auto time_start = time_start_point(); auto results = automation.parse_text(match_str);
auto prepare_time = time_duration(time_start); std::cout << "size of results: "<< results.size() << std::endl; for (const auto& m: results) {
//std::cout << m << std::endl;
} std::cout << "match cost time: " << prepare_time << std::endl;
}
the outputs were as follows:
construct time: 28.581 size of results: 2 match cost time: 53.299
Next, i did an experiment about exhaustive search using Python:
`import os import time cur_dir = os.path.dirname(os.path.abspath(file)) + '/' data_dir = "/Users/wanfan01/workspace/cpp/learning/ac_machine/AhoCorasick/events.txt"
if name == "main": words_ = list() for line in open(data_dir, 'r', encoding='utf8'): line = line.strip() if line: words_.append(line) words_.append("quick")
begin_time = time.time() * 1000
sentence = "quick 百度 red fox,paddlecloud相关的文档有吗?jumps over lazy brown dog that owns brown house"
match_res = []
for w in words_:
if w in sentence:
match_res.append(w)
end_time = time.time() * 1000
cost_time = end_time - begin_time
print("match costing time: %.3f" % cost_time)
`
the output was:
match costing time: 0.347
Obviously, ac machine (53.299 ms) performed worse than exhaustive search (0.347ms) under the same conditions.
I am so confused, and wonder if there was any mistake about my experiments ??
please do performance after a few random search