aho_corasick icon indicating copy to clipboard operation
aho_corasick copied to clipboard

lower performance compared to exhaustive search ?

Open FanWan opened this issue 3 years ago • 4 comments

Hello, I got a performance problem about the ac machine of this project, and the following was my job.

First i constructed a trie tree with 2700+ patterns and did something about string-matching.

FanWan avatar May 24 '21 07:05 FanWan

#include #include #include #include #include #include

#include "aho_corasick.hpp"

std::chrono::steady_clock::time_point time_start_point() { return std::chrono::steady_clock::now(); }

double time_duration(const std::chrono::steady_clock::time_point& start) { auto end = time_start_point(); auto span = std::chrono::duration_caststd::chrono::microseconds(end - start).count(); return ((double)span) / 1000.0; }

int main(int argc, const char * argv[]) { auto time_start_0 = time_start_point(); std::string match_str = "quick 百度 red fox,paddlecloud相关的文档有吗?jumps over lazy brown dog that owns brown house";

// create instance aho_corasick::trie automation; std::ifstream in("events.txt"); char word[256]; while(!in.eof()){ in.getline(word, 256); //std::cout << "hello:" << word << std::endl; automation.insert(word); //high_user.insert(buffer); } in.close(); // add patterns to search for automation.insert("quick"); auto construct_time = time_duration(time_start_0); std::cout << "construct time: " << construct_time << std::endl;

// begin searching auto time_start = time_start_point(); auto results = automation.parse_text(match_str);

auto prepare_time = time_duration(time_start); std::cout << "size of results: "<< results.size() << std::endl; for (const auto& m: results) {

  //std::cout << m << std::endl;

} std::cout << "match cost time: " << prepare_time << std::endl;

}

FanWan avatar May 24 '21 07:05 FanWan

the outputs were as follows:

construct time: 28.581 size of results: 2 match cost time: 53.299

FanWan avatar May 24 '21 07:05 FanWan

Next, i did an experiment about exhaustive search using Python:

`import os import time cur_dir = os.path.dirname(os.path.abspath(file)) + '/' data_dir = "/Users/wanfan01/workspace/cpp/learning/ac_machine/AhoCorasick/events.txt"

if name == "main": words_ = list() for line in open(data_dir, 'r', encoding='utf8'): line = line.strip() if line: words_.append(line) words_.append("quick")

begin_time = time.time() * 1000
sentence = "quick 百度 red fox,paddlecloud相关的文档有吗?jumps over lazy brown dog that owns brown house"

match_res = []
for w in words_:
    if w in sentence:
        match_res.append(w)
end_time = time.time() * 1000
cost_time = end_time - begin_time
print("match costing time: %.3f" % cost_time)

`

the output was:

match costing time: 0.347

Obviously, ac machine (53.299 ms) performed worse than exhaustive search (0.347ms) under the same conditions.

I am so confused, and wonder if there was any mistake about my experiments ??

FanWan avatar May 24 '21 07:05 FanWan

please do performance after a few random search

yqynju avatar Jun 05 '21 07:06 yqynju