How to get accurate results using callback functions after hs_scan

Open ryankang95 opened this issue 6 years ago • 1 comments

Hi, I want to know how to get the substring that matched a pattern. For example, given the string "hello 2019, by 2017", I want to use '[0-9]+' to get {2019, 2017}. I can get this result with boost::regex_search(), but when I run the same pattern in Hyperscan I can't get the same result.

Here is my code, modified from simplegrep.c. What's wrong?

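` static int eventHandler(unsigned int id, unsigned long long from, unsigned long long to, unsigned int flags, void *ctx) {
    //printf("Match for pattern \"%s\" at offset %llu\n", (char *)ctx, to);
    printf("matched!\n");
    // from is only meaningful if the pattern was compiled with HS_FLAG_SOM_LEFTMOST; otherwise it is 0.
    int length = (int)(to - from);
    // Reserve one extra byte: strncpy does not NUL-terminate when it copies exactly length bytes.
    char *a = (char *)malloc((size_t)length + 1);
    if (a == NULL) {
        return 1; // a non-zero return stops the scan
    }
    printf("from is %llu, end is %llu, length is %d \n", from, to, length);
    strncpy(a, (char *)ctx + from, (size_t)length);
    a[length] = '\0';
    printf("Match result is %s \n", a);
    // Printing ctx with %s assumes the input buffer is NUL-terminated.
    printf("test is %s \n", (char *)ctx);
    free(a);
    //size_t *matches = (size_t *)ctx;
    //(*matches)++;
    return 0;
}`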

hs_scan(database, inputData, length, 0, scratch, eventHandler, inputData)

My test case is ./simplegrep [0-9]+ test.txt, where test.txt contains just one string, "hello 2018, bye 2017".

My result is:

Scanning 21 bytes with Hyperscan

matched!
from is 6, end is 7, length is 1
Match result is 2
test is hello 2018, bye 2017

matched!
from is 6, end is 8, length is 2
Match result is 20
test is hello 2018, bye 2017

matched!
from is 6, end is 9, length is 3
Match result is 201
test is hello 2018, bye 2017

matched!
from is 6, end is 10, length is 4
Match result is 2018
test is hello 2018, bye 2017

matched!
from is 16, end is 17, length is 1
Match result is 2
test is hello 2018, bye 2017

matched!
from is 16, end is 18, length is 2
Match result is 20
test is hello 2018, bye 2017

matched!
from is 16, end is 19, length is 3
Match result is 201
test is hello 2018, bye 2017

matched!
from is 16, end is 20, length is 4
Match result is 2017
test is hello 2018, bye 2017

ryankang95 • Oct 15 '19 07:10

  • Hyperscan doesn't support greedy or ungreedy semantics but reports all matches instead. So in your case, "2", "20", "201" and "2018" are all valid matches for [0-9]+. You can refer to http://intel.github.io/hyperscan/dev-reference/compilation.html#semantics for more details.
  • Hyperscan doesn't support capturing so you are not able to get matched sub-strings from your input.

In general, Hyperscan's main target use case is networking, where performance is the top priority in order to avoid denial of service. Unlike traditional regex matching libraries such as PCRE and Boost, Hyperscan's underlying design avoids backtracking-based approaches, which have no worst-case performance guarantee. Greediness and capturing need to be addressed with backtracking, so we don't support them so far.

xiangwang1 • Oct 16 '19 01:10
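
Since every prefix of a run is reported as its own match, one possible workaround (a sketch, not something proposed in the thread) is to collapse the events inside the callback by keeping only the largest `to` seen for each `from`. The handler below has the same signature as Hyperscan's match callback, but `struct span`, `struct collector`, `on_match`, `replay_issue_events` and `span_text` are hypothetical names introduced for illustration. To stay runnable without libhs, it replays the (from, to) pairs printed in the question instead of calling hs_scan. It assumes the pattern was compiled with HS_FLAG_SOM_LEFTMOST (so `from` is valid) and that events sharing a start offset arrive one after another.

```c
#include <assert.h>
#include <string.h>

#define MAX_SPANS 16

/* One collapsed match: half-open byte range [from, to) in the input. */
struct span {
    unsigned long long from;
    unsigned long long to;
};

/* Hypothetical context object handed to the callback through ctx. */
struct collector {
    struct span spans[MAX_SPANS];
    size_t count;
};

/* Same parameter list as a Hyperscan match callback. Keeps one span per
 * start offset, extended to the largest end offset reported for it. */
static int on_match(unsigned int id, unsigned long long from,
                    unsigned long long to, unsigned int flags, void *ctx) {
    struct collector *c = ctx;
    (void)id;
    (void)flags;
    assert(to >= from);
    if (c->count > 0 && c->spans[c->count - 1].from == from) {
        /* Same start as the previous event: keep the longer end. */
        if (to > c->spans[c->count - 1].to) {
            c->spans[c->count - 1].to = to;
        }
        return 0;
    }
    if (c->count == MAX_SPANS) {
        return 1; /* non-zero asks the scanner to stop */
    }
    c->spans[c->count].from = from;
    c->spans[c->count].to = to;
    c->count++;
    return 0;
}

/* Replays the (from, to) pairs shown in the question's output in place of
 * a real hs_scan call. Returns the number of collapsed spans. */
static size_t replay_issue_events(struct collector *c) {
    static const struct span events[] = {
        {6, 7},   {6, 8},   {6, 9},   {6, 10},
        {16, 17}, {16, 18}, {16, 19}, {16, 20},
    };
    c->count = 0;
    for (size_t i = 0; i < sizeof events / sizeof events[0]; i++) {
        if (on_match(0, events[i].from, events[i].to, 0, c) != 0) {
            break;
        }
    }
    return c->count;
}

/* Copies the bytes of span s from input into out, truncating if needed,
 * and always NUL-terminates. Returns the number of bytes copied. */
static size_t span_text(const char *input, struct span s,
                        char *out, size_t out_size) {
    size_t len = (size_t)(s.to - s.from);
    if (out_size == 0) {
        return 0;
    }
    if (len >= out_size) {
        len = out_size - 1;
    }
    memcpy(out, input + s.from, len);
    out[len] = '\0';
    return len;
}
```

With the events from the question, `replay_issue_events` leaves two spans, and `span_text` on "hello 2018, bye 2017" yields "2018" and "2017". In a real program, `on_match` and a `struct collector *` would be passed as the last two arguments of hs_scan. This only emulates leftmost-longest reporting for simple cases like `[0-9]+`; it is not a substitute for capture groups.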