pattern-bench icon indicating copy to clipboard operation
pattern-bench copied to clipboard

TBS Added

Open pinwhell opened this issue 10 months ago • 12 comments

pinwhell avatar Apr 10 '24 04:04 pinwhell

So I run the tests with your scanner, and to be honest it's really slow. Looking at your code, you are optimising for the wrong case. You are using SIMD to try and compare multiple bytes at a given position (acting like memcmp), but 99% of the time, even just the first byte won't match. You want to be using SIMD to compare multiple multiple positions at once (acting like memchr) instead.

mem::simd_scanner                |   1860378659 cycles =  0.217 cycles/byte |  1.00x
mem::boyer_moore_scanner         |   7509269609 cycles =  0.874 cycles/byte |  4.04x
Simple                           |  21355210217 cycles =  2.486 cycles/byte | 11.48x
DarthTon                         |  22073471819 cycles =  2.570 cycles/byte | 11.87x
CFX                              |  27702467142 cycles =  3.225 cycles/byte | 14.89x
TBS                              |  86614179798 cycles = 10.083 cycles/byte | 46.56x
Forza (Boyer-Moore Variant)      | failed
DarthTon v2                      | failed
mrexodia (horspool)              | failed

0x1F9F1 avatar Apr 20 '24 19:04 0x1F9F1

i see good, suggestion, i will add a case for that, also, the reason why it consume so much cycles, its because the type of overhead from data structures potentially, TBS thrives better in large scale setups, like multiple patterns + multiple targets + shared targets, batch scan likes setups

pinwhell avatar Apr 21 '24 12:04 pinwhell

Regarding your comment

You are using SIMD to try and compare multiple bytes at a given position (acting like memcmp), but 99% of the time, even just the first byte won't match. You want to be using SIMD to compare multiple multiple positions at once (acting like memchr) instead.

i would love to get more insight on what it exactly means, you correctly pointed the fact that the code is always checking for multiple bytes at once even if the first byte is wrong, which is valid case for improvement, but regarding the point of acting like memcmp vs acting like memchr, can you explain a bit more on it

pinwhell avatar Apr 21 '24 18:04 pinwhell

Duration Bench in the duration aspect of TBS searching is insanely fast, locally it outperforms all the other libraries as shown next:

FindPattern benchmark
Page size: 4096, allocating 64 pages (including 2 guard pages).
Running tests on 17 different implementations
===========
Running learn_more
Finding pattern 0 x 1024 took 943.973 ms.
Finding pattern 1 x 1024 took 1196.57 ms.
===========
Running learn_more v2
Finding pattern 0 x 1024 took 551.838 ms.
Finding pattern 1 x 1024 took 504.597 ms.
===========
Running fdsasdf
Finding pattern 0 x 1024 took 227.914 ms.
Finding pattern 1 x 1024 took 228.882 ms.
===========
Running DarthTon
Finding pattern 0 x 1024 took 398.162 ms.
Finding pattern 1 x 1024 took 425.736 ms.
===========
Running DarthTon v2
Finding pattern 0 x 1024 took 47.102 ms.
Finding pattern 1 x 1024 took 50.934 ms.
===========
Running Forza (Boyer-Moore Variant)
Finding pattern 0 x 1024 took 82.555 ms.
Finding pattern 1 x 1024 took 23.215 ms.
===========
Running Forza (SIMD With OpenMP)
Finding pattern 0 x 1024 took 25.226 ms.
Finding pattern 1 x 1024 took 25.937 ms.
===========
Running kokole
Finding pattern 0 x 1024 took 601.645 ms.
Finding pattern 1 x 1024 took 532.926 ms.
===========
Running mrexodia
Finding pattern 0 x 1024 took 1070.98 ms.
Finding pattern 1 x 1024 took 1041.12 ms.
===========
Running atom0s
Finding pattern 0 x 1024 took 428.373 ms.
Finding pattern 1 x 1024 took 432.73 ms.
===========
Running atom0s (mrexodia modification)
Finding pattern 0 x 1024 took 1706.57 ms.
Finding pattern 1 x 1024 took 1711.16 ms.
===========
Running mrexodia (horspool)
Finding pattern 0 x 1024 took 133.752 ms.
Failed to find pattern.
===========
Running dom1n1k_Patrick
Finding pattern 0 x 1024 took 524.371 ms.
Finding pattern 1 x 1024 took 535.93 ms.
===========
Running Michael
Finding pattern 0 x 1024 took 387.001 ms.
Finding pattern 1 x 1024 took 385.19 ms.
===========
Running superdoc1234
Finding pattern 0 x 1024 took 232.073 ms.
Finding pattern 1 x 1024 took 229.238 ms.
===========
Running stevemk14ebr
Finding pattern 0 x 1024 took 908.208 ms.
Finding pattern 1 x 1024 took 1099.22 ms.
===========
Running TBS
Finding pattern 0 x 1024 took 8.787 ms.
Finding pattern 1 x 1024 took 11.462 ms.
Done.

pinwhell avatar Apr 21 '24 19:04 pinwhell

Duration Bench in the duration aspect of TBS searching is insanely fast, locally it outperforms all the other libraries as shown next:

I think you need to clear the cache in runOne, since it does multiple iterations of each pattern

0x1F9F1 avatar Apr 21 '24 19:04 0x1F9F1

Regarding your comment

You are using SIMD to try and compare multiple bytes at a given position (acting like memcmp), but 99% of the time, even just the first byte won't match. You want to be using SIMD to compare multiple multiple positions at once (acting like memchr) instead.

i would love to get more insight on what it exactly means, you correctly pointed the fact that the code is always checking for multiple bytes at once even if the first byte is wrong, which is valid case for improvement, but regarding the point of acting like memcmp vs acting like memchr, can you explain a bit more on it

Well let's say you load 16 bytes with SIMD. You could check if those bytes match 16 bytes of the pattern. That's like memcmp. If the data you are searching has lots of near-matches, then that might be useful.

But you are searching data where there is probably only 1 match. So most of the time even 1 byte won't match.

Instead, you could check if any of those 16 bytes match a specific byte of the pattern. That's like memchr. This can let you very quickly eliminate positions, ie if there are no matches in those 16 bytes, you've just eliminated 16 positions.

0x1F9F1 avatar Apr 21 '24 20:04 0x1F9F1

Instead, you could check if any of those 16 bytes match a specific byte of the pattern. That's like memchr

inline bool CompareWithMask(const UByte* chunk1, const UByte* chunk2, size_t len, const UByte* wildCardMask) {

  if ((chunk1[0] & ~wildCardMask[0]) != (chunk2[0] & ~wildCardMask[0]))
      return false;
      
  ...
}

this should technically do the trick, and avoid all the extra 15 bytes check overhead.

pinwhell avatar Apr 21 '24 20:04 pinwhell

I think you need to clear the cache in runOne, since it does multiple iterations of each pattern

you are correct in pointing that out. it looks like it actually under-perform the other libraries impls

Running TBS
Finding pattern 0 x 1024 took 1089.66 ms.
Finding pattern 1 x 1024 took 2129.22 ms.
Done.

pinwhell avatar Apr 21 '24 20:04 pinwhell

mem::simd_scanner | 1860378659 cycles = 0.217 cycles/byte | 1.00x mem::boyer_moore_scanner | 7509269609 cycles = 0.874 cycles/byte | 4.04x Simple | 21355210217 cycles = 2.486 cycles/byte | 11.48x DarthTon | 22073471819 cycles = 2.570 cycles/byte | 11.87x CFX | 27702467142 cycles = 3.225 cycles/byte | 14.89x TBS | 86614179798 cycles = 10.083 cycles/byte | 46.56x Forza (Boyer-Moore Variant) | failed DarthTon v2 | failed mrexodia (horspool) | failed

mem::simd_scanner                |   1392954874 cycles =  0.324 cycles/byte |  1.00x | 0 failed
mem::boyer_moore_scanner         |   5500633306 cycles =  1.281 cycles/byte |  3.95x | 0 failed
TBS                              |  12819598333 cycles =  2.985 cycles/byte |  9.20x | 0 failed
DarthTon                         |  16625418180 cycles =  3.871 cycles/byte | 11.94x | 0 failed
CFX                              |  23304105672 cycles =  5.426 cycles/byte | 16.73x | 0 failed
Simple                           |  26200040356 cycles =  6.100 cycles/byte | 18.81x | 0 failed
Forza (Boyer-Moore Variant)      |   1790831487 cycles =  0.417 cycles/byte |  1.29x | 201 failed
DarthTon v2                      |   2512308412 cycles =  0.585 cycles/byte |  1.80x | 73 failed
mrexodia (horspool)              |   4017768088 cycles =  0.935 cycles/byte |  2.88x | 197 failed

with the overhead removed, ranking third in the bench

pinwhell avatar Apr 22 '24 09:04 pinwhell

mem::simd_scanner                |   1522196385 cycles =  0.354 cycles/byte |  1.00x | 0 failed
TBS                              |   2060006780 cycles =  0.480 cycles/byte |  1.35x | 0 failed
mem::boyer_moore_scanner         |   5175241926 cycles =  1.205 cycles/byte |  3.40x | 0 failed
DarthTon                         |  15968648610 cycles =  3.718 cycles/byte | 10.49x | 0 failed
Simple                           |  17578840694 cycles =  4.093 cycles/byte | 11.55x | 0 failed
CFX                              |  21821228217 cycles =  5.081 cycles/byte | 14.34x | 0 failed
Forza (Boyer-Moore Variant)      |   1888456444 cycles =  0.440 cycles/byte |  1.24x | 215 failed
DarthTon v2                      |   2197879257 cycles =  0.512 cycles/byte |  1.44x | 73 failed
mrexodia (horspool)              |   3326120906 cycles =  0.774 cycles/byte |  2.19x | 207 failed
Running TBS
Finding pattern 0 x 1024 took 51.185 ms.
Finding pattern 1 x 1024 took 49.249 ms.
Done.

@0x1F9F1 Take a look

pinwhell avatar Apr 30 '24 14:04 pinwhell

mem::simd_scanner                |   1522196385 cycles =  0.354 cycles/byte |  1.00x | 0 failed
TBS                              |   2060006780 cycles =  0.480 cycles/byte |  1.35x | 0 failed
mem::boyer_moore_scanner         |   5175241926 cycles =  1.205 cycles/byte |  3.40x | 0 failed
DarthTon                         |  15968648610 cycles =  3.718 cycles/byte | 10.49x | 0 failed
Simple                           |  17578840694 cycles =  4.093 cycles/byte | 11.55x | 0 failed
CFX                              |  21821228217 cycles =  5.081 cycles/byte | 14.34x | 0 failed
Forza (Boyer-Moore Variant)      |   1888456444 cycles =  0.440 cycles/byte |  1.24x | 215 failed
DarthTon v2                      |   2197879257 cycles =  0.512 cycles/byte |  1.44x | 73 failed
mrexodia (horspool)              |   3326120906 cycles =  0.774 cycles/byte |  2.19x | 207 failed
Running TBS
Finding pattern 0 x 1024 took 51.185 ms.
Finding pattern 1 x 1024 took 49.249 ms.
Done.

@0x1F9F1 Take a look

Looking a lot better :) Let me know when you are finished tweaking it and ready to merge.

0x1F9F1 avatar May 01 '24 19:05 0x1F9F1

ready to merge.

Finals Revisions Completed, Ready to merge...

pinwhell avatar May 02 '24 02:05 pinwhell