[FR] 90% speed up by refactoring and optimizing some code

Open genivia-inc opened this issue 1 year ago • 1 comments

ugrep can run faster if the search logic is refactored: the large code block in advance() can be broken up into separate functions that are called more directly, e.g. via a switch that skips chains of conditionals. Breaking up this large function also helps the compiler a lot, because it can optimize several small functions better than one large function body that it has to analyze as a whole.
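To illustrate the idea, here is a minimal sketch, not ugrep's actual code. It assumes a literal pattern and uses hypothetical names (`Strategy`, `choose`, `find_single`, `find_short`, `find_general`); only `advance` is a name taken from the description above, and its signature here is made up. The strategy is chosen once per pattern, so the hot path is a single switch into a small specialized function.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string_view>

// Hypothetical strategies; a real matcher would have more, and different, cases.
enum class Strategy { SingleChar, ShortLiteral, General };

// Decide once per pattern, outside the hot loop.
Strategy choose(std::string_view p) {
    if (p.size() == 1) return Strategy::SingleChar;
    if (p.size() >= 2 && p.size() <= 4) return Strategy::ShortLiteral;
    return Strategy::General;
}

// Each specialization is a small function the compiler can optimize on its own.
static const char* find_single(const char* s, const char* e, std::string_view p) {
    return static_cast<const char*>(std::memchr(s, p[0], e - s));
}

static const char* find_short(const char* s, const char* e, std::string_view p) {
    const std::size_t n = p.size();
    while (static_cast<std::size_t>(e - s) >= n) {
        // Look for the first byte only where a full match still fits.
        s = static_cast<const char*>(std::memchr(s, p[0], (e - s) - (n - 1)));
        if (s == nullptr) return nullptr;
        if (std::memcmp(s, p.data(), n) == 0) return s;
        ++s;
    }
    return nullptr;
}

static const char* find_general(const char* s, const char* e, std::string_view p) {
    if (p.empty()) return s;
    const char* hit = std::search(s, e, p.begin(), p.end());
    return hit == e ? nullptr : hit;
}

// One switch replaces the chain of conditionals inside a single large function.
// Returns a pointer to the first match in [s, e), or nullptr if there is none.
const char* advance(Strategy st, const char* s, const char* e, std::string_view p) {
    switch (st) {
        case Strategy::SingleChar:   return find_single(s, e, p);
        case Strategy::ShortLiteral: return find_short(s, e, p);
        case Strategy::General:      return find_general(s, e, p);
    }
    return nullptr;
}
```

A caller would compute `choose(pattern)` once and then pass the result to `advance` for every buffer it scans. Whether this split pays off depends on the compiler and target, so it is worth measuring.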

A bit of experimentation shows significant speed improvements are attainable on ARM64 NEON at least. So it is worth the effort to refactor this code that is not fully optimized by the compiler.

Even adding a dummy printf() statement makes the code run faster (!) despite the overhead of IO. So yeah, compiler optimizations aren't kicking in as much as I want them to at the moment. On a more serious note, this is not new to me. I taught graduate-level high-performance computing for several years. I will follow (my own) advice more closely in the next release cycles. It's just work, not difficult to do.

With these optimizations, and by omitting line counting when possible (such as for option -c), searching a 13GB file can go from

$ time ugrep -c rol en.txt
1171415
        4.54 real         2.86 user         1.40 sys

to a much lower timing

$ time ugrep -c rol en.txt
1171415
        2.40 real         0.83 user         1.39 sys

which runs about 90% faster on AArch64/NEON (4.54s versus 2.40s real time). Other search options will benefit from a speedup of anywhere between 20% and 100% on AArch64/NEON. Because these changes improve the compiler's register allocation, instruction scheduling and alias analysis, I expect they will also speed up searching with SSE2/AVX2. A quick test confirms this: the same runs on Intel macOS give a 15% speedup, and a 90% speedup when searching for the word "the".
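For reference, the "90% faster" figure follows from the real (wall-clock) times in the two runs above. This is just a worked check of those numbers, with the ratios rounded to two decimals:

```latex
\[
\text{speedup}_{\text{real}} = \frac{4.54}{2.40} \approx 1.89
\quad\Rightarrow\quad \text{about } 89\% \text{ faster},
\qquad
\text{speedup}_{\text{user}} = \frac{2.86}{0.83} \approx 3.45
\]
```

The sys time is essentially unchanged (1.40 versus 1.39), so the gain comes from the user-mode search code.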

Now I have to find time to work on this. Stay tuned!

genivia-inc avatar Apr 24 '24 20:04 genivia-inc

OK, implemented and mostly tested over the weekend. Still some work to do. The executable is not larger, but faster. This update will be a lot faster on ARM devices that support NEON and AArch64.

  • updated SIMD algorithms
  • improved selection and specialization based on pattern characteristics
  • faster line counting, which is now super fast on NEON/AArch64 in particular thanks to new vector code that I came up with, including a fast alternative to vaddvq_s8 for horizontal vector addition on NEON
  • fixed an obscure pattern match bug that I found today while testing with a large generative test set I wrote some time ago to hit ugrep hard (that's how I found a bug in rg, which I mention in one of my articles)

All should be ready by next week to release 6.0.

genivia-inc avatar Apr 29 '24 19:04 genivia-inc

The ugrep 6.0 benchmarks are already posted: https://github.com/Genivia/ugrep-benchmarks

This shows that ugrep is one of the fastest grep tools. Please note that no grep can (or should) claim to always be the fastest, because the different algorithms involved each have their pros and cons.

Ugrep 6.0 will be released soon!

genivia-inc avatar May 06 '24 20:05 genivia-inc