logstash-filter-grok

Significant Performance Regression using Jruby Timeouts

Open · yaauie opened this issue Oct 25 '19 · 8 comments

I've run a number of performance tests to measure raw per-worker throughput and determine how different variables affect grok performance. The changes to timeouts clearly have an outsized impact on grok filters that fall through many patterns before finding a match, which makes sense: the plugin initiates a timeout on each event for each pattern until we find a match.

Test                      6.7.1   7.4.1 JEE   7.4.1 REE
grok-single-pattern       97k     65k         57k
grok-single-pattern-NT    96k     105k        98k
grok-52-fallthrough       42k     15k         17k
grok-52-fallthrough-NT    42k     71k         68k

Caveat: In the above, NT indicates "No Timeout" (timeout_millis => 0), which we do NOT recommend for production, since a single pathological input can tie up CPU resources essentially indefinitely.
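For reference, a minimal sketch of what the "fall-through" plus "NT" configuration above means in pipeline terms (the patterns and field name here are placeholders, not the actual benchmark configuration):

```
filter {
  grok {
    # fall-through: grok tries each pattern in order until one matches,
    # so events matching only the last pattern (or none) pay for every failed attempt
    match => {
      "message" => [
        "%{COMBINEDAPACHELOG}",
        "%{SYSLOGLINE}",
        "%{GREEDYDATA:raw}"
      ]
    }
    # NT variant: timeout_millis => 0 disables the match timeout entirely
    # (not recommended for production, per the caveat above)
    timeout_millis => 0
  }
}
```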

I will follow up with methodology for the above, and will see if I can get a better apples-to-apples comparison with 6.7.1 using the same version of this plugin as ships with 7.x.

yaauie · Oct 25 '19

@yaauie interesting. In #147 I did not see such a dramatic performance regression for the single-pattern use case; also, I did not test multiple fall-through patterns which, as you point out, will multiply any performance regression. I wonder if using different LS versions also affects the performance tests. Did you use the same LS version for all tests with the different plugin versions?

I wonder if it would make sense to have a single global timeout for the whole fall-through list? (It could be configurable to keep BWC.)

colinsurprenant · Oct 31 '19

Meta issue created in elastic/logstash#11302

colinsurprenant · Nov 12 '19

With https://github.com/logstash-plugins/logstash-filter-grok/pull/153 merged and available since >= 4.2.0, the fall-through case should be better when using timeout_scope => "event". For the single-pattern case the reported numbers still look like a considerable regression, but we have yet to see how much that matters in real-world usage.
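As a rough sketch of the event-scoped timeout (again with placeholder patterns), the option is set per filter and makes a single timeout budget cover the whole fall-through list instead of restarting the timeout for every attempted pattern:

```
filter {
  grok {
    match => {
      "message" => [
        "%{COMBINEDAPACHELOG}",
        "%{SYSLOGLINE}",
        "%{GREEDYDATA:raw}"
      ]
    }
    # one timeout covering all pattern attempts for the event,
    # rather than a separate timeout per attempted pattern
    timeout_scope => "event"
  }
}
```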

The options we have left (if more performance is needed) are the following:

  • if contention is the main issue -> a custom timeout thread (or executor) per worker
  • a scheduled thread pool executor without blocking (again, assuming contention); there's an option e.g. in Netty that is worth trying, although it might be hard to extract out
  • reverting to a single timeout thread, but this time with proper (jruby-ext) thread polling
  • implementing a native timeout in Joni itself (the best option, but requires work on Joni's instructions)

kares · Nov 18 '19

We have run into this performance regression after upgrading from 6.7.1 to 7.4.2. Given the same configuration, the update to 7.4.2 approximately doubled our event latency and reduced our emitted events by around 40%: [image]

There was no way to improve the performance of our pipelines after the upgrade by changing the JVM heap size, pipeline workers, or batch size. It seems the drop in throughput was in large part due to a bottleneck in the grok filter. With the same settings as before the upgrade, we were able to recover some of our pre-upgrade throughput by adding timeout_millis => 0 to all grok invocations in our pipeline definitions: [image]

We tested timeout_scope => "event" (instead of timeout_millis => 0) on some other machines and it had a very slight positive impact on overall performance: [image]

percygrunwald · Dec 21 '19

Hey Percy, thanks for the data -> helpful to know timeout_scope => event did not help your case. We are considering options for tackling this without reverting to the ('problematic' but fast) previous timeout implementation. It will take us a while to have something shippable; we might need folks to do a 'beta' test when ready, since real-world scenarios such as yours might reflect performance changes more accurately.

kares · Jan 06 '20

@kares thanks for the follow-up, we appreciate your work on this. Our setup is holding steady for now with just the addition of timeout_millis => 0 to all our grok filters, but we would prefer a more permanent solution that doesn't require us to add this to every invocation of the filter.

percygrunwald · Jan 07 '20

+1 on timeout_millis => 0

abdalians · Jan 30 '20

@kares has there been any progress on this case? We still have a code base full of timeout_millis => 0 as a mitigation attempt for this.

percygrunwald · Jul 22 '20