graylog2-server remove_field pipeline function performance regression

Adding support for removing fields with regular expressions (https://github.com/Graylog2/graylog2-server/pull/15131) can have a huge performance impact on messages with many fields.

What used to be a O(1) remove from a hash map, is now looping over every message field O(n) to then perform a regex match.

Example benchmark on my system

Messages with 100 fields. A pipeline rule with 20 remove_field calls. (20 is a realistic value, I've seen setups with 50)

Now:

org.graylog.plugins.pipelineprocessor.processors.PipelineInterpreter.executionTime: mean: 336μs

throughput: 2500 msg/sec

Regex change reverted:

org.graylog.plugins.pipelineprocessor.processors.PipelineInterpreter.executionTime: mean: 5μs throughput: 11000 msg/sec

Expected Behavior

This change should've done by introducing a new pipeline function e.g. remove_fields (plural)

Current Behavior

The current remove_field function creates a performance regression for lots of users, and also changes the behavior unexpectedly: https://github.com/Graylog2/graylog2-server/issues/16047

Possible Solution

Change this is back the way it was is not possible without introducing a breaking change for some users. I suggest we restore the previous behavior (supporting a single field only) and move the new functionality to a new function.

Users that already use the regex match will need to be notified and adopt their rules.

Apr 18 '24 10:04 mpfz0r

The only current workaround I can think of is replacing multiple calls to remove_field with something like rename_field("@metadata_beat", "discard"); rename_field("@metadata_filename", "discard"); rename_field("@metadata_version", "discard"); ... remove_field("discard");

Apr 18 '24 14:04 mpfz0r

Given the severity of this bug, are we sure we just want to put this in the backlog?

May 01 '24 10:05 mpfz0r

As a severe performance issue this should probably be penned into 6.0.1

May 01 '24 11:05 tellistone

Introducing a new function for regex-removal also resolves https://github.com/Graylog2/graylog2-server/issues/16047. That issue was due to field names being treated as regex patterns when they were not intended as such.

May 03 '24 08:05 patrickmann

@mpfz0r There is definitely a slowdown between regex and not. But I don't see the same magnitude of performance impact:

20 remove_field calls without regex (for a message of 50 fields): mean 115μs
20 calls with regex: mean 550μs

This is based on the same PipelineInterpreter.executionTime metric; and is consistent with the 5x reduction in throughput you reported.

For the purpose of assessing the severity of the issue and validity of the fix, I'm curious how you achieved the 5μs value?

May 03 '24 16:05 patrickmann

graylog2-server graylog2-server copied to clipboard

remove_field pipeline function performance regression

Example benchmark on my system

Now:

Regex change reverted:

Expected Behavior

Current Behavior

Possible Solution

graylog2-server
graylog2-server copied to clipboard