drop_fields processor is surprisingly slow
Hi,
We're parsing some JSON-formatted high-volume logs and noticed that we could only process up to 8000 documents per second per filebeat. I tried to narrow down what the bottleneck is. By just reading the file and parsing it as JSON, I could get up to 20k documents per second. Then I commented back in our list of processors that rename some fields, drop some unnecessary fields, etc.
I found that the slowest processor seems to be drop_fields. Dropping the first 16 fields alone takes it from 20k/s to 12k/s, and the other, conditional one makes it drop further to 8k/s.
Sample filebeat config:
filebeat:
  inputs:
    - type: log
      paths:
        - /var/log/proxy.requests*.log*
      tags: [ "jsonformat", "proxy-requests" ]
      close_timeout: 2h
      json:
        keys_under_root: true
        overwrite_keys: true

# note our config uses filebeat.autodiscover but it doesn't seem to impact performance.

processors:
  # from 20k/s to 13k/s
  - drop_fields:
      fields:
        - offset
        - cloud.image.id
        - cloud.account.id
        - docker.container.command
        - docker.container.labels.co.elastic.vcs-branch
        - docker.container.labels.co.elastic.ci.worker
        - docker.container.labels.author
        - docker.container.labels.org.label-schema.build-date
        - docker.container.labels.org.label-schema.license
        - docker.container.labels.org.label-schema.name
        - docker.container.labels.org.label-schema.vendor
        - docker.container.labels.org.label-schema.schema-version
        - docker.container.labels.org.label-schema.url
        - docker.container.labels.org.label-schema.usage
        - docker.container.labels.org.label-schema.vcs-url
        - docker.container.labels.org.opencontainers.image
  # from 13k/s to 8k/s
  - drop_fields:
      when:
        and:
          - contains:
              container.name: "proxy"
          - not.equals:
              docker.container.name: "rollup"
      fields:
        - host
        - beat.hostname
        - docker.container.labels
        - docker.container.name
        - docker.container.id
        - meta.cloud.availability_zone
        - meta.cloud.instance_id
        - meta.cloud.provider
        - meta.cloud.region
        - metadata_architecture
        - metadata_availabilityZone
        - metadata_billingProducts
        - metadata_devpayProductCodes
        - metadata_imageId
        - metadata_instanceId
        - metadata_instanceType
        - metadata_kernelId
        - metadata_marketplaceProductCodes
        - metadata_pendingTime
        - metadata_ramdiskId
        - metadata_region
        - metadata_version
        - availability_zones
        - capacity
        - instance_count
        - instance_capacity

output:
  file:
    path: "/dev/"
    filename: "null"
I was watching the number of events per second while commenting out each processor, and each time the performance increased.
Turning on debug logging, I see some failures to remove fields that don't exist (some of them are optional, and some have been renamed in later versions), but it shouldn't be that expensive to check if a field exists?
2022-03-14T21:08:20.235Z DEBUG [processors] processing/processors.go:128 Fail to apply processor global{drop_fields={"Fields":["offset","cloud.image.id","cloud.account.id","docker.container.command","docker.container.labels.co.elastic.vcs-branch","docker.container.labels.co.elastic.ci.worker","docker.container.labels.author","docker.container.labels.org.label-schema.build-date","docker.container.labels.org.label-schema.license","docker.container.labels.org.label-schema.name","docker.container.labels.org.label-schema.vendor","docker.container.labels.org.label-schema.schema-version","docker.container.labels.org.label-schema.url","docker.container.labels.org.label-schema.usage","docker.container.labels.org.label-schema.vcs-url","docker.container.labels.org.opencontainers.image"],"IgnoreMissing":false}, drop_fields={"Fields":["host","beat.hostname","docker.container.labels","docker.container.name","docker.container.id","meta.cloud.availability_zone","meta.cloud.instance_id","meta.cloud.provider","meta.cloud.region","metadata_architecture","metadata_availabilityZone","metadata_billingProducts","metadata_devpayProductCodes","metadata_imageId","metadata_instanceId","metadata_instanceType","metadata_kernelId","metadata_marketplaceProductCodes","metadata_pendingTime","metadata_ramdiskId","metadata_region","metadata_version","availability_zones","capacity","instance_count","instance_capacity"],"IgnoreMissing":false}, condition=contains: map[] and !equals: map[docker.container.name:0x1e39120]}: failed to drop field [offset]: key not found; failed to drop field [cloud.image.id]: key not found; failed to drop field [cloud.account.id]: key not found; failed to drop field [docker.container.command]: key not found; failed to drop field [docker.container.labels.co.elastic.vcs-branch]: key not found; failed to drop field [docker.container.labels.co.elastic.ci.worker]: key not found; failed to drop field [docker.container.labels.author]: key not found; failed to drop field [docker.container.labels.org.label-schema.build-date]: key not found; failed to drop field [docker.container.labels.org.label-schema.license]: key not found; failed to drop field [docker.container.labels.org.label-schema.name]: key not found; failed to drop field [docker.container.labels.org.label-schema.vendor]: key not found; failed to drop field [docker.container.labels.org.label-schema.schema-version]: key not found; failed to drop field [docker.container.labels.org.label-schema.url]: key not found; failed to drop field [docker.container.labels.org.label-schema.usage]: key not found; failed to drop field [docker.container.labels.org.label-schema.vcs-url]: key not found; failed to drop field [docker.container.labels.org.opencontainers.image]: key not found
(Tested on 7.12, but also tried 8.1.0 with no improvement in performance)
This hacky workaround seems to help somewhat (from 8k/s to 13k/s):
- rename:
    fields:
      - from: metadata_architecture
        to: drop.metadata_architecture
      - from: metadata_availabilityZone
        to: drop.metadata_availabilityZone
      # etc
- drop_fields:
    fields:
      - drop
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Can you start filebeat with profiling enabled and capture a CPU profile (ideally more than one) with the slow processors running? That will tell us where filebeat is spending its time.
You can start the profiling server with the --httpprof command line option (command documentation) or the configuration file (http endpoint documentation). You need to set at least http.enabled and http.pprof.enabled to true if using the configuration file.
Once you've started filebeat with profiling enabled, you can capture 30-second CPU profiles with curl $HOST:$PORT/debug/pprof/profile -o profile.prof or equivalent.
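If you go the config-file route, a minimal filebeat.yml snippet would look something like this (the host and port shown here are just the defaults for the HTTP endpoint; adjust as needed):

http:
  enabled: true
  host: localhost
  port: 5066
  pprof:
    enabled: true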
Attached 3 profiles + a heap profile: profile.zip. These are from filebeat 8.1.0.
Awesome, thanks! All three CPU profiles look similar, which is good. Here's a flamegraph from the first profile:

I see a bunch of time being spent formatting error messages in the dropFields processors (errors.Wrapf) that lines up with what you've observed above.
This is the code path contributing to a large part of the slowness here:
https://github.com/elastic/beats/blob/4a71e5e7eb8faa711785aee434a2aadd248439e0/libbeat/processors/actions/drop_fields.go#L68-L81
We are building a slice of errors, so the more errors there are, the more expensive it becomes: we both incur calls to errors.Wrapf and need to reallocate the slice as it grows (the backing array roughly doubles in capacity each time it has to be reallocated).
Ha, just noticed it now supports ignore_missing, which was added in 7.4. We probably added the processor before it was an option. That should make it a lot cheaper.
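Concretely, that should just be something like this for the first processor (same field list as above, abbreviated here):

- drop_fields:
    ignore_missing: true
    fields:
      - offset
      - cloud.image.id
      # ... rest of the fields as before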
Yes, I was just about to suggest that. If it is still slow, try to capture another profile if it isn't too much work; the problem is much easier to find that way. If you can't, it will likely just be the next slowest path in the current set of profiles anyway.
Went from 10k docs/s to 20k/s! For reference, I also ran without any processors, which reaches 30k/s. profile2.zip
I suppose Go/the processors don't have something along the lines of if log.isDebugEnabled { do_error_reporting } to avoid constructing the errors if they're never logged?
No, processor error handling can definitely be improved.
Interestingly, in the profile2.zip you attached, the profile-no-processors profile still has processors running, though a different set than the profile-ignore-missing profile.
Here is the processing flame graph of the "no processors" case:
Here is the processing flame graph of the "ignore missing" processors case:

The profiles are largely the same outside of the processing call stack, so I've omitted those branches.
The rename and drop_fields processors now spend most of their time in https://github.com/elastic/beats/blob/4bc4d19c28851306086e7f3abbe45b389d740c37/libbeat/common/mapstr.go#L436-L448
I also see the add dynamic fields metadata processor call stack spending the majority of its time in https://github.com/elastic/beats/blob/4bc4d19c28851306086e7f3abbe45b389d740c37/libbeat/common/mapstr.go#L150-L152
The improvements to make here aren't as obvious.
Hmm, @henrikno what does your beat.stats.libbeat.output.events.dropped show?
@cmacknz perhaps the Agent dashboard should include the dropped events as well
Hi! We just realized that we haven't looked into this issue in a while. We're sorry!
We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!
ahh :(
We have reproduced something similar to this in our internal benchmarks; no fix yet, unfortunately.