drop_fields processor is surprisingly slow
Hi,
We're parsing some JSON-formatted high-volume logs and noticed that we could only process up to 8000 documents per second per filebeat. I tried to narrow down what the bottleneck is. By just reading the file and parsing it as JSON, I could get up to 20k documents per second. Then I commented back in our list of processors that rename some fields, drop some unnecessary fields, etc.
I found that the slowest processor seems to be drop_fields. Dropping the first 16 fields alone takes it from 20k/s to 12k/s, and the other, conditional one makes it drop further to 8k/s.
Sample filebeat config:
filebeat:
  inputs:
    - type: log
      paths:
        - /var/log/proxy.requests*.log*
      tags: [ "jsonformat", "proxy-requests" ]
      close_timeout: 2h
      json:
        keys_under_root: true
        overwrite_keys: true

# note our config uses filebeat.autodiscover but it doesn't seem to impact performance.

processors:
  # from 20k/s to 13k/s
  - drop_fields:
      fields:
        - offset
        - cloud.image.id
        - cloud.account.id
        - docker.container.command
        - docker.container.labels.co.elastic.vcs-branch
        - docker.container.labels.co.elastic.ci.worker
        - docker.container.labels.author
        - docker.container.labels.org.label-schema.build-date
        - docker.container.labels.org.label-schema.license
        - docker.container.labels.org.label-schema.name
        - docker.container.labels.org.label-schema.vendor
        - docker.container.labels.org.label-schema.schema-version
        - docker.container.labels.org.label-schema.url
        - docker.container.labels.org.label-schema.usage
        - docker.container.labels.org.label-schema.vcs-url
        - docker.container.labels.org.opencontainers.image
  # from 13k/s to 8k/s
  - drop_fields:
      when:
        and:
          - contains:
              container.name: "proxy"
          - not.equals:
              docker.container.name: "rollup"
      fields:
        - host
        - beat.hostname
        - docker.container.labels
        - docker.container.name
        - docker.container.id
        - meta.cloud.availability_zone
        - meta.cloud.instance_id
        - meta.cloud.provider
        - meta.cloud.region
        - metadata_architecture
        - metadata_availabilityZone
        - metadata_billingProducts
        - metadata_devpayProductCodes
        - metadata_imageId
        - metadata_instanceId
        - metadata_instanceType
        - metadata_kernelId
        - metadata_marketplaceProductCodes
        - metadata_pendingTime
        - metadata_ramdiskId
        - metadata_region
        - metadata_version
        - availability_zones
        - capacity
        - instance_count
        - instance_capacity

output:
  file:
    path: "/dev/"
    filename: "null"
I was watching the number of events per second while commenting out each processor, and each time the performance increased.
Turning on debug logging, I see some failures to remove fields that don't exist (some of them are optional, and some have been renamed in later versions), but it shouldn't be that expensive to check if a field exists?
2022-03-14T21:08:20.235Z DEBUG [processors] processing/processors.go:128 Fail to apply processor global{drop_fields={"Fields":["offset","cloud.image.id","cloud.account.id","docker.container.command","docker.container.labels.co.elastic.vcs-branch","docker.container.labels.co.elastic.ci.worker","docker.container.labels.author","docker.container.labels.org.label-schema.build-date","docker.container.labels.org.label-schema.license","docker.container.labels.org.label-schema.name","docker.container.labels.org.label-schema.vendor","docker.container.labels.org.label-schema.schema-version","docker.container.labels.org.label-schema.url","docker.container.labels.org.label-schema.usage","docker.container.labels.org.label-schema.vcs-url","docker.container.labels.org.opencontainers.image"],"IgnoreMissing":false}, drop_fields={"Fields":["host","beat.hostname","docker.container.labels","docker.container.name","docker.container.id","meta.cloud.availability_zone","meta.cloud.instance_id","meta.cloud.provider","meta.cloud.region","metadata_architecture","metadata_availabilityZone","metadata_billingProducts","metadata_devpayProductCodes","metadata_imageId","metadata_instanceId","metadata_instanceType","metadata_kernelId","metadata_marketplaceProductCodes","metadata_pendingTime","metadata_ramdiskId","metadata_region","metadata_version","availability_zones","capacity","instance_count","instance_capacity"],"IgnoreMissing":false}, condition=contains: map[] and !equals: map[docker.container.name:0x1e39120]}: failed to drop field [offset]: key not found; failed to drop field [cloud.image.id]: key not found; failed to drop field [cloud.account.id]: key not found; failed to drop field [docker.container.command]: key not found; failed to drop field [docker.container.labels.co.elastic.vcs-branch]: key not found; failed to drop field [docker.container.labels.co.elastic.ci.worker]: key not found; failed to drop field [docker.container.labels.author]: key not found; failed to drop field [docker.container.labels.org.label-schema.build-date]: key not found; failed to drop field [docker.container.labels.org.label-schema.license]: key not found; failed to drop field [docker.container.labels.org.label-schema.name]: key not found; failed to drop field [docker.container.labels.org.label-schema.vendor]: key not found; failed to drop field [docker.container.labels.org.label-schema.schema-version]: key not found; failed to drop field [docker.container.labels.org.label-schema.url]: key not found; failed to drop field [docker.container.labels.org.label-schema.usage]: key not found; failed to drop field [docker.container.labels.org.label-schema.vcs-url]: key not found; failed to drop field [docker.container.labels.org.opencontainers.image]: key not found
(Tested on 7.12, but also tried 8.1.0 with no improvement in performance)
This hacky workaround seems to help somewhat (from 8k/s to 13k/s):
- rename:
    fields:
      - from: metadata_architecture
        to: drop.metadata_architecture
      - from: metadata_availabilityZone
        to: drop.metadata_availabilityZone
      # etc
- drop_fields:
    fields:
      - drop
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Can you start filebeat with profiling enabled and capture a CPU profile (ideally more than one) with the slow processors running? That will tell us where filebeat is spending its time.
You can start the profiling server with the --httpprof command line option (command documentation) or the configuration file (http endpoint documentation). You need to set at least http.enabled and http.pprof.enabled to true if using the configuration file.
Once you've started filebeat with profiling enabled, you can capture 30-second CPU profiles with curl $HOST:$PORT/debug/pprof/profile -o profile.prof or equivalent.
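If you go the config-file route, a minimal filebeat.yml snippet would look something like this (the host and port shown here are just the defaults for the HTTP endpoint; adjust as needed):

http:
  enabled: true
  host: localhost
  port: 5066
  pprof:
    enabled: true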
Attached 3 profiles + a heap profile: profile.zip. These are from filebeat 8.1.0.
Awesome, thanks! All three CPU profiles look similar, which is good. Here's a flamegraph from the first profile:

I see a bunch of time being spent formatting error messages in the dropFields processors (errors.Wrapf) that lines up with what you've observed above.
This is the code path contributing to a large part of the slowness here:
https://github.com/elastic/beats/blob/4a71e5e7eb8faa711785aee434a2aadd248439e0/libbeat/processors/actions/drop_fields.go#L68-L81
We are building a slice of errors, so the more errors there are, the more expensive it becomes: we both incur calls to errors.Wrapf and need to reallocate the slice as it grows (the backing array roughly doubles in capacity each time it has to be reallocated).
Ha, just noticed it now supports ignore_missing, which was added in 7.4. We probably added the processor before it was an option. That should make it a lot cheaper.
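Concretely, that should just be something like this for the first processor (same field list as above, abbreviated here):

- drop_fields:
    ignore_missing: true
    fields:
      - offset
      - cloud.image.id
      # ... rest of the fields as before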
Yes, I was just about to suggest that. If it is still slow, try to capture another profile if it isn't too much work; the problem is much easier to find that way. If you can't, it will likely just be the next slowest path in the current set of profiles anyway.
Went from 10k docs/s to 20k/s! For reference, I also ran without any processors, which reaches 30k/s. profile2.zip
I suppose Go/the processors don't have something along the lines of if log.isDebugEnabled { do_error_reporting } to avoid constructing the errors if they're never logged?
No, processor error handling can definitely be improved.
Interestingly, in the profile2.zip you attached, the profile-no-processors profile still has processors running, though a different set than the profile-ignore-missing profile.
Here is the processing flame graph of the "no processors" case:
Here is the processing flame graph of the "ignore missing" processors case:

The profiles are largely the same outside of the processing call stack, so I've omitted those branches.
The rename and drop_fields processors now spend most of their time in https://github.com/elastic/beats/blob/4bc4d19c28851306086e7f3abbe45b389d740c37/libbeat/common/mapstr.go#L436-L448
I also see the add dynamic fields metadata processor call stack spending the majority of its time in https://github.com/elastic/beats/blob/4bc4d19c28851306086e7f3abbe45b389d740c37/libbeat/common/mapstr.go#L150-L152
The improvements to make here aren't as obvious.
Hmm, @henrikno what does your beat.stats.libbeat.output.events.dropped show?
@cmacknz perhaps the Agent dashboard should include the dropped events as well
Hi! We just realized that we haven't looked into this issue in a while. We're sorry!
We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!
ahh :(
We have reproduced something similar to this in our internal benchmarks; no fix yet, unfortunately.