
filebeat: make GZIP GA and add `compression` config

Open · AndersonQ opened this pull request 3 weeks ago • 17 comments

Proposed commit message

filebeat: Promote filestream GZIP support to GA

Promotes GZIP support in the filestream input from beta to General Availability.

Deprecates the `gzip_experimental` option in favour of the new `compression`
setting. Valid values:
- `""`: No compression (default).
- `"gzip"`: Treat all files as GZIP.
- `"auto"`: Auto-detect based on magic bytes.

Note: GZIP decoding requires `fingerprint` file identity for accurate offset 
tracking. A warning is now logged if compression is enabled with a file identity
other than `fingerprint`.
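
For illustration, a config enabling auto-detection together with the required `fingerprint` file identity might look like this (the `id` and `paths` are placeholders, not values from this PR):

```yaml
filebeat.inputs:
  - type: filestream
    id: gzip-logs
    paths:
      - /var/log/app/*.log*
    file_identity.fingerprint: ~
    compression: auto
```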

Unit and integration tests have been updated to reflect these changes.

AI tools used: Cursor.

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • ~~[ ] I have made corresponding changes to the documentation~~
  • ~~[ ] I have made corresponding change to the default configuration files~~
  • [x] I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool.

Disruptive User Impact

  • `gzip_experimental` has been removed. If `gzip_experimental` is configured, it is ignored and a warning directing users to use `compression` is logged.

How to test this PR locally

To verify that a non-fingerprint file identity combined with GZIP logs a warning:

  • create a config with file_identity.native:
filebeat.inputs:
  - type: filestream
    id: test
    paths:
      - /tmp/*.log
    file_identity.native: ~
    compression: auto
path.home: /tmp/beats/home
output.file:
  enabled: true
  path: /tmp/beats/home/out
  filename: "output"
logging.level: debug
  • run filebeat
go run . --strict.perms=false -e -c ./filebeat.yml 2>&1 | grep message
  • check that it starts and logs the warning
{"log.level":"warn","@timestamp":"2025-12-12T09:32:06.622+0100","log.logger":"input","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/filestream.config.checkUnsupportedParams","file.name":"filestream/config.go","file.line":257},"message":"compression='auto' requires file_identity to be 'fingerprint'","service.name":"filebeat","ecs.version":"1.6.0"}

To verify it works with `compression: auto`:

  • generate a gzip log file:
mkdir -p /tmp/beats/in
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
gzip /tmp/beats/in/log.ndjson
# generate another "active" log file
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
  • use the following config file. Adjust as you like
http:
  enabled: true

path.home: /tmp/beats/home
filebeat.inputs:
    - type: filestream
      id: gzip-input
      enabled: true
      paths:
        - /tmp/beats/in/log.ndjson*
      compression: auto

output.file:
  path: /tmp/beats/home
  filename: "output-file"
logging.level: debug
logging.metrics:
  level: debug
  • run filebeat
go run . --strict.perms=false -e -c ./filebeat.yml 2>&1 | grep message
  • output file has 200 lines
wc -l /tmp/beats/home/out/*
200

To verify `compression: gzip` does not ingest a plain file:

  • generate a gzip log file:
mkdir -p /tmp/beats/in
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
gzip /tmp/beats/in/log.ndjson
# generate another "active" log file
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
  • use the following config file. Adjust as you like
http:
  enabled: true

path.home: /tmp/beats/home
filebeat.inputs:
    - type: filestream
      id: gzip-input
      enabled: true
      paths:
        - /tmp/beats/in/log.ndjson*
      compression: gzip

output.file:
  path: /tmp/beats/home
  filename: "output-file"
logging.level: debug
logging.metrics:
  level: debug
  • run filebeat
go run . --strict.perms=false -e -c ./filebeat.yml 
  • find the log:
{"log.level":"warn","@timestamp":"2025-12-12T09:40:47.453+0100","log.logger":"input.filestream.scanner","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/filestream.(*fileScanner).GetFiles","file.name":"filestream/fswatch.go","file.line":511},"message":"cannot create a file descriptor for an ingest target \"/tmp/beats/in/log.ndjson\": failed to create gzip seeker: could not create gzip reader: gzip: invalid header","service.name":"filebeat","id":"gzip-input","ecs.version":"1.6.0"}
  • the output file has 100 lines, from the gzip file only
wc -l /tmp/beats/home/out/*
100

To verify compression: "" ingest gzip file as a plain file:

  • generate a gzip log file:
mkdir -p /tmp/beats/in
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
gzip /tmp/beats/in/log.ndjson
# generate another "active" log file
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
  • use the following config file. Adjust as you like
http:
  enabled: true

path.home: /tmp/beats/home
filebeat.inputs:
    - type: filestream
      id: gzip-input
      enabled: true
      paths:
        - /tmp/beats/in/log.ndjson*
      compression: ""

output.file:
  path: /tmp/beats/home
  filename: "output-file"
logging.level: debug
logging.metrics:
  level: debug
  • run filebeat
go run . --strict.perms=false -e -c ./filebeat.yml 
  • the output file has 120 lines: 100 from the plain file plus 20 garbage lines from the gzip file
wc -l /tmp/beats/home/out/*
120
  • check the data ingested from the gzip file is garbage:
grep "/tmp/beats/in/log.ndjson.gz" /tmp/beats/home/out/* | tail -n 1

{"@timestamp":"2025-12-12T08:43:28.335Z","@metadata":{"beat":"filebeat","type":"_doc","version":"9.3.0"},"ecs":{"version":"8.0.0"},"host":{"name":"mokona-elastic"},"agent":{"version":"9.3.0","ephemeral_id":"6a649a03-1cdf-4e41-80d0-075929ce8542","id":"032785ed-d7ef-491e-913e-f92063b05844","name":"mokona-elastic","type":"filebeat"},"log":{"offset":5270,"file":{"path":"/tmp/beats/in/log.ndjson.gz","device_id":"64513","inode":"43516198","fingerprint":"3b29db3923a6ebd5f44bf71e437f57d9676bfd8a1bc6ae41cbfbd0f954da1863"}},"message":"\u0013\ufffd˙4_\ufffd\ufffd\ufffd\ufffd>\ufffdy\u0008\ufffd*V\ufffdM\ufffd;=\u0002u=$\ufffd\u000e\ufffd\u0001\u0008\u0003\ufffdJ\ufffd\ufffd\ufffd#\ufffdǎ΃\ufffd\ufffd\u001c8\ufffd\ufffd\u0006K\ufffd!n\u000f\ufffd2[\ufffd\ufffd\ufffd4O\ufffd\r\ufffdw$͒d\ufffd\ufffd%\ufffd]{\ufffd\ufffd\ufffd^\u0011\ufffd\u001d\ufffd\ufffd\ufffd\ufffdU\u000b\u0002D\u0018d\ufffd.\ufffdNs\u0013\ufffd1b?\u000eB\u001e\ufffd\ufffd<\ufffd7\ufffdr\ufffdwyS\ufffd\t\u0001\ufffd[\ufffdA\u0004%\ufffdi\ufffd\ufffd&lmRxj\u0010\ufffd@\u001f\u0007\ufffd\u0011,\ufffdy\ufffd\ufffdQ\ufffd\ufffd&\ufffd\ufffdX>\ufffd\ufffd\ufffd <\ufffdǹ\ufffd\ufffd\ufffdw<\u001a\u0000OW$!dp\ufffd\ufffdh\ufffd\ufffd\ufffd8{\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd@]U7Pl\ufffd\ufffdz\u000fS\ufffd,H\u0017\ufffd","input":{"type":"filestream"}}
  • 20 lines from the gzip file
grep "/tmp/beats/in/log.ndjson.gz" /tmp/beats/home/out/* | wc -l    
20
  • 100 lines from the plain file
grep '/tmp/beats/in/log.ndjson"' /tmp/beats/home/out/* | wc -l
100

Related issues

  • Closes https://github.com/elastic/beats/issues/47880

AndersonQ avatar Dec 03 '25 15:12 AndersonQ

:robot: GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

github-actions[bot] avatar Dec 03 '25 15:12 github-actions[bot]

This pull request does not have a backport label. If this is a bug or security fix, could you label this PR @AndersonQ? 🙏. For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

mergify[bot] avatar Dec 03 '25 15:12 mergify[bot]

🔍 Preview links for changed docs

github-actions[bot] avatar Dec 04 '25 13:12 github-actions[bot]

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

elasticmachine avatar Dec 05 '25 21:12 elasticmachine

This pull request is now in conflicts. Could you fix it? 🙏 To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b 47880-gzip-default-on-ga upstream/47880-gzip-default-on-ga
git merge upstream/main
git push upstream 47880-gzip-default-on-ga

mergify[bot] avatar Dec 09 '25 14:12 mergify[bot]

Is there a way to avoid the disruptive user behavior by separating gzip support in:

  1. enabled -- return error for file identities other than fingerprint
  2. auto (default) -- automatically detect incompatibility, disable gzip if that's the case
  3. disabled -- same as gzip_disabled: true

or have we decided it's better to enable gzip by default?

I don't think we've discussed an approach like that, and I don't recall any setting on beats working like that. Also, it would not be the first config that isn't compatible with some other. Even though I would not expect it to happen, with an 'auto' option we'd have to keep it up to date if anything changes that might increase or decrease GZIP compatibility.

I don't see a major issue with this idea, but I don't know if there was discussions about this approach before.

@nimarezainia, @cmacknz any thoughts here?

AndersonQ avatar Dec 09 '25 14:12 AndersonQ

  1. enabled -- return error for file identities other than fingerprint
  2. auto (default) -- automatically detect incompatibility, disable gzip if that's the case
  3. disabled -- same as gzip_disabled: true

There are only two states:

  1. You are actively trying to ingest a gzip file, in which case you should use the fingerprint file identity if you don't want the same content ingested twice on rotation. There is no other choice, but I don't think this situation needs to cause filebeat to stop.
  2. You are not ingesting any gzip files, either because there aren't any or you excluded them. Nothing happens, you don't need the fingerprint file identity, and in this case erroring unconditionally because you aren't using fingerprint is incorrect.

So the smartest thing to do is only log about potential data duplication if we actually try to ingest a .gzip file and direct people to either switch to fingerprint or to exclude the gzip files.

cmacknz avatar Dec 09 '25 21:12 cmacknz

So the smartest thing to do is only log about potential data duplication if we actually try to ingest a .gzip file and direct people to either switch to fingerprint or to exclude the gzip files.

We don't need a new config parameter and we definitely can't have filebeat exit unconditionally anytime somebody didn't set gzip_disabled and also isn't ingesting any gzip files.

So if I'm understanding you correctly, the idea is to have gzip always enabled, and if there is a gzip file and the file identity isn't fingerprint, we log a warning and keep going? Is that what you meant?

AndersonQ avatar Dec 10 '25 08:12 AndersonQ

@cmacknz I updated it as you requested

@orestisfl, @colleenmcginnis when you have time, could you please re-review?

AndersonQ avatar Dec 10 '25 14:12 AndersonQ

@cmacknz @AndersonQ does enabling gzip files by default mean that previous users that use wildcards (/path/to/log*) that can now match gzip files will now match /path/to/log.tar.gz as well which means they will start unexpectedly ingesting more files?

orestisfl avatar Dec 10 '25 17:12 orestisfl

I got some feedback from PM (Bill) that we may not want to have this be enabled by default just to minimize any chance of user disruption.

Since in general we want compatibility with filelog, we can follow its approach here, which is covered by the compression parameter: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/filelogreceiver/README.md

compression Indicate the compression format of input files. If set accordingly, files will be read using a reader that uncompresses the file before scanning its content. Options are ``, gzip, or auto. auto auto-detects file compression type. Currently, gzip files are the only compressed files auto-detected, based on ".gz" filename extension. auto option is useful when ingesting a mix of compressed and uncompressed files with the same filelogreceiver.

So I would vote we add the same compression configuration to be exactly compatible. We would default to none/unspecified which treats everything as a plain file. We could add gzip which skips auto-detection and just assumes all files are gzip. We already have the auto mode which also only supports gzip today via IsGZIP.

https://github.com/elastic/beats/blob/92bde5572b436cacaf120fc9ddceec850995bcb9/filebeat/input/filestream/file.go#L220-L223

cmacknz avatar Dec 10 '25 20:12 cmacknz

I got some feedback from PM (Bill) that we may not want to have this be enabled by default just to minimize any chance of user disruption.

ok, makes sense.

So I would vote we add the same compression configuration to be exactly compatible. We would default to none/unspecified which treats everything as a plain file. We could add gzip which skips auto-detection and just assumes all files are gzip. We already have the auto mode which also only supports gzip today via IsGZIP.

Ok, so let me confirm the behaviour:

  • keep the requirement to use fingerprint, if not, error and don't start the input
  • compression: missing/null/empty -> gzip off, every file is plain file
  • compression: gzip: every file is gzip. Error if the file is plain text
  • compression: auto: decompress GZIP, treat plain file as plain file.
  • log input as filestream uses absent compression, everything is a plain file
  • gzip_experimental: deprecated. It sets compression: auto instead and logs a warning saying to use compression and that it'll be removed in future versions.

AndersonQ avatar Dec 11 '25 07:12 AndersonQ

keep the requirement to use fingerprint, if not, error and don't start the input

No. There is no reason for Filebeat to exit, you should only warn. You do not want to cause a data collection outage over this because Filebeat is very likely doing more data collection than just reading actively rotating gzipped logs.

  • compression: missing/null/empty -> gzip off, every file is plain file
  • compression: gzip: every file is gzip. Error if the file is plain text
  • compression: auto: decompress GZIP, treat plain file as plain file.

Yes, but please use filelog and test to confirm we have interpreted its behaviour from its documentation correctly.

log input as filestream uses absent compression, everything is a plain file

Yes the log input should not support compression.

gzip_experimental: deprecated. It sets compression: auto instead and logs a warning saying to use compression and that it'll be removed in future versions.

I would just ignore this parameter and log that it's deprecated and explain what to do instead. We want to delete this parameter. It may be simpler to just delete it immediately (which will also cause it to be ignored).

cmacknz avatar Dec 11 '25 16:12 cmacknz

@orestisfl, @cmacknz it's ready for review :)

AndersonQ avatar Dec 12 '25 08:12 AndersonQ

  • compression: missing/null/empty -> gzip off, every file is plain file
  • compression: gzip: every file is gzip. Error if the file is plain text
  • compression: auto: decompress GZIP, treat plain file as plain file.

Yes, but please use filelog and test to confirm we have interpreted its behaviour from its documentation correctly.

I confirmed, it behaves like that

AndersonQ avatar Dec 12 '25 09:12 AndersonQ

@orestisfl,

@cmacknz @AndersonQ does enabling gzip files by default mean that previous users that use wildcards (/path/to/log*) that can now match gzip files will now match /path/to/log.tar.gz as well which means they will start unexpectedly ingesting more files?

Enabling GZIP ingestion has no effect on the paths glob matching; it matches all files regardless of their format. The trick we needed until now was to exclude the compressed files with exclude_files. That's why the suggested value for it is \.gz$, to prevent ingesting GZIP-compressed files.

Filestream will ingest anything; if a file isn't plain text, it'll ingest garbage, a string representation of the file's raw bytes. For example, a gzip file would end up like:

"message":"\u0013\ufffd˙4_\ufffd\ufffd\ufffd\ufffd>\ufffdy\u0008\ufffd*V\ufffdM\ufffd;=\u0002u=$\ufffd\u000e\ufffd\u0001\u0008\u0003\ufffdJ\ufffd\ufffd\ufffd#\ufffdǎ΃\ufffd\ufffd\u001c8\ufffd\ufffd\u0006K\ufffd!n\u000f\ufffd2[\ufffd\ufffd\ufffd4O\ufffd\r\ufffdw$͒d\ufffd\ufffd%\ufffd]{\ufffd\ufffd\ufffd^\u0011\ufffd\u001d\ufffd\ufffd\ufffd\ufffdU\u000b\u0002D\u0018d\ufffd.\ufffdNs\u0013\ufffd1b?\u000eB\u001e\ufffd\ufffd<\ufffd7\ufffdr\ufffdwyS\ufffd\t\u0001\ufffd[\ufffdA\u0004%\ufffdi\ufffd\ufffd&lmRxj\u0010\ufffd@\u001f\u0007\ufffd\u0011,\ufffdy\ufffd\ufffdQ\ufffd\ufffd&\ufffd\ufffdX>\ufffd\ufffd\ufffd <\ufffdǹ\ufffd\ufffd\ufffdw<\u001a\u0000OW$!dp\ufffd\ufffdh\ufffd\ufffd\ufffd8{\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd@]U7Pl\ufffd\ufffdz\u000fS\ufffd,H\u0017\ufffd",

AndersonQ avatar Dec 12 '25 11:12 AndersonQ

This pull request is now in conflicts. Could you fix it? 🙏 To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b 47880-gzip-default-on-ga upstream/47880-gzip-default-on-ga
git merge upstream/main
git push upstream 47880-gzip-default-on-ga

mergify[bot] avatar Dec 12 '25 20:12 mergify[bot]