filebeat: make GZIP GA and add `compression` config
Proposed commit message
filebeat: Promote filestream GZIP support to GA
Promotes GZIP support in the filestream input from beta to General Availability.
Deprecates the `gzip_experimental` option in favour of the new `compression`
setting. Valid values:
- `""`: No compression (default).
- `"gzip"`: Treat all files as GZIP.
- `"auto"`: Auto-detect based on magic bytes.
Note: GZIP decoding requires `fingerprint` file identity for accurate offset
tracking. A warning is now logged if compression is enabled with a file identity
other than `fingerprint`.
Unit and integration tests have been updated to reflect these changes.
AI tools used: Cursor.
Checklist
- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- ~~[ ] I have made corresponding changes to the documentation~~
- ~~[ ] I have made corresponding change to the default configuration files~~
- [x] I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the `stresstest.sh` script to run them under stress conditions and the race detector to verify their stability.
- [ ] I have added an entry in `./changelog/fragments` using the changelog tool.
Disruptive User Impact
`gzip_experimental` has been removed. It is ignored, and only a warning directing users to use `compression` is logged if `gzip_experimental` is configured.
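For example, a config using the removed flag would migrate like this (a minimal sketch; the input id and paths are illustrative):
```yaml
filebeat.inputs:
  - type: filestream
    id: my-gzip-input
    paths:
      - /var/log/app/*.log*
    # gzip_experimental: true   # removed: now ignored and logs a warning
    compression: auto           # use this instead
```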
How to test this PR locally
Verify that combining GZIP with a non-fingerprint file identity logs a warning:
- create a config with `file_identity.native`:
```yaml
filebeat.inputs:
  - type: filestream
    id: test
    paths:
      - /tmp/*.log
    file_identity.native: ~
    compression: auto

path.home: /tmp/beats/home

output.file:
  enabled: true
  path: /tmp/beats/home/out
  filename: "output"

logging.level: debug
```
- run filebeat:
```sh
go run . --strict.perms=false -e -c ./filebeat.yml 2>&1 | grep message
```
- check it still starts, but logs the warning:
```json
{"log.level":"warn","@timestamp":"2025-12-12T09:32:06.622+0100","log.logger":"input","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/filestream.config.checkUnsupportedParams","file.name":"filestream/config.go","file.line":257},"message":"compression='auto' requires file_identity to be 'fingerprint'","service.name":"filebeat","ecs.version":"1.6.0"}
```
To verify it works with `compression: auto`:
- generate a gzip log file:
```sh
mkdir -p /tmp/beats/in
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
gzip /tmp/beats/in/log.ndjson
# generate another "active" log file
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
```
- use the following config file, adjusting it as you like:
```yaml
http:
  enabled: true

path.home: /tmp/beats/home

filebeat.inputs:
  - type: filestream
    id: gzip-input
    enabled: true
    paths:
      - /tmp/beats/in/log.ndjson*
    compression: auto

output.file:
  path: /tmp/beats/home
  filename: "output-file"

logging.level: debug
logging.metrics:
  level: debug
```
- run filebeat:
```sh
go run . --strict.perms=false -e -c ./filebeat.yml 2>&1 | grep message
```
- the output file has 200 lines (100 from each input file):
```sh
wc -l /tmp/beats/home/output-file*
200
```
To verify `compression: gzip` does not ingest plain files:
- generate a gzip log file:
```sh
mkdir -p /tmp/beats/in
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
gzip /tmp/beats/in/log.ndjson
# generate another "active" log file
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
```
- use the following config file, adjusting it as you like:
```yaml
http:
  enabled: true

path.home: /tmp/beats/home

filebeat.inputs:
  - type: filestream
    id: gzip-input
    enabled: true
    paths:
      - /tmp/beats/in/log.ndjson*
    compression: gzip

output.file:
  path: /tmp/beats/home
  filename: "output-file"

logging.level: debug
logging.metrics:
  level: debug
```
- run filebeat:
```sh
go run . --strict.perms=false -e -c ./filebeat.yml
```
- find the log:
```json
{"log.level":"warn","@timestamp":"2025-12-12T09:40:47.453+0100","log.logger":"input.filestream.scanner","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/filestream.(*fileScanner).GetFiles","file.name":"filestream/fswatch.go","file.line":511},"message":"cannot create a file descriptor for an ingest target \"/tmp/beats/in/log.ndjson\": failed to create gzip seeker: could not create gzip reader: gzip: invalid header","service.name":"filebeat","id":"gzip-input","ecs.version":"1.6.0"}
```
- the output file has 100 lines, coming from the gzip file only:
```sh
wc -l /tmp/beats/home/output-file*
100
```
To verify compression: "" ingest gzip file as a plain file:
- generate a gzip log file:
```sh
mkdir -p /tmp/beats/in
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
gzip /tmp/beats/in/log.ndjson
# generate another "active" log file
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
```
- use the following config file, adjusting it as you like:
```yaml
http:
  enabled: true

path.home: /tmp/beats/home

filebeat.inputs:
  - type: filestream
    id: gzip-input
    enabled: true
    paths:
      - /tmp/beats/in/log.ndjson*
    compression: ""

output.file:
  path: /tmp/beats/home
  filename: "output-file"

logging.level: debug
logging.metrics:
  level: debug
```
- run filebeat:
```sh
go run . --strict.perms=false -e -c ./filebeat.yml
```
- the output file has 120 lines: 100 from the plain file plus 20 garbage lines from the gzip file:
```sh
wc -l /tmp/beats/home/output-file*
120
```
- check that the data ingested from the gzip file is garbage:
```sh
grep "/tmp/beats/in/log.ndjson.gz" /tmp/beats/home/output-file* | tail -n 1
```
{"@timestamp":"2025-12-12T08:43:28.335Z","@metadata":{"beat":"filebeat","type":"_doc","version":"9.3.0"},"ecs":{"version":"8.0.0"},"host":{"name":"mokona-elastic"},"agent":{"version":"9.3.0","ephemeral_id":"6a649a03-1cdf-4e41-80d0-075929ce8542","id":"032785ed-d7ef-491e-913e-f92063b05844","name":"mokona-elastic","type":"filebeat"},"log":{"offset":5270,"file":{"path":"/tmp/beats/in/log.ndjson.gz","device_id":"64513","inode":"43516198","fingerprint":"3b29db3923a6ebd5f44bf71e437f57d9676bfd8a1bc6ae41cbfbd0f954da1863"}},"message":"\u0013\ufffd˙4_\ufffd\ufffd\ufffd\ufffd>\ufffdy\u0008\ufffd*V\ufffdM\ufffd;=\u0002u=$\ufffd\u000e\ufffd\u0001\u0008\u0003\ufffdJ\ufffd\ufffd\ufffd#\ufffdǎ\ufffd\ufffd\u001c8\ufffd\ufffd\u0006K\ufffd!n\u000f\ufffd2[\ufffd\ufffd\ufffd4O\ufffd\r\ufffdw$͒d\ufffd\ufffd%\ufffd]{\ufffd\ufffd\ufffd^\u0011\ufffd\u001d\ufffd\ufffd\ufffd\ufffdU\u000b\u0002D\u0018d\ufffd.\ufffdNs\u0013\ufffd1b?\u000eB\u001e\ufffd\ufffd<\ufffd7\ufffdr\ufffdwyS\ufffd\t\u0001\ufffd[\ufffdA\u0004%\ufffdi\ufffd\ufffd&lmRxj\u0010\ufffd@\u001f\u0007\ufffd\u0011,\ufffdy\ufffd\ufffdQ\ufffd\ufffd&\ufffd\ufffdX>\ufffd\ufffd\ufffd <\ufffdǹ\ufffd\ufffd\ufffdw<\u001a\u0000OW$!dp\ufffd\ufffdh\ufffd\ufffd\ufffd8{\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd@]U7Pl\ufffd\ufffdz\u000fS\ufffd,H\u0017\ufffd","input":{"type":"filestream"}}
- 20 lines from the gzip file:
```sh
grep "/tmp/beats/in/log.ndjson.gz" /tmp/beats/home/output-file* | wc -l
20
```
- 100 lines from the plain file:
```sh
grep '/tmp/beats/in/log.ndjson"' /tmp/beats/home/output-file* | wc -l
100
```
Related issues
- Closes https://github.com/elastic/beats/issues/47880
Is there a way to avoid the disruptive user behavior by separating gzip support into:
- enabled -- return an error for file identities other than fingerprint
- auto (default) -- automatically detect incompatibility and disable gzip if that's the case
- disabled -- same as `gzip_disabled: true`

Or have we decided it's better to enable gzip by default?
I don't think we've discussed an approach like that, and I don't recall any setting on beats being like that. Also, it would not be the first config option that isn't compatible with some other one. Even though I would not expect it to happen, having an 'auto' option means we'd have to keep it up to date if anything changes that might increase or decrease GZIP compatibility.
I don't see a major issue with this idea, but I don't know if there were discussions about this approach before.
@nimarezainia, @cmacknz any thoughts here?
- enabled -- return error for file identities other than fingerprint
- auto (default) -- automatically detect incompatibility, disable gzip if that's the case
- disabled -- same as gzip_disabled: true
There are only two states:
- You are actively trying to ingest a gzip file, in which case you should use the fingerprint file identity if you don't want the same content ingested twice on rotation. There is no other choice, but I don't think this situation needs to cause filebeat to stop.
- You are not ingesting any gzip files, either because there aren't any or because you excluded them. Nothing happens, you don't need the fingerprint file identity, and in this case erroring unconditionally because you aren't using fingerprint is incorrect.
So the smartest thing to do is only log about potential data duplication if we actually try to ingest a gzip file, and direct people to either switch to fingerprint or exclude the gzip files.
We don't need a new config parameter and we definitely can't have filebeat exit unconditionally anytime somebody didn't set gzip_disabled and also isn't ingesting any gzip files.
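Sketching that idea in Go (a minimal illustration with hypothetical names and message wording, not the actual filestream code): warn only when a gzip file is actually encountered with a non-fingerprint file identity, instead of refusing to run.
```go
package main

import "log"

// warnIfGZIPWithoutFingerprint logs (rather than errors) when a gzip file is
// actually about to be ingested with a non-fingerprint file identity.
// Hypothetical sketch; names and wording are illustrative only.
func warnIfGZIPWithoutFingerprint(fileIsGZIP bool, fileIdentity string) {
	if fileIsGZIP && fileIdentity != "fingerprint" {
		log.Printf("WARN: ingesting a gzip file with file_identity %q may duplicate "+
			"data on rotation; switch to 'fingerprint' or exclude gzip files via exclude_files",
			fileIdentity)
	}
}

func main() {
	warnIfGZIPWithoutFingerprint(true, "native")  // logs the warning, keeps running
	warnIfGZIPWithoutFingerprint(false, "native") // no gzip file involved: stays silent
}
```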
So if I'm understanding you correctly, the idea is to have gzip always enabled, and if there is a gzip file and the file identity isn't fingerprint, we log a warning and keep going? Is that what you meant?
@cmacknz I updated it as you requested
@orestisfl, @colleenmcginnis when you have time, could you please re-review?
@cmacknz @AndersonQ does enabling gzip files by default mean that existing users whose wildcards (`/path/to/log*`) can now match gzip files will also match `/path/to/log.tar.gz`, and therefore start unexpectedly ingesting more files?
I got some feedback from PM (Bill) that we may not want to have this be enabled by default just to minimize any chance of user disruption.
Since in general we want compatibility with filelog, we can follow its approach here, which is covered by the compression parameter: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/filelogreceiver/README.md

compression: Indicates the compression format of input files. If set accordingly, files will be read using a reader that uncompresses the file before scanning its content. Options are ``, `gzip`, or `auto`. `auto` auto-detects the file compression type; currently, gzip files are the only compressed files auto-detected, based on the ".gz" filename extension. The `auto` option is useful when ingesting a mix of compressed and uncompressed files with the same filelogreceiver.
So I would vote we add the same compression configuration to be exactly compatible. We would default to none/unspecified which treats everything as a plain file. We could add gzip which skips auto-detection and just assumes all files are gzip. We already have the auto mode which also only supports gzip today via IsGZIP.
https://github.com/elastic/beats/blob/92bde5572b436cacaf120fc9ddceec850995bcb9/filebeat/input/filestream/file.go#L220-L223
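For context, the filestream `auto` mode detects gzip by the file's magic bytes rather than its name. A minimal, self-contained sketch of such a check (the `isGZIP` helper below is hypothetical; the real `IsGZIP` in the linked `file.go` may differ in signature and details):
```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// gzipMagic is the two-byte magic number prefixing every gzip stream (RFC 1952).
var gzipMagic = []byte{0x1f, 0x8b}

// isGZIP reports whether the reader's content starts with the gzip magic bytes.
func isGZIP(r io.ReaderAt) (bool, error) {
	buf := make([]byte, len(gzipMagic))
	n, err := r.ReadAt(buf, 0)
	if n < len(gzipMagic) {
		if err == io.EOF {
			// too short to hold the magic number; cannot be gzip
			return false, nil
		}
		return false, err
	}
	return bytes.Equal(buf, gzipMagic), nil
}

func main() {
	f, err := os.Open("/tmp/beats/in/log.ndjson.gz")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	gz, err := isGZIP(f)
	fmt.Println("gzip:", gz, "err:", err)
}
```
Note that this differs from filelog's documented auto-detection, which keys off the ".gz" filename extension.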
I got some feedback from PM (Bill) that we may not want to have this be enabled by default just to minimize any chance of user disruption.
ok, makes sense.
So I would vote we add the same `compression` configuration to be exactly compatible. We would default to none/unspecified which treats everything as a plain file. We could add `gzip` which skips auto-detection and just assumes all files are gzip. We already have the `auto` mode which also only supports gzip today via `IsGZIP`.
Ok, so let me confirm the behaviour:
- keep the requirement to use fingerprint; if not, error and don't start the input
- `compression` missing/null/empty -> gzip off, every file is a plain file
- `compression: gzip` -> every file is gzip; error if the file is plain text
- `compression: auto` -> decompress GZIP, treat plain files as plain files
- the log input, unlike filestream, behaves as if `compression` were absent: everything is a plain file
- `gzip_experimental` -> deprecated; it sets `compression: auto` instead and logs a warning saying to use `compression` and that it'll be deprecated in future versions
keep the requirement to use fingerprint, if not, error and don't start the input
No. There is no reason for Filebeat to exit, you should only warn. You do not want to cause a data collection outage over this because Filebeat is very likely doing more data collection than just reading actively rotating gzipped logs.
- `compression` missing/null/empty -> gzip off, every file is a plain file
- `compression: gzip` -> every file is gzip; error if the file is plain text
- `compression: auto` -> decompress GZIP, treat plain files as plain files
Yes, but please use filelog and test to confirm we have interpreted its behaviour from its documentation correctly.
the log input, unlike filestream, behaves as if `compression` were absent: everything is a plain file
Yes the log input should not support compression.
`gzip_experimental`: deprecated. It sets `compression: auto` instead and logs a warning saying to use `compression` and that it'll be deprecated in future versions.
I would just ignore this parameter and log that it's deprecated and explain what to do instead. We want to delete this parameter. It may be simpler to just delete it immediately (which will also cause it to be ignored).
@orestisfl, @cmacknz it's ready for review :)
- `compression` missing/null/empty -> gzip off, every file is a plain file
- `compression: gzip` -> every file is gzip; error if the file is plain text
- `compression: auto` -> decompress GZIP, treat plain files as plain files

Yes, but please use filelog and test to confirm we have interpreted its behaviour from its documentation correctly.

I confirmed; it behaves like that.
@orestisfl,
@cmacknz @AndersonQ does enabling gzip files by default mean that existing users whose wildcards (`/path/to/log*`) can now match gzip files will also match `/path/to/log.tar.gz`, and therefore start unexpectedly ingesting more files?
Enabling GZIP ingestion has no effect on the paths glob matching. It always matches all files, regardless of their format.
The trick we needed until now was to exclude the compressed files with `exclude_files`. That's why the suggested value for it is `\.gz$`, to prevent ingesting GZIP-compressed files.
Filestream will ingest anything; if it isn't a plain file, it'll ingest garbage: a string representation of the file's raw bytes. For example, a gzip file would end up like:
"message":"\u0013\ufffd˙4_\ufffd\ufffd\ufffd\ufffd>\ufffdy\u0008\ufffd*V\ufffdM\ufffd;=\u0002u=$\ufffd\u000e\ufffd\u0001\u0008\u0003\ufffdJ\ufffd\ufffd\ufffd#\ufffdǎ\ufffd\ufffd\u001c8\ufffd\ufffd\u0006K\ufffd!n\u000f\ufffd2[\ufffd\ufffd\ufffd4O\ufffd\r\ufffdw$͒d\ufffd\ufffd%\ufffd]{\ufffd\ufffd\ufffd^\u0011\ufffd\u001d\ufffd\ufffd\ufffd\ufffdU\u000b\u0002D\u0018d\ufffd.\ufffdNs\u0013\ufffd1b?\u000eB\u001e\ufffd\ufffd<\ufffd7\ufffdr\ufffdwyS\ufffd\t\u0001\ufffd[\ufffdA\u0004%\ufffdi\ufffd\ufffd&lmRxj\u0010\ufffd@\u001f\u0007\ufffd\u0011,\ufffdy\ufffd\ufffdQ\ufffd\ufffd&\ufffd\ufffdX>\ufffd\ufffd\ufffd <\ufffdǹ\ufffd\ufffd\ufffdw<\u001a\u0000OW$!dp\ufffd\ufffdh\ufffd\ufffd\ufffd8{\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd@]U7Pl\ufffd\ufffdz\u000fS\ufffd,H\u0017\ufffd",