
Benchmark GZIP support in filestream input

Open • AndersonQ opened this issue 6 months ago • 1 comment

Following up on https://github.com/elastic/beats/issues/44866, a comprehensive set of benchmarks is required to understand the performance characteristics, overhead, and potential risks associated with GZIP support in the filestream input.

Required Benchmarks

  • Benchmark performance overhead compared to plain-text ingestion (a minimal sketch follows the list).
  • Benchmark ingestion of many small GZIP files to assess memory usage and OOM risk.
  • Benchmark ingestion of a very large GZIP file (>64 GiB) to assess memory usage.
  • Benchmark the Kubernetes integration with a mix of plain and GZIP files, including rotation scenarios.
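
For orientation, a minimal sketch of the first comparison, measured outside of filebeat entirely: plain vs. GZIP line reading in Go. The testdata paths are made up, and this only illustrates the shape of the measurement, not the actual benchmark code.

package filestream_test

import (
    "bufio"
    "compress/gzip"
    "io"
    "os"
    "testing"
)

// countLines consumes r line by line, mimicking how a harvester reads a log.
func countLines(b *testing.B, r io.Reader) {
    sc := bufio.NewScanner(r)
    for sc.Scan() {
        // discard the line; we only care about read throughput
    }
    if err := sc.Err(); err != nil {
        b.Fatal(err)
    }
}

func BenchmarkPlain(b *testing.B) {
    for i := 0; i < b.N; i++ {
        f, err := os.Open("testdata/app.log") // made-up fixture
        if err != nil {
            b.Fatal(err)
        }
        countLines(b, f)
        f.Close()
    }
}

func BenchmarkGzip(b *testing.B) {
    for i := 0; i < b.N; i++ {
        f, err := os.Open("testdata/app.log.gz") // made-up fixture
        if err != nil {
            b.Fatal(err)
        }
        zr, err := gzip.NewReader(f)
        if err != nil {
            b.Fatal(err)
        }
        countLines(b, zr)
        zr.Close()
        f.Close()
    }
}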

AndersonQ avatar Jun 17 '25 13:06 AndersonQ

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

elasticmachine avatar Jun 17 '25 13:06 elasticmachine

The benchmarks I've done so far use benchbuilder, whose main metric is EPS, and they don't show any significant change in EPS:

The benchmark run: Image

In an early benchmark I ran locally, reading GZIP files was sometimes even faster, which supports that decompression isn't a bottleneck at all:

Image

I still need to do the CPU and memory comparison.

AndersonQ avatar Aug 13 '25 15:08 AndersonQ

Now a bigger benchmark: 1,000 files, 10 MB each.

Again, no significant change in EPS, but more memory is used when reading GZIP, which is expected.

Image Image

AndersonQ avatar Aug 15 '25 15:08 AndersonQ

I'm still trying to run a benchmark for a very large file (64 GB), but on CI the script generating the log files fails. I'm still trying to reproduce it locally to get to the root cause.

AndersonQ avatar Aug 15 '25 15:08 AndersonQ

Again, no significant change in EPS, but more memory is used when reading GZIP, which is expected.

My previous experience with concurrent gzip decoding is that the number of files we can harvest at once is going to be lower when they are gzip compressed, because we'll run out of memory faster.

Fleet Server had to pool its gzip readers to keep memory usage reasonable. https://github.com/elastic/fleet-server/pull/2994

It is good that we are not observing a major impact on EPS, so we should focus on memory consumption, because I think that will be our main problem.

cmacknz avatar Aug 15 '25 19:08 cmacknz

Fleet Server had to pool its gzip readers to keep memory usage reasonable. elastic/fleet-server#2994

FYI: those are gzip **Writers**. Memory usage between gzip reading and writing isn't symmetrical. We still need to check; just mentioning it so we aren't surprised if the results are different.
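
For context, the pooling technique from that PR looks roughly like the sketch below: a generic illustration of reusing gzip Writers via sync.Pool, not the actual Fleet Server code.

package main

import (
    "compress/gzip"
    "io"
    "sync"
)

// Pool gzip.Writers so concurrent compressions reuse existing Writers
// (~1MB of state each) instead of allocating fresh ones every time.
var gzipWriters = sync.Pool{
    New: func() any { return gzip.NewWriter(io.Discard) },
}

// compress writes src to dst in gzip format using a pooled Writer.
func compress(dst io.Writer, src []byte) error {
    zw := gzipWriters.Get().(*gzip.Writer)
    defer gzipWriters.Put(zw)
    zw.Reset(dst) // retarget the reused Writer at the new destination
    if _, err := zw.Write(src); err != nil {
        return err
    }
    return zw.Close() // flushes; Reset makes the Writer reusable later
}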

leehinman avatar Aug 15 '25 20:08 leehinman

There is an increase in memory consumption, but so far it hasn't been problematic. Here is the dashboard from which I took the screenshot.

Running benchbuilder in CI isn't easy once I start creating too many files, or files that are too big. So far I've only managed to run with 1,000 files. Also, I need to double-check, but I don't think we have metrics for how many files were being harvested at the same time.

AndersonQ avatar Aug 15 '25 20:08 AndersonQ

Also, the memory consumption of filestream should be proportional to the number of files being harvested at the same time. Harvesting GZIP files should increase memory consumption by a fixed amount per file. Thus, yes, it'll use more memory per file, but there will be an upper limit on what filebeat can harvest at the same time, regardless of whether the files are GZIP or not.
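
As a rough model of that claim (the constants get filled in by the measurements later in this thread):

    total filestream memory ≈ concurrent harvesters × (per-file buffers + fixed GZIP overhead per file)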

AndersonQ avatar Aug 15 '25 20:08 AndersonQ

I needed a break from refactoring, so I decided to do a quick test. I made 500 files with 200,000 records each using spigot, then "ingested" those files with the discard output and with the elasticsearch output going to mock-es. Then I gzipped all the files and ingested them again. All the data is from the 30s metrics. The metrics showed that 500 harvesters were running at the same time. I still have the 30s log files if you are curious about any of the other metrics.
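
For reference, the two output configurations for such a test would look roughly like the snippet below. output.elasticsearch with hosts is the standard Beats config key; output.discard.enabled is my assumption for how the testing-only discard output is switched on, and the mock-es address is made up.

# run 1: discard output (key assumed)
output.discard:
  enabled: true

# run 2: elasticsearch output pointed at mock-es (address made up)
output.elasticsearch:
  hosts: ["http://localhost:9200"]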

.monitoring.metrics.beat.cpu.total.time.ms

Gzip had little to no effect on the total amount of CPU time the filebeat process consumed.

Image

Image

.monitoring.metrics.beat.memstats.rss

Gzip did have an effect on RSS; it went up. And if I did the math right, it's about ~100 KB per harvester.

Image

Image

.monitoring.metrics.beat.memstats.memory_alloc

Gzip did have an effect on the runtime memory allocator.

Image

Image

.monitoring.metrics.libbeat.output.events.total

Gzip had an effect on the discard output; I'm guessing it has to do with the fact that the discard output does not have a bulk max size. But for the elasticsearch output there was no noticeable change in events per second.

Image

Image

leehinman avatar Aug 19 '25 01:08 leehinman

Edit: benchbuilder is still failing to run the benchmark for a large file (48 GB, so it won't run out of disk space). That's why each case used a different approach.


Benchmark: 1,000 files of 10 MB each

Benchbuilder run | Kibana dashboard

Findings:

  • Identical EPS
  • Identical CPU usage
  • Constant increase in memory usage for GZIP files.
Image

Benchmark: 1 huge file, ~60 GB

A local benchmark was conducted to evaluate the file reading process in isolation. This test instantiated the reader abstraction used by filestream (filestream.open), excluding overhead from other Filebeat components like processors and outputs.
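
A stripped-down sketch of this style of isolated read test follows. It uses plain os.Open plus compress/gzip directly instead of filestream's internal reader, and the paths are illustrative; the real test drives filestream's own abstraction.

package filestream_test

import (
    "compress/gzip"
    "io"
    "os"
    "testing"
)

func TestBenchmark_64gb(t *testing.T) {
    cases := []struct{ name, path string }{
        {"plain", "testdata/huge.log"}, // illustrative paths
        {"gzip", "testdata/huge.log.gz"},
    }
    for _, tc := range cases {
        t.Run(tc.name, func(t *testing.T) {
            f, err := os.Open(tc.path)
            if err != nil {
                t.Fatal(err)
            }
            defer f.Close()

            var r io.Reader = f
            if tc.name == "gzip" {
                zr, err := gzip.NewReader(f)
                if err != nil {
                    t.Fatal(err)
                }
                defer zr.Close()
                r = zr
            }
            // Stream to io.Discard: memory stays flat because nothing
            // beyond the readers' internal buffers is retained.
            if _, err := io.Copy(io.Discard, r); err != nil {
                t.Fatal(err)
            }
        })
    }
}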

go test -v -run TestBenchmark_64gb/plain -memprofile plain.mem.out -timeout=0
--- PASS: TestBenchmark_64gb (254.97s)
    --- PASS: TestBenchmark_64gb/plain (254.97s)

go test -v -run TestBenchmark_64gb/gzip -memprofile gzip.mem.out -timeout=0
--- PASS: TestBenchmark_64gb (334.15s)
    --- PASS: TestBenchmark_64gb/gzip (334.15s)

Initial tests showed a memory artifact, suggesting a leak, which was caused by collecting a CPU profile over the entire test duration. A second benchmark was therefore run without CPU profiling to gather accurate memory metrics.

go test -v -run TestBenchmark_64gb -timeout=0
=== RUN   TestBenchmark_64gb
=== RUN   TestBenchmark_64gb/plain
=== RUN   TestBenchmark_64gb/gzip
--- PASS: TestBenchmark_64gb (1130.28s)
    --- PASS: TestBenchmark_64gb/plain (251.57s)
    --- PASS: TestBenchmark_64gb/gzip (329.39s)
PASS

Findings:

  • Memory consumption is similar for both plain and GZIP files.
  • Memory usage remains low due to file streaming and on-the-fly decompression.
  • The GZIP reader does not continuously allocate new memory during operation.
  • GZIP files take longer to read, as expected.
Image Image Image

GZIP Reader vs. Writer and the Fleet Server Issue

According to the klauspost/compress docs:

Memory usage is typically 1MB for a Writer. stdlib is in the same range. If you expect to have a lot of concurrently allocated Writers consider using the stateless compress described below.

The Writer is significantly more memory-intensive than the Reader. The issue faced by Fleet Server was related to instantiating too many Writers at once. This is not the case for filestream, which only instantiates one Reader per file.

Additionally, the number of concurrent harvesters can be configured via harvester_limit. The results show that the additional memory per file is on the order of 100 KB. Compared to the overall memory usage of ~4.5 MB per file, this increase is negligible.
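
For illustration, capping concurrency in a filestream input looks roughly like this (the id, paths, and limit are made-up values). At 500 harvesters, the GZIP overhead would be about 500 × 100 KB ≈ 50 MB on top of roughly 500 × 4.5 MB ≈ 2.25 GB, i.e. around 2% extra:

filebeat.inputs:
  - type: filestream
    id: gzip-logs                 # illustrative
    paths:
      - /var/log/app/*.log.gz     # illustrative
    harvester_limit: 500          # bounds concurrent harvesters, and with them memory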

AndersonQ avatar Aug 22 '25 14:08 AndersonQ

I had noticed that the number of harvesters affected EPS, so I was curious to see how it scaled. I think it is interesting that even with 1024 harvesters we are faster than the 1-harvester case (at least on my M1 Mac). This is all using the elasticsearch output going to mock-es on the same host.

Image

Image

leehinman avatar Aug 22 '25 20:08 leehinman

I managed to run the benchmark for a ~48 GB file with benchbuilder. In this case we don't see a memory increase, as there is just one instance of the GZIP reader. The CPU usage is not only virtually the same, but also constant.

Image

AndersonQ avatar Aug 26 '25 07:08 AndersonQ

Now I get why I have a hard time reading your graphs: they don't display well in dark mode. I guess it's because they're SVG, not an image.

Image

AndersonQ avatar Aug 26 '25 07:08 AndersonQ

Now I get why I have a hard time reading your graphs: they don't display well in dark mode. I guess it's because they're SVG, not an image.

LOL, I never thought to check that. I thought SVG would be nicer than PNG, but obviously not if you can't "see" the graph. I'll replace them with PNG; trivial to do.

leehinman avatar Aug 26 '25 17:08 leehinman

Let's try again with PNG instead of SVG.

Image Image

leehinman avatar Aug 26 '25 17:08 leehinman

@cmacknz, with the current results, do you think we still need to do the "Benchmark the Kubernetes integration with a mix of plain and GZIP files, including rotation scenarios" item?

You had already suggested not doing the test on k8s and just simulating the scenario, and we have integration tests for log rotation. Also, as far as I know, we can't run filebeat while benchbuilder is generating logs.

There is already a benchmark with 1,000 x 10 MB files (10 MB is the standard size for log rotation in k8s) and one with a 48 GB file (our benchbuilder setup/"CI" won't support testing a 64 GB file without modifying the VM where it runs).

I could do a run comparing a few hundred plain files vs. a few hundred GZIP files vs. a few hundred mixed. However, I don't think it'll add any valuable new data.

What do you think?

AndersonQ avatar Aug 27 '25 11:08 AndersonQ

The benchmark results here show the performance hit of reading gzip is not a major concern.

I think we should track enabling reading of gz files in the k8s integration by default separately, and when we do that we should test what happens on a Kubernetes node that has 4 .gz logs and a single live plain text file just to see that reasonable things happen.

I don't think this needs to gate introducing this feature initially.

cmacknz avatar Aug 27 '25 20:08 cmacknz

The benchmark results here show the performance hit of reading gzip is not a major concern.

I think we should track enabling reading of gz files in the k8s integration by default separately, and when we do that we should test what happens on a Kubernetes node that has 4 .gz logs and a single live plain text file just to see that reasonable things happen.

I don't think this needs to gate introducing this feature initially.

@cmacknz, GZIP support in filebeat and in the filestream integration is being released as experimental, not yet GA. So enabling it by default on k8s doesn't seem quite aligned with filebeat and the filestream integration. Should we park adding GZIP to the k8s integration for a bit, or add it off by default and as experimental as well?

AndersonQ avatar Sep 09 '25 16:09 AndersonQ

is being released as experimental, not yet GA.

It should be beta; experimental is something we might remove, and this is not one of those.

So enabling it by default on k8s doesn't seem to be quite aligned with filebeat and the filestream integration

Enabling it by default as a new beta feature with the ability to easily turn it off is fine. This will ensure you hear about unexpected problems, whereas I'm not sure you'll see many people turn it on (or successfully turn it on and also remove the .gz exclusions we have everywhere by default).

cmacknz avatar Sep 09 '25 17:09 cmacknz

@karenzone, just an FYI: this is something we would need to document, particularly the beta/tech-preview status.

nimarezainia avatar Sep 10 '25 01:09 nimarezainia

is being released as experimental, not yet GA.

It should be beta; experimental is something we might remove, and this is not one of those.

OK, so I'll update the docs and the warning log to reflect that.

AndersonQ avatar Sep 10 '25 08:09 AndersonQ