Replace the ioutil.ReadAll usage
While working at the offsite and analyzing some of Filebeat's memory profiles, we saw that the Elasticsearch output was consuming memory while parsing the buffer from the _bulk response.
Looking further into it, we see that when we execute the bulk request and consume the response, we call ioutil.ReadAll(resp.Body), which is deprecated and consumes more memory, especially for big responses.
We should replace this with bytes.Buffer and possibly io.Copy, copying the response body into a consumable, reusable buffer.
While this issue points to a specific use of ReadAll, a look at the codebase of the other Beats suggests we can optimize those call sites as well.
The desired output is a Go benchmark that compares the before and after.
/cc @AndersonQ @pierrehilbert
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Trying something out here: https://github.com/elastic/beats/pull/36215
Let's get it straight first:
Looking further into it, we see that when we are executing the bulk request and consuming the response we call the ioutil.ReadAll(resp.Body) which is deprecated and consumes more memory, especially for those big responses.
The only reason it's deprecated is that it was moved to another package; it's now io.ReadAll.
We should replace this with the Buffer and maybe the io.Copy to copy the response Buffer into a consumable reusable Buffer.
Okay, I was not sure about this and collected some data:
Analysis
io.ReadAll allocates a 512-byte buffer and then re-allocates it to a larger size if the content exceeds it. I believe in most of our cases it would probably re-allocate.
Using io.Copy and bytes.Buffer works in a similar way: the first time you write to a bytes.Buffer, it allocates a slice to accommodate the chunk being written, with a minimum allocation of 64 bytes for chunks smaller than that. The next written chunk will probably cause another allocation. However, the buffer growth algorithm is quite sophisticated and probably more efficient in some edge cases.
Benchmarking
To prove the hypothesis I created this benchmark:
package main

import (
	"bytes"
	"fmt"
	"io"
	"io/ioutil"
	"testing"
)

func BenchmarkReadBytes(b *testing.B) {
	sizes := []int{
		100,         // 100 bytes
		100 * 1024,  // 100KB
		1024 * 1024, // 1MB
	}
	for _, size := range sizes {
		b.Run(fmt.Sprintf("size: %d", size), func(b *testing.B) {
			// emulate a file or an HTTP response
			generated := bytes.Repeat([]byte{'a'}, size)
			content := bytes.NewReader(generated)
			b.ResetTimer()

			b.Run("deprecated ioutil.ReadAll", func(b *testing.B) {
				for i := 0; i < b.N; i++ {
					content.Seek(0, io.SeekStart) // reset
					data, err := ioutil.ReadAll(content)
					if err != nil {
						b.Fatal(err)
					}
					if len(data) != size {
						b.Fatalf("size does not match, expected %d, actual %d", size, len(data))
					}
				}
			})

			b.Run("bytes.Buffer+io.Copy", func(b *testing.B) {
				for i := 0; i < b.N; i++ {
					content.Seek(0, io.SeekStart) // reset
					buf := bytes.NewBuffer(nil)
					_, err := io.Copy(buf, content)
					if err != nil {
						b.Fatal(err)
					}
					if buf.Len() != size {
						b.Fatalf("size does not match, expected %d, actual %d", size, buf.Len())
					}
				}
			})
		})
	}
}
And indeed, bytes.Buffer + io.Copy is significantly faster on every size I tested (100 bytes, 100KB, 1MB), and it allocates almost 5 times less memory:
go test -run=none -bench='BenchmarkReadBytes/' -benchmem -benchtime=1000x
goos: darwin
goarch: arm64
pkg: github.com/rdner/tar-test
BenchmarkReadBytes/size:_100/deprecated_ioutil.ReadAll-10 1000 226.6 ns/op 512 B/op 1 allocs/op
BenchmarkReadBytes/size:_100/bytes.Buffer+io.Copy-10 1000 173.9 ns/op 160 B/op 2 allocs/op
BenchmarkReadBytes/size:_102400/deprecated_ioutil.ReadAll-10 1000 55907 ns/op 514308 B/op 18 allocs/op
BenchmarkReadBytes/size:_102400/bytes.Buffer+io.Copy-10 1000 6875 ns/op 106544 B/op 2 allocs/op
BenchmarkReadBytes/size:_1048576/deprecated_ioutil.ReadAll-10 1000 349308 ns/op 5241107 B/op 27 allocs/op
BenchmarkReadBytes/size:_1048576/bytes.Buffer+io.Copy-10 1000 72474 ns/op 1048626 B/op 2 allocs/op
PASS
ok github.com/rdner/tar-test 0.700s
(Charts comparing CPU time and memory per operation, omitted here, tell the same story as the numbers above; smaller is better in both.)
Conclusion
Even a simple replacement of ioutil.ReadAll (io.ReadAll) with bytes.Buffer + io.Copy (no buffer re-use) can give us a significant boost in both CPU and memory performance. So, it's worth spending time on this change.
Additionally, we can also investigate places where we might know the fixed buffer size and re-use the same buffer multiple times like we do in filestream fingerprinting. However, this is more effort and will take significantly more time.
@rdner well done, great summary.
I estimated how big of a change it would be:
- ioutil.ReadAll currently has 40 affected files: https://github.com/search?q=repo%3Aelastic%2Fbeats%20%22ioutil.ReadAll%22&type=code
- io.ReadAll currently has 25 affected files: https://github.com/search?q=repo%3Aelastic%2Fbeats+%22io.ReadAll%22&type=code
- I created an optimized version of ReadAll just for HTTP responses here: https://github.com/elastic/elastic-agent-libs/pull/183
Once that's merged and we update to the new version of elastic-agent-libs, it's 65 files in total. It's a medium effort if we don't address all the linter issues in the affected files; otherwise it's going to be a much higher effort.
Just realised that we need to fix this in the https://github.com/elastic/elastic-agent-libs repository too (10 files).
- https://github.com/search?q=repo%3Aelastic%2Felastic-agent-libs%20ioutil.ReadAll&type=code
- https://github.com/search?q=repo%3Aelastic%2Felastic-agent-libs+io.ReadAll&type=code
I think we should prioritize the input side of Beats when consuming HTTP endpoints, especially for Metricbeat. During EAH I spoke with some folks from platform SRE who were complaining about the memory usage of the "agent", which was mostly collecting metrics while running as a pod inside MKI.
Given the impact that @rdner showed with this, I think we should start with replacing that part of the code, then talk with SRE-o11y and canary it to measure the impact.
@alexsapran I was planning to contribute to Go itself, replacing the ReadAll implementation there with the more efficient one. Then it's just a matter of updating to the next Go version and gaining all this performance for free. I'll keep you updated here.
@rdner I'm browsing around looking at perf issues... any progress on the fix to Go itself? Or should we go ahead and fix our ReadAll usages and get those performance gains soon? Or if we want to break this down further, are there particular usages which may give us a better ROI on our time spent refactoring?
I think we've already got SRE-o11y ready to monitor any perf gains, as Alex mentioned above.
Hey @rowlandgeoff. As @rdner is currently focused on our CI reliability, he hasn't made any progress on this for a while. If your team could help automate the triaging he is doing and fix some of the issues (not related to the tests themselves) we are having, it would allow us to come back to this important topic.
I found the optimizations discussed here to be incredibly insightful. Great work on the findings and investigation! Considering the current development cycle, Go 1.23 is frozen for the August release. Therefore, the most realistic timeframe for these changes to be included would be the Go 1.24 release in February 2025, assuming a CL with the improvements is approved and merged by then. The issue at go.dev/issues/50774 seems to have been stale for two years, so it will likely remain that way unless someone pushes the effort forward.
Readers and writers are excellent abstractions and two of the three most important interfaces in Go (alongside the empty interface). However, covering and optimizing all possible implementations and permutations of these is very difficult and time-consuming, often involving trade-offs. The Go standard library leverages these abstractions extensively, and if a particular case presents too many downsides, it is preferable to maintain stable performance in the standard library rather than optimize for a specific scenario.
That said, I would advise not waiting for this to land in the stdlib; the performance gains seem more than justifiable to keep an optimized version of io.ReadAll around.