Replace the ioutil.ReadAll usage
While working at the offsite and analyzing some of Filebeat's memory profiles, we saw that the Elasticsearch output was consuming memory while parsing the buffer from the _bulk response.
Looking further into it, we see that when we execute the bulk request and consume the response, we call ioutil.ReadAll(resp.Body), which is deprecated and consumes more memory, especially for big responses.
We should replace this with bytes.Buffer and possibly io.Copy, copying the response body into a consumable, reusable buffer.
While this issue points to a specific use of ReadAll, a look at the codebase of the other Beats suggests we can optimize those call sites as well.
The desired output is a Go benchmark that compares the before and after.
/cc @AndersonQ @pierrehilbert
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Trying something out here: https://github.com/elastic/beats/pull/36215
Let's get it straight first:
Looking further into it, we see that when we are executing the bulk request and consuming the response we call the ioutil.ReadAll(resp.Body) which is deprecated and consumes more memory, especially for those big responses.
The only reason it's deprecated is that it was moved to another package; it's now io.ReadAll.
We should replace this with the Buffer and maybe the io.Copy to copy the response Buffer into a consumable reusable Buffer.
Okay, I was not sure about this and collected some data:
Analysis
io.ReadAll allocates a 512-byte buffer and then re-allocates it to a larger size if the content exceeds it. I believe in most of our cases it would probably re-allocate.
Using io.Copy and bytes.Buffer works in a similar way: the first time you write to a bytes.Buffer, it allocates a slice to accommodate the chunk being written, with a minimum allocation of 64 bytes for chunks smaller than that. The next written chunk will probably cause another allocation. However, the buffer growth algorithm is quite sophisticated and probably more efficient in some edge cases.
Benchmarking
To prove the hypothesis I created this benchmark:
package main

import (
	"bytes"
	"fmt"
	"io"
	"io/ioutil"
	"testing"
)

func BenchmarkReadBytes(b *testing.B) {
	sizes := []int{
		100,         // 100 bytes
		100 * 1024,  // 100KB
		1024 * 1024, // 1MB
	}
	for _, size := range sizes {
		b.Run(fmt.Sprintf("size: %d", size), func(b *testing.B) {
			// emulate a file or an HTTP response
			generated := bytes.Repeat([]byte{'a'}, size)
			content := bytes.NewReader(generated)
			b.ResetTimer()

			b.Run("deprecated ioutil.ReadAll", func(b *testing.B) {
				for i := 0; i < b.N; i++ {
					content.Seek(0, io.SeekStart) // reset
					data, err := ioutil.ReadAll(content)
					if err != nil {
						b.Fatal(err)
					}
					if len(data) != size {
						b.Fatalf("size does not match, expected %d, actual %d", size, len(data))
					}
				}
			})

			b.Run("bytes.Buffer+io.Copy", func(b *testing.B) {
				for i := 0; i < b.N; i++ {
					content.Seek(0, io.SeekStart) // reset
					buf := bytes.NewBuffer(nil)
					_, err := io.Copy(buf, content)
					if err != nil {
						b.Fatal(err)
					}
					if buf.Len() != size {
						b.Fatalf("size does not match, expected %d, actual %d", size, buf.Len())
					}
				}
			})
		})
	}
}
And indeed, bytes.Buffer + io.Copy is significantly faster on every size I tested (100 bytes, 100KB, 1MB), and it allocates almost 5 times less memory:
go test -run=none -bench='BenchmarkReadBytes/' -benchmem -benchtime=1000x
goos: darwin
goarch: arm64
pkg: github.com/rdner/tar-test
BenchmarkReadBytes/size:_100/deprecated_ioutil.ReadAll-10 1000 226.6 ns/op 512 B/op 1 allocs/op
BenchmarkReadBytes/size:_100/bytes.Buffer+io.Copy-10 1000 173.9 ns/op 160 B/op 2 allocs/op
BenchmarkReadBytes/size:_102400/deprecated_ioutil.ReadAll-10 1000 55907 ns/op 514308 B/op 18 allocs/op
BenchmarkReadBytes/size:_102400/bytes.Buffer+io.Copy-10 1000 6875 ns/op 106544 B/op 2 allocs/op
BenchmarkReadBytes/size:_1048576/deprecated_ioutil.ReadAll-10 1000 349308 ns/op 5241107 B/op 27 allocs/op
BenchmarkReadBytes/size:_1048576/bytes.Buffer+io.Copy-10 1000 72474 ns/op 1048626 B/op 2 allocs/op
PASS
ok github.com/rdner/tar-test 0.700s
(Charts comparing CPU time and memory per operation, omitted here, tell the same story as the numbers above; smaller is better in both.)
Conclusion
Even a simple replacement of ioutil.ReadAll (io.ReadAll) with bytes.Buffer + io.Copy (no buffer re-use) can give us a significant boost in both CPU and memory performance. So, it's worth spending time on this change.
Additionally, we can also investigate places where we might know the fixed buffer size and re-use the same buffer multiple times like we do in filestream fingerprinting. However, this is more effort and will take significantly more time.
@rdner well done, great summary.
I estimated how big of a change it would be:
- ioutil.ReadAll currently has 40 affected files: https://github.com/search?q=repo%3Aelastic%2Fbeats%20%22ioutil.ReadAll%22&type=code
- io.ReadAll currently has 25 affected files: https://github.com/search?q=repo%3Aelastic%2Fbeats+%22io.ReadAll%22&type=code
- I created an optimized version of ReadAll just for HTTP responses here: https://github.com/elastic/elastic-agent-libs/pull/183
Once that's merged and we update to the new version of elastic-agent-libs, it's 65 files in total. It's a medium effort if we don't address all the linter issues in the affected files; otherwise it's going to be a much higher effort.
Just realised that we need to fix this in the https://github.com/elastic/elastic-agent-libs repository too (10 files).
- https://github.com/search?q=repo%3Aelastic%2Felastic-agent-libs%20ioutil.ReadAll&type=code
- https://github.com/search?q=repo%3Aelastic%2Felastic-agent-libs+io.ReadAll&type=code
I think we should prioritize the input side of Beats when consuming HTTP endpoints, especially for Metricbeat. During EAH I spoke with some folks from platform SRE who were complaining about the memory usage of the "agent", which was mostly collecting metrics while running as a pod inside MKI.
Given the impact that @rdner showed with this, I think we should start with replacing that part of the code, then talk with SRE-o11y and canary it to measure the impact.
@alexsapran I was planning to contribute to Go itself, replacing the ReadAll implementation there with the more efficient one. Then it's just a matter of updating to the next Go version and gaining all this performance for free. I'll keep you updated here.
@rdner I'm browsing around looking at perf issues... any progress on the fix to Go itself? Or should we go ahead and fix our ReadAll usages and get those performance gains soon? Or if we want to break this down further, are there particular usages which may give us a better ROI on our time spent refactoring?
I think we've already got SRE-o11y ready to monitor any perf gains, as Alex mentioned above.
Hey @rowlandgeoff. As @rdner is currently focused on our CI reliability, he hasn't made any progress on this for a while. If your team could help automate the triaging he is doing and fix some of the issues (not related to the tests themselves) we are having, it would allow us to come back to this important topic.
I found the optimizations discussed here to be incredibly insightful. Great work on the findings and investigation! Considering the current development cycle, Go 1.23 is frozen for the August release. Therefore, the most realistic timeframe for these changes to be included would be the Go 1.24 release in February 2025, assuming a CL with the improvements is approved and merged by then. The issue at go.dev/issues/50774 seems to have been stale for two years, so it will likely remain that way unless someone pushes the effort forward.
Readers and writers are excellent abstractions and two of the three most important interfaces in Go (alongside the empty interface). However, covering and optimizing all possible implementations and permutations of these is very difficult and time-consuming, often involving trade-offs. The Go standard library leverages these abstractions extensively, and if a particular case presents too many downsides, it is preferable to maintain stable performance in the standard library rather than optimize for a specific scenario.
That said, I would advise not waiting for this to land in the stdlib; the performance gains seem more than justifiable to keep an optimized version of io.ReadAll around.