go-libaudit
reorder structs for better memory usage
:green_heart: Build Succeeded
Build stats

- Start Time: 2022-07-11T06:09:21.143+0000
- Duration: 3 min 33 sec
Test stats :test_tube:

| Test | Results |
|---|---|
| Failed | 0 |
| Passed | 386 |
| Skipped | 40 |
| Total | 426 |
I'd like to see what impact this has on time spent in `runtime.scanobject`.

I've done some experiments with a family of types `type T1 struct { next *T1; pad [n]byte }` and `type T2 struct { pad [n]byte; next *T2 }`, benchmarking the allocation and GC reaping of the allocations, and I don't see any reliable difference between these layouts in terms of either performance or time spent in GC (particularly in `scanobject`) over a wide range of padding sizes (note that the variance in performance and in `scanobject` time is high).
It looks like `auparse.AuditMessage` could also be reordered. Do you want to fix that one too?
done :white_check_mark:
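For background on why such reordering can help, here is a minimal sketch of how field order changes a struct's size and its pointer-bearing prefix. The field types are hypothetical, chosen for illustration; this is not the actual `auparse.AuditMessage` layout:

```go
package main

import (
	"fmt"
	"unsafe"
)

// Hypothetical layout: on 64-bit platforms, each small field that
// precedes a pointer forces 7 bytes of padding so the pointer is
// 8-byte aligned.
type unordered struct {
	a bool // 1 byte + 7 bytes padding
	p *int // 8 bytes
	b bool // 1 byte + 7 bytes padding
	q *int // 8 bytes
}

// Pointers grouped first: no interior padding, and the GC only needs
// to scan the pointer-bearing prefix of the object.
type ordered struct {
	p *int // 8 bytes
	q *int // 8 bytes
	a bool // 1 byte
	b bool // 1 byte + 6 bytes trailing padding
}

func main() {
	fmt.Println(unsafe.Sizeof(unordered{})) // 32 on 64-bit
	fmt.Println(unsafe.Sizeof(ordered{}))   // 24 on 64-bit
}
```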
This is the output of the benchmarking that I did. Apart from the very large cases (padding with 2^16 bytes), there is no difference in behaviour between front-loaded and rear-loaded pointers. There is a significant difference between slab allocation and element-wise allocation, as expected, which I think is more likely the cause of the problem this change is intended to address.
| | time/op |
|---|---|
| First1-16 | 7.32ms ± 7% |
| Last1-16 | 6.76ms ± 8% |
| First2-16 | 7.30ms ± 8% |
| Last2-16 | 6.85ms ± 7% |
| First4-16 | 7.85ms ± 9% |
| Last4-16 | 8.16ms ± 8% |
| First8-16 | 38.7ms ± 7% |
| Last8-16 | 39.0ms ± 4% |
| First16-16 | 607ms ±80% |
| Last16-16 | 541ms ± 3% |
| First1LL-16 | 45.0ms ± 4% |
| Last1LL-16 | 46.0ms ± 3% |
| First2LL-16 | 44.2ms ± 3% |
| Last2LL-16 | 45.3ms ± 2% |
| First4LL-16 | 47.2ms ± 4% |
| Last4LL-16 | 49.5ms ± 3% |
| First8LL-16 | 113ms ± 2% |
| Last8LL-16 | 165ms ± 5% |
| First16LL-16 | 709ms ± 2% |
| Last16LL-16 | 1.79s ± 5% |
The details of the benchmarks:

There is a family of types (substitute `N` with the numbers above):

```go
type lastN struct {
	pad [1 << N]byte
	n   *lastN
}

type firstN struct {
	n   *firstN
	pad [1 << N]byte
}
```
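For concreteness, the `N = 8` instantiation used in the profiles further down would look like this (assuming direct substitution, as the description implies):

```go
// N = 8: a 256-byte pad alongside a single pointer.
type last8 struct {
	pad [1 << 8]byte
	n   *last8
}

type first8 struct {
	n   *first8
	pad [1 << 8]byte
}
```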
Each was then benchmarked with the following for slab allocation (the 16 case used only 1e5 elements due to memory constraints):
```go
func BenchmarkFirstN(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a := make([]firstN, 1e6)
		for j := range a {
			a[j].n = &a[(j+1)%len(a)]
		}
	}
}

func BenchmarkLastN(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a := make([]lastN, 1e6)
		for j := range a {
			a[j].n = &a[(j+1)%len(a)]
		}
	}
}
```
and element-wise allocation:

```go
func BenchmarkFirstNLL(b *testing.B) {
	for i := 0; i < b.N; i++ {
		const n = 1e6
		h := &firstN{}
		orig := h
		for j := 1; j < n; j++ {
			h = &firstN{n: h}
		}
		h.n = orig
	}
}

func BenchmarkLastNLL(b *testing.B) {
	for i := 0; i < b.N; i++ {
		const n = 1e6
		h := &lastN{}
		orig := h
		for j := 1; j < n; j++ {
			h = &lastN{n: h}
		}
		h.n = orig
	}
}
```
Looking at the current top 10 functions of `auditbeat` from Elastic Cloud QA, these small changes will make a difference.

As I'm not allowed to (test-)deploy to Elastic Cloud QA, I can't provide a diff at the moment.
I agree that if the changes made a difference it would be significant, given the proportion of the time spent in `scanobject`, but the change in ordering doesn't seem to make one as far as I can see, at least not for the sizes that we are dealing with.
```
$ go test -bench Last8$ -count 10 -cpuprofile last8.pprof
goos: darwin
goarch: amd64
pkg: scanobjord
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkLast8-16    31    37112835 ns/op
BenchmarkLast8-16    32    37242561 ns/op
BenchmarkLast8-16    31    38081059 ns/op
BenchmarkLast8-16    27    37187969 ns/op
BenchmarkLast8-16    32    37338076 ns/op
BenchmarkLast8-16    34    35064717 ns/op
BenchmarkLast8-16    32    35461664 ns/op
BenchmarkLast8-16    32    37593525 ns/op
BenchmarkLast8-16    31    38852988 ns/op
BenchmarkLast8-16    33    35373119 ns/op
PASS
ok      scanobjord      14.901s
$ go test -bench First8$ -count 10 -cpuprofile first8.pprof
goos: darwin
goarch: amd64
pkg: scanobjord
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkFirst8-16    31    35894025 ns/op
BenchmarkFirst8-16    28    37102746 ns/op
BenchmarkFirst8-16    31    36628196 ns/op
BenchmarkFirst8-16    33    37609142 ns/op
BenchmarkFirst8-16    32    35530509 ns/op
BenchmarkFirst8-16    32    37586055 ns/op
BenchmarkFirst8-16    28    37761853 ns/op
BenchmarkFirst8-16    32    35432062 ns/op
BenchmarkFirst8-16    32    37318492 ns/op
BenchmarkFirst8-16    31    35174971 ns/op
PASS
ok      scanobjord      14.623s
```
```
$ go tool pprof scanobjord.test last8.pprof
File: scanobjord.test
Type: cpu
Time: Jul 11, 2022 at 7:00pm (ACST)
Duration: 14.69s, Total samples = 44.07s (299.96%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 40560ms, 92.04% of 44070ms total
Dropped 87 nodes (cum <= 220.35ms)
Showing top 10 nodes out of 43
      flat  flat%   sum%        cum   cum%
   19240ms 43.66% 43.66%    23600ms 53.55%  runtime.scanobject
    4610ms 10.46% 54.12%     4610ms 10.46%  runtime.memmove
    4020ms  9.12% 63.24%     4570ms 10.37%  runtime.heapBitsSetType
    3050ms  6.92% 70.16%     3050ms  6.92%  runtime.memclrNoHeapPointers
    2710ms  6.15% 76.31%     3630ms  8.24%  runtime.gcWriteBarrier
    2330ms  5.29% 81.60%     2330ms  5.29%  runtime.heapBits.bits (inline)
    1850ms  4.20% 85.80%     1960ms  4.45%  runtime.heapBits.next (inline)
    1270ms  2.88% 88.68%    12570ms 28.52%  scanobjord.BenchmarkLast8
     850ms  1.93% 90.61%      860ms  1.95%  runtime.pthread_kill
     630ms  1.43% 92.04%      700ms  1.59%  runtime.pthread_cond_wait
(pprof) quit
$ go tool pprof scanobjord.test first8.pprof
File: scanobjord.test
Type: cpu
Time: Jul 11, 2022 at 7:01pm (ACST)
Duration: 14.50s, Total samples = 43.02s (296.69%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 39640ms, 92.14% of 43020ms total
Dropped 72 nodes (cum <= 215.10ms)
Showing top 10 nodes out of 53
      flat  flat%   sum%        cum   cum%
   18720ms 43.51% 43.51%    23010ms 53.49%  runtime.scanobject
    4710ms 10.95% 54.46%     4710ms 10.95%  runtime.memmove
    4030ms  9.37% 63.83%     4510ms 10.48%  runtime.heapBitsSetType
    2710ms  6.30% 70.13%     3620ms  8.41%  runtime.gcWriteBarrier
    2690ms  6.25% 76.38%     2690ms  6.25%  runtime.memclrNoHeapPointers
    2430ms  5.65% 82.03%     2430ms  5.65%  runtime.heapBits.bits (inline)
    1730ms  4.02% 86.05%     1780ms  4.14%  runtime.heapBits.next (inline)
    1390ms  3.23% 89.28%    12290ms 28.57%  scanobjord.BenchmarkFirst8
     670ms  1.56% 90.84%      680ms  1.58%  runtime.pthread_kill
     560ms  1.30% 92.14%      640ms  1.49%  runtime.pthread_cond_wait
```
The problem I see with the posted benchmark is that it focuses purely on the written benchmark, and there the changes are marginal. There are way more allocations happening in the scope of `auditbeat`, and my goal is to reduce those. The changes are small but will add up in total.
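To put the "add up" point in rough, purely hypothetical terms: if reordering trimmed 16 bytes of padding from each message and auditbeat allocated 10,000 messages per second, that would be about 160 KB/s less heap garbage, before counting any reduction in pointer-scan work. (These numbers are illustrative only, not measurements from auditbeat.)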
friendly ping @andrewkroh && @efd6

friendly ping @andrewkroh && @efd6 && @adriansr
> The problem I see with the posted benchmark is that it focuses purely on the written benchmark. There are way more allocations happening in the scope of auditbeat and my goal is to reduce this.
It seems like we should be able to create a benchmark that allocates similarly to Auditbeat and run it for long enough that it demonstrates an improvement either through the benchmark itself or via some other observation like CPU profiles from the benchmark, execution traces, etc. I'm just looking for some evidence that we are getting a benefit here.
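A minimal sketch of what such a benchmark might look like, assuming a simplified stand-in for the real message type; `messageOld`, `messageNew`, and their fields are hypothetical, not the actual `auparse.AuditMessage` layout:

```go
package audit_test

import "testing"

// Hypothetical stand-in with pointer-bearing fields scattered
// between scalars, forcing a long pointer-bearing prefix.
type messageOld struct {
	seq  uint32
	data string
	typ  uint16
	tags []string
	err  error
}

// The same fields with the pointer-bearing ones grouped first.
type messageNew struct {
	data string
	tags []string
	err  error
	seq  uint32
	typ  uint16
}

// Package-level sinks keep the allocations live so the GC has to
// scan them, as it would for messages buffered inside auditbeat.
var (
	sinkOld []*messageOld
	sinkNew []*messageNew
)

func BenchmarkAllocOld(b *testing.B) {
	b.ReportAllocs()
	sinkOld = sinkOld[:0]
	for i := 0; i < b.N; i++ {
		sinkOld = append(sinkOld, &messageOld{seq: uint32(i), data: "audit record"})
	}
}

func BenchmarkAllocNew(b *testing.B) {
	b.ReportAllocs()
	sinkNew = sinkNew[:0]
	for i := 0; i < b.N; i++ {
		sinkNew = append(sinkNew, &messageNew{seq: uint32(i), data: "audit record"})
	}
}
```

Running both variants with `-cpuprofile`, or capturing an execution trace, and comparing the `runtime.scanobject` share as was done above would provide the kind of evidence being asked for here.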