
reorder structs for better memory usage

Open florianl opened this issue 1 year ago • 12 comments

Signed-off-by: Florian Lehner [email protected]

Reorder structs to reduce required memory.

florianl avatar Jul 08 '22 08:07 florianl

:green_heart: Build Succeeded


Build stats

  • Start Time: 2022-07-11T06:09:21.143+0000

  • Duration: 3 min 33 sec

Test stats :test_tube:

Test Results

  • Failed: 0

  • Passed: 386

  • Skipped: 40

  • Total: 426

elasticmachine avatar Jul 08 '22 08:07 elasticmachine

I'd like to see what impact this has on time spent in runtime.scanobject.

I've done some experiments with a family of types, type T1 struct { next *T1; pad [n]byte } and type T2 struct { pad [n]byte; next *T2 }, benchmarking both the allocation and the GC reaping of those allocations. Over a wide range of padding sizes I don't see any reliable difference between the two layouts, either in performance or in time spent in GC (particularly in scanobject). Note that the variance in both performance and scanobject time is high.

efd6 avatar Jul 09 '22 06:07 efd6

It looks like auparse.AuditMessage could also be reordered. Do you want to fix that one too?

done :white_check_mark:

florianl avatar Jul 11 '22 06:07 florianl

This is the output of the benchmarking that I did. Apart from the very large cases (padding with 2^16 bytes) there is no difference in behaviour between front-loaded and rear-loaded pointers. There is, as expected, a significant difference between slab allocation and element-wise allocation, which I think is more likely the cause of the problem this change is intended to address.

name          time/op
First1-16     7.32ms ± 7%
Last1-16      6.76ms ± 8%
First2-16     7.30ms ± 8%
Last2-16      6.85ms ± 7%
First4-16     7.85ms ± 9%
Last4-16      8.16ms ± 8%
First8-16     38.7ms ± 7%
Last8-16      39.0ms ± 4%
First16-16    607ms ±80%
Last16-16     541ms ± 3%
First1LL-16   45.0ms ± 4%
Last1LL-16    46.0ms ± 3%
First2LL-16   44.2ms ± 3%
Last2LL-16    45.3ms ± 2%
First4LL-16   47.2ms ± 4%
Last4LL-16    49.5ms ± 3%
First8LL-16   113ms ± 2%
Last8LL-16    165ms ± 5%
First16LL-16  709ms ± 2%
Last16LL-16   1.79s ± 5%

The details of the benchmarks:

There is a family of types (substitute N with the numbers above):

type lastN struct {
	pad [1 << N]byte
	n   *lastN
}

type firstN struct {
	n   *firstN
	pad [1 << N]byte
}

and each was then benchmarked with the following for slab allocation (for N = 16, only 1e5 elements were used due to memory constraints):

func BenchmarkFirstN(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a := make([]firstN, 1e6)
		for j := range a {
			a[j].n = &a[(j+1)%len(a)]
		}
	}
}

func BenchmarkLastN(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a := make([]lastN, 1e6)
		for j := range a {
			a[j].n = &a[(j+1)%len(a)]
		}
	}
}

and with the following for element-wise allocation:

func BenchmarkFirstNLL(b *testing.B) {
	for i := 0; i < b.N; i++ {
		const n = 1e6
		h := &firstN{}
		orig := h
		for j := 1; j < n; j++ {
			h = &firstN{n: h}
		}
		h.n = orig
	}
}

func BenchmarkLastNLL(b *testing.B) {
	for i := 0; i < b.N; i++ {
		const n = 1e6
		h := &lastN{}
		orig := h
		for j := 1; j < n; j++ {
			h = &lastN{n: h}
		}
		h.n = orig
	}
}

efd6 avatar Jul 11 '22 06:07 efd6

Looking at the current top 10 functions of auditbeat from Elastic Cloud QA, these small changes will make a difference:

(Screenshot from 2022-07-11 10-50-35: top 10 functions of auditbeat from Elastic Cloud QA)

As I'm not allowed to (test-)deploy to Elastic Cloud QA, I can't provide a diff at the moment.

florianl avatar Jul 11 '22 08:07 florianl

I agree that if the changes made a difference, it would be significant, given the proportion of time spent in scanobject. But as far as I can see, the change in ordering doesn't make one, at least not for the sizes that we are dealing with.

$ go test -bench Last8$ -count 10 -cpuprofile last8.pprof
goos: darwin
goarch: amd64
pkg: scanobjord
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkLast8-16    	      31	  37112835 ns/op
BenchmarkLast8-16    	      32	  37242561 ns/op
BenchmarkLast8-16    	      31	  38081059 ns/op
BenchmarkLast8-16    	      27	  37187969 ns/op
BenchmarkLast8-16    	      32	  37338076 ns/op
BenchmarkLast8-16    	      34	  35064717 ns/op
BenchmarkLast8-16    	      32	  35461664 ns/op
BenchmarkLast8-16    	      32	  37593525 ns/op
BenchmarkLast8-16    	      31	  38852988 ns/op
BenchmarkLast8-16    	      33	  35373119 ns/op
PASS
ok  	scanobjord	14.901s
$ go test -bench First8$ -count 10 -cpuprofile first8.pprof
goos: darwin
goarch: amd64
pkg: scanobjord
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkFirst8-16    	      31	  35894025 ns/op
BenchmarkFirst8-16    	      28	  37102746 ns/op
BenchmarkFirst8-16    	      31	  36628196 ns/op
BenchmarkFirst8-16    	      33	  37609142 ns/op
BenchmarkFirst8-16    	      32	  35530509 ns/op
BenchmarkFirst8-16    	      32	  37586055 ns/op
BenchmarkFirst8-16    	      28	  37761853 ns/op
BenchmarkFirst8-16    	      32	  35432062 ns/op
BenchmarkFirst8-16    	      32	  37318492 ns/op
BenchmarkFirst8-16    	      31	  35174971 ns/op
PASS
ok  	scanobjord	14.623s
$ go tool pprof scanobjord.test last8.pprof  
File: scanobjord.test
Type: cpu
Time: Jul 11, 2022 at 7:00pm (ACST)
Duration: 14.69s, Total samples = 44.07s (299.96%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 40560ms, 92.04% of 44070ms total
Dropped 87 nodes (cum <= 220.35ms)
Showing top 10 nodes out of 43
      flat  flat%   sum%        cum   cum%
   19240ms 43.66% 43.66%    23600ms 53.55%  runtime.scanobject
    4610ms 10.46% 54.12%     4610ms 10.46%  runtime.memmove
    4020ms  9.12% 63.24%     4570ms 10.37%  runtime.heapBitsSetType
    3050ms  6.92% 70.16%     3050ms  6.92%  runtime.memclrNoHeapPointers
    2710ms  6.15% 76.31%     3630ms  8.24%  runtime.gcWriteBarrier
    2330ms  5.29% 81.60%     2330ms  5.29%  runtime.heapBits.bits (inline)
    1850ms  4.20% 85.80%     1960ms  4.45%  runtime.heapBits.next (inline)
    1270ms  2.88% 88.68%    12570ms 28.52%  scanobjord.BenchmarkLast8
     850ms  1.93% 90.61%      860ms  1.95%  runtime.pthread_kill
     630ms  1.43% 92.04%      700ms  1.59%  runtime.pthread_cond_wait
(pprof) quit
$ go tool pprof scanobjord.test first8.pprof
File: scanobjord.test
Type: cpu
Time: Jul 11, 2022 at 7:01pm (ACST)
Duration: 14.50s, Total samples = 43.02s (296.69%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 39640ms, 92.14% of 43020ms total
Dropped 72 nodes (cum <= 215.10ms)
Showing top 10 nodes out of 53
      flat  flat%   sum%        cum   cum%
   18720ms 43.51% 43.51%    23010ms 53.49%  runtime.scanobject
    4710ms 10.95% 54.46%     4710ms 10.95%  runtime.memmove
    4030ms  9.37% 63.83%     4510ms 10.48%  runtime.heapBitsSetType
    2710ms  6.30% 70.13%     3620ms  8.41%  runtime.gcWriteBarrier
    2690ms  6.25% 76.38%     2690ms  6.25%  runtime.memclrNoHeapPointers
    2430ms  5.65% 82.03%     2430ms  5.65%  runtime.heapBits.bits (inline)
    1730ms  4.02% 86.05%     1780ms  4.14%  runtime.heapBits.next (inline)
    1390ms  3.23% 89.28%    12290ms 28.57%  scanobjord.BenchmarkFirst8
     670ms  1.56% 90.84%      680ms  1.58%  runtime.pthread_kill
     560ms  1.30% 92.14%      640ms  1.49%  runtime.pthread_cond_wait

efd6 avatar Jul 11 '22 09:07 efd6

The problem I see with the posted benchmark is that it focuses purely on the synthetic benchmark itself, and there the changes are marginal. There are many more allocations happening in the scope of auditbeat, and my goal is to reduce those. The changes are small but will add up in total.

florianl avatar Jul 11 '22 10:07 florianl

friendly ping @andrewkroh && @efd6

florianl avatar Jul 14 '22 06:07 florianl

friendly ping @andrewkroh && @efd6

florianl avatar Jul 21 '22 05:07 florianl

friendly ping @andrewkroh && @efd6 && @adriansr

florianl avatar Jul 29 '22 09:07 florianl

friendly ping @andrewkroh && @efd6 && @adriansr

florianl avatar Aug 04 '22 08:08 florianl

The problem I see with the posted benchmark is that it focuses purely on the synthetic benchmark itself. There are many more allocations happening in the scope of auditbeat, and my goal is to reduce those.

It seems like we should be able to create a benchmark that allocates similarly to Auditbeat and run it for long enough that it demonstrates an improvement either through the benchmark itself or via some other observation like CPU profiles from the benchmark, execution traces, etc. I'm just looking for some evidence that we are getting a benefit here.
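Such a benchmark might look like the following sketch. The struct fields here are hypothetical stand-ins, not the real auparse.AuditMessage; the point is to allocate and retain many message-like values so the GC repeatedly scans them, and to compare a layout with the pointer-bearing fields grouped first against one with them scattered between scalars.

```go
package main

import (
	"fmt"
	"testing"
)

// scattered interleaves pointer-bearing fields with scalars,
// spreading the GC pointer bitmap across the struct.
type scattered struct {
	seq  uint32
	data *string
	ts   int64
	tags []string
}

// grouped places all pointer-bearing fields first.
type grouped struct {
	data *string
	tags []string
	seq  uint32
	ts   int64
}

var retained any // kept live so the GC must scan it on every cycle

func BenchmarkScattered(b *testing.B) {
	for i := 0; i < b.N; i++ {
		msgs := make([]*scattered, 4096)
		for j := range msgs {
			s := "type=SYSCALL msg=audit(...)"
			msgs[j] = &scattered{seq: uint32(j), data: &s}
		}
		retained = msgs
	}
}

func BenchmarkGrouped(b *testing.B) {
	for i := 0; i < b.N; i++ {
		msgs := make([]*grouped, 4096)
		for j := range msgs {
			s := "type=SYSCALL msg=audit(...)"
			msgs[j] = &grouped{seq: uint32(j), data: &s}
		}
		retained = msgs
	}
}

func main() {
	fmt.Println("Scattered:", testing.Benchmark(BenchmarkScattered))
	fmt.Println("Grouped:  ", testing.Benchmark(BenchmarkGrouped))
}
```

Running the two benchmarks with -cpuprofile and comparing the time attributed to runtime.scanobject would be one way to surface the effect, if there is one.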

andrewkroh avatar Aug 04 '22 23:08 andrewkroh