cilium icon indicating copy to clipboard operation
cilium copied to clipboard

ipsec: cache xfrm state list

Open marseel opened this issue 1 year ago • 19 comments

Reduces GC CPU usage and memory allocations coming from XfrmStateList. To ensure we have up-to-date cache, wrap all XfrmState related functions inside cache, which is invalidated whenever XfrmState changes.

This is follow-up to https://github.com/cilium/cilium/pull/32577 While that PR averages out CPU usage over time, in large cluster 100+ nodes amount of allocations coming from netlink.XfrmStateList() is high due to backgroundSync where we usually don't change any Xfrm states. This becomes more and more expensive as number of nodes increases.

Some pprofs/GC CPU time graphs. Before: Number of alloc objects: image Total GC time for 100 nodes: image After: Number of alloc objects: image Total GC time for 100 nodes: image

This would become even more important for larger clusters, as XfrmStateList allocations in backgroundSync are essentially O(n^2).

ipsec: Improve CPU usage of cilum-agent in large clusters

marseel avatar May 16 '24 15:05 marseel

/test

marseel avatar May 17 '24 12:05 marseel

/test

marseel avatar May 17 '24 12:05 marseel

@pchaigno I wonder which CI test runs IPSec privileged test? I wanted to double-check it runs, I was thinking that probably Conformance Runtime (ci-runtime) but junit shows 0 for privileged:

<testsuite name="github.com/cilium/cilium/pkg/datapath/linux/ipsec" tests="0" failures="0" errors="0" id="118" hostname="kind-bpf-next" time="0.130" timestamp="2024-05-17T12:41:11Z"></testsuite>

:shrug:

marseel avatar May 17 '24 13:05 marseel

@pchaigno I wonder which CI test runs IPSec privileged test? I wanted to double-check it runs, I was thinking that probably Conformance Runtime (ci-runtime) but junit shows 0 for privileged:

<testsuite name="github.com/cilium/cilium/pkg/datapath/linux/ipsec" tests="0" failures="0" errors="0" id="118" hostname="kind-bpf-next" time="0.130" timestamp="2024-05-17T12:41:11Z"></testsuite>

But it prints tests="0" for every single packages, doesn't it? I'm not sure we can rely on that. In the output, it does look like it's doing something: https://github.com/cilium/cilium/actions/runs/9128109980/job/25099903131#step:17:367.

pchaigno avatar May 17 '24 13:05 pchaigno

@marseel @pchaigno I originally reported this issue and I've been playing around with a patch on my side also. Instead of calling xfrm state list here and looping until we find a match, can we just use XfrmStateGet instead? This seems to find states just fine in my testing and avoids getting the whole list of states back in the first place.

I haven't tested the impact of this on the GC on a large cluster however, but wondering why we wouldn't want to do this instead of the looping logic?

I'm a bit wary of caching the states in general just because it feels like troubleshooting issues with the cache in the future may be difficult or hard to find.

sjdot avatar May 17 '24 18:05 sjdot

@sjdot I do agree that we should use XfrmStateGet and probably this part needs some refactoring in general.

While XfrmStateGet would work for background-sync triggered events which usually don't change anything, I am a bit worried about interleaving for other cases:

  • we use states list later in xfrmTemporarilyRemoveState and xfrmDeleteConflictingState which are fetched before performing any initial update/delete/add etc.
  • We ignore errors from XfrmStateDel in the initial part

I would say adding caching should be safer than refactoring that part (also safer for backports), as long as we ensure we don't have other calls that are modifying xfrmState with CI test.
If something external to Cilium modifies xfrmStates, then caching for 1 minute won't make any significant difference, as this is going to be only resolved by background-sync, which runs less often than 1 minute.

marseel avatar May 21 '24 10:05 marseel

Thanks @marseel

Yeah in my patch there are still cases where I have to call list and pass to the functions you mentioned (e.g. a state that needs to be deleted an re-added), I've also not handled the migration of old state so I agree back porting is painful/difficult.

I see the usefulness of the caching given this, I'm just nervous given our long history of very painful ipsec issues. Given it's a pretty significant change, do you think it would be feasible to put this behind a feature flag initially? I would like the ability to turn this off if we hit an issue we suspect might be related to be able to quickly verify if this is the cause or not.

sjdot avatar May 21 '24 15:05 sjdot

I see the usefulness of the caching given this, I'm just nervous given our long history of very painful ipsec issues. Given it's a pretty significant change, do you think it would be feasible to put this behind a feature flag initially? I would like the ability to turn this off if we hit an issue we suspect might be related to be able to quickly verify if this is the cause or not.

Yup, I think that definitely makes sense. Let me get that flag + CI coverage before merging.

marseel avatar May 21 '24 15:05 marseel

So I've tested it manually as well to make sure that flag works, we can disable caching by following helm values:

extraArgs:
  - --enable-ipsec-xfrm-state-caching=false

I intentionally made this flag hidden as I don't expect that we will need to turn it off.

For backports, I am considering turning it off as default and switching to true by default later on once we get more confidence on main branch.

marseel avatar May 22 '24 10:05 marseel

/test

marseel avatar May 22 '24 10:05 marseel

I just checked and it seems we don't run privileged runtime tests on main: PR for testing: https://github.com/cilium/cilium/pull/32686 Succeded: https://github.com/cilium/cilium/actions/runs/9208640026/job/25331399284#logs

I've checked that while trying to make a backport to hotix branch I've noticed that it fails: https://github.com/cilium/cilium/pull/32665 as we run tests there on Jenkins.

@sayboras is looking into fixing running these tests on main.

marseel avatar May 23 '24 13:05 marseel

@sayboras is looking into fixing running these tests on main.

Sorry for the trouble, this was missed as part of recent go test migration, I have scanned for all the files, seems like only two test suites are missing. The below PR should help.

https://github.com/cilium/cilium/pull/32687

sayboras avatar May 23 '24 14:05 sayboras

Converting to draft till tests are being correctly run.

marseel avatar May 23 '24 15:05 marseel

/ci-runtime

marseel avatar May 23 '24 16:05 marseel

/ci-runtime

marseel avatar May 23 '24 16:05 marseel

/ci-runtime

marseel avatar May 23 '24 17:05 marseel

Rebased to pick up https://github.com/cilium/cilium/pull/32687

marseel avatar May 24 '24 10:05 marseel

/test

marseel avatar May 24 '24 10:05 marseel

@borkmann friendly ping :)

marseel avatar May 29 '24 13:05 marseel