
Dev/test cycle when updating a Go file is untenably slow

Open mdbooth opened this issue 4 months ago • 1 comments

I made a change to operator/v1alpha1/types_crdchecker.go and executed:

PROTO_OPTIONAL=1 /usr/bin/time make update-codegen-crds verify-codegen-crds API_GROUP_VERSIONS=operator.openshift.io/v1alpha1

This is after applying a version of https://github.com/openshift/api/pull/2530 which ensures that invocations of codegen only target a single group/version.

The output was:

135.50user 301.99system 14:03.37elapsed 51%CPU (0avgtext+0avgdata 469820maxresident)k
12200inputs+195520outputs (22major+17349270minor)pagefaults 0swaps

So, 14 minutes. That's not quite as bad as reprinting your punch cards and waiting for time on the mainframe, but it's still not conducive to iterating on a problem you're trying to understand. This was on a beefy workstation-spec laptop.

Please can we either make this much faster (a minute or less), or clearly document a more streamlined method of rapidly iterating on a verify failure in a single api.go?

mdbooth avatar Oct 16 '25 11:10 mdbooth

I believe I know what's going wrong here, but I'm not 100% sure. I know from profiling that when running make update-codegen (with https://github.com/openshift/api/pull/2540 applied) codegen spends 33% of its time in the garbage collector, 33% of its time parsing code, and the remainder doing 'other'. There is vast scope here for optimisation gains.
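For anyone wanting a rough cross-check of the GC share without a full profile, the Go runtime exposes a cumulative figure via `runtime.MemStats.GCCPUFraction`. The sketch below is not how the numbers above were obtained; it is a standalone toy (the `gcCPUFraction` helper and its allocation loop are made up for illustration) showing where that figure could be read. Note that `GCCPUFraction` is relative to the program's available CPU time, so it will not line up exactly with a percentage read from a CPU profile.

```go
package main

import (
	"fmt"
	"runtime"
)

// gcCPUFraction is a hypothetical helper: it does some allocation-heavy work
// and then returns the fraction of the program's available CPU time that the
// garbage collector has used since the program started.
func gcCPUFraction(iterations int) float64 {
	var sink [][]byte
	for i := 0; i < iterations; i++ {
		// Allocate short-lived 1 KiB buffers to give the GC something to do.
		sink = append(sink, make([]byte, 1024))
		if len(sink) > 1000 {
			sink = sink[:0]
		}
	}
	_ = sink

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.GCCPUFraction
}

func main() {
	fmt.Printf("GC CPU fraction: %.4f\n", gcCPUFraction(200000))
}
```

In a real generator binary the `ReadMemStats` call would sit at the end of `main`, after the actual work, rather than after a synthetic loop.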

My theory is that we're holding gengo wrong. With #2540 applied we're running 6 generators in (by my hacked up reckoning) 49 group/versions. 2 of those generators (deepcopy and partial-manifests) make independent gengo invocations, so we're invoking gengo 98 times separately on a full update.
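The arithmetic behind the 98 figure, as a minimal sketch. It assumes every gengo-backed generator is invoked exactly once per group/version; the function name is made up for illustration.

```go
package main

import "fmt"

// gengoInvocations estimates the number of separate gengo runs in a full
// update, assuming each gengo-backed generator runs once per group/version.
func gengoInvocations(groupVersions, gengoGenerators int) int {
	return groupVersions * gengoGenerators
}

func main() {
	// Figures from the comment above: 49 group/versions and 2 generators
	// (deepcopy and partial-manifests) that each call gengo independently.
	fmt.Println(gengoInvocations(49, 2)) // prints 98
}
```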

We also do our own parsing before running anything. I have a patch to improve the performance of that here: https://github.com/openshift/api/pull/2543. However, afaict this is not used by any of the gengo generators.

Looking at just deepcopy: https://github.com/openshift/api/blob/9f08480a6c2046ad4faf398f4f99a1ae15ef023e/tools/codegen/pkg/deepcopy/generator.go#L81-L82

This is invoked once per api/group and then iterates over its versions, meaning that generateDeepcopyFunctions() is called 49 times.

Every time it is called it executes this code: https://github.com/mdbooth/api/blob/5f5e67647657dea5eb368f50fde6c86a5a7ae26f/tools/codegen/pkg/deepcopy/deepcopy.go#L40-L61

This is a fresh invocation of gengo.Execute() passing in a single gengogenerator.Target: the one for deepcopy. Note that we haven't passed in any of our previously parsed context to this function. Whatever this does with parsed go code, it's going to have to parse it again.

I don't pretend to understand the structure of a gengo execution, but looking at the Execute() function I can see that the first thing it does is create a Context. This appears to be where it would cache previously parsed objects. However it does it, because we're invoking it separately 98 times, I don't see how any 2 invocations can share any state.

In short, I suspect that we're re-parsing everything many, many times, and probably twice for each group/version.

I suspect that an 'easy' win would be to pass all generators to a single invocation of gengo per group/version. Beyond that, I suspect we may be able to do better still and invoke gengo just once for the whole repo.
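To make the suspected cost concrete, here is a toy model of the three strategies. It does not use gengo and makes no claim about gengo's internals: `parseCache`, `run`, `sharedDep` and the package list are all hypothetical stand-ins. The only assumption is the one stated above, namely that parsed state lives in a per-execution context and is not shared between executions.

```go
package main

import "fmt"

// sharedDep stands in for a dependency imported by every API package.
// The path is made up for illustration.
const sharedDep = "example.com/shared/meta"

// parseCache is a toy stand-in for per-execution parse state.
type parseCache struct {
	seen   map[string]bool
	parses int
}

func newParseCache() *parseCache {
	return &parseCache{seen: map[string]bool{}}
}

// load "parses" pkg unless this cache has already seen it.
func (c *parseCache) load(pkg string) {
	if c.seen[pkg] {
		return
	}
	c.seen[pkg] = true
	c.parses++
}

// run simulates one execution: it starts with a fresh cache, and every target
// loads every requested package plus the shared dependency. It returns the
// number of parses that execution performed.
func run(pkgs, targets []string) int {
	c := newParseCache()
	for _, p := range pkgs {
		for range targets {
			c.load(p)
			c.load(sharedDep)
		}
	}
	return c.parses
}

// perTarget models today's suspected behaviour: one execution per
// (package, target) pair.
func perTarget(pkgs, targets []string) int {
	total := 0
	for _, p := range pkgs {
		for _, t := range targets {
			total += run([]string{p}, []string{t})
		}
	}
	return total
}

// perGroupVersion models the 'easy' win: one execution per package, with all
// targets passed together.
func perGroupVersion(pkgs, targets []string) int {
	total := 0
	for _, p := range pkgs {
		total += run([]string{p}, targets)
	}
	return total
}

// wholeRepo models a single execution covering every package and target.
func wholeRepo(pkgs, targets []string) int {
	return run(pkgs, targets)
}

func main() {
	pkgs := []string{"operator/v1alpha1", "operator/v1", "config/v1"} // illustrative subset
	targets := []string{"deepcopy", "partial-manifests"}

	fmt.Println("per target:       ", perTarget(pkgs, targets))       // 12 parses
	fmt.Println("per group/version:", perGroupVersion(pkgs, targets)) // 6 parses
	fmt.Println("whole repo:       ", wholeRepo(pkgs, targets))       // 4 parses
}
```

In this model the per-group/version batching halves the parse count, and the whole-repo run additionally parses the shared dependency only once. Whether real gengo behaves this way depends on how its Context actually caches parsed packages, which I haven't verified.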

mdbooth avatar Oct 21 '25 12:10 mdbooth

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Jan 20 '26 09:01 openshift-bot