kopium
Add a brute force option to re-use identical structs
Currently, it's impossible for kopium to detect similar structs re-used at different hierarchies because the schemas all have to be inlined and we cannot use $ref in schemas. https://github.com/kubernetes/kubernetes/issues/62872
Suggestion
We should try to brute-force compare all structs (field names + types) to all previously seen structs in the analyzer to deduplicate so that we don't generate 3 different structs for things such as livenessProbe, resources, MatchLabels (under different names).
We already have the skeletons of such code, but it only checks for a few known ones like Condition here, and it's toggled (enabled by default) under the `--no-condition` flag.
If we pass existing state into the extractor functions (such as extract_container) we can then inspect current state before adding duplicates (ensuring we only take the first instance of a struct) via something like is_duplicate_struct. It would also avoid having an awkward O(n^2) algorithm at the end.
My thinking is that it might not be a particularly heavy operation on (at most) a few hundred kB of schema data.
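As a rough sketch of the idea (all names here are hypothetical, not kopium's actual internals), a duplicate check could compare field names and types against previously seen structs and return the first match's name:

```rust
use std::collections::BTreeMap;

/// Hypothetical simplified view of a generated struct: its name plus its
/// fields mapped to already-resolved Rust type strings.
#[derive(Debug, Clone, PartialEq)]
pub struct GeneratedStruct {
    pub name: String,
    pub fields: BTreeMap<String, String>, // field name -> type
}

/// Returns the name of a previously seen struct whose field names and types
/// all match `candidate`, regardless of the hierarchy-derived struct name.
/// Extractor functions could call this before pushing a new struct, so only
/// the first instance of each shape is kept.
pub fn is_duplicate_struct(
    candidate: &GeneratedStruct,
    seen: &[GeneratedStruct],
) -> Option<String> {
    seen.iter()
        .find(|s| s.fields == candidate.fields)
        .map(|s| s.name.clone())
}
```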
Limitations
This would not solve the problem where we are generating a type for something known; e.g. something like Condition might be deduplicated with this algorithm alone, but it would not point to the correct upstream type in e.g. k8s-openapi without further logic.
This is probably fine, however; it allows overrides to be done more easily (change / elide one struct rather than find-and-replace across multiple), and it's a good stepping stone towards identifying them with upstream variants (as is done for Condition).
More Context
- https://github.com/kube-rs/gateway-api-rs/issues/38
- https://github.com/kube-rs/kube/discussions/1667
@clux please assign me, I'm ready to have a go at this. 🖖
As written, this is valuable for https://github.com/kube-rs/gateway-api-rs/issues/38. Notably however, projects like Gateway API often have types that are shared between multiple CRDs.
As such, I would like to do some experimentation with adding support for handling multiple CRDs at once, searching them collectively for duplicates, and emitting multiple modules. Specifically, I would like to look into taking the current generate/analyze/struct-loop process and running it once per CRD, with a fan-out/fan-in step in the middle to collect and compare types across all CRDs, then emitting a special shared_types.rs module followed by the dependent ${CRD_PLURAL}.rs modules, e.g.:
.
└── src
└── gateway
├── gatewayclasses.rs
├── gateways.rs
├── httproutes.rs
├── mod.rs
└── shared_types.rs
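The fan-in step could be sketched roughly like this (all names hypothetical; kopium's real analyzer state will differ): after each per-CRD pass, a structural fingerprint of every generated struct is counted across CRDs, and any shape seen more than once is promoted to shared_types:

```rust
use std::collections::BTreeMap;

/// Hypothetical per-CRD analysis result: the CRD's plural name plus the
/// structs generated for it, keyed by a structural fingerprint (e.g. a
/// normalized listing of field names and types).
pub struct CrdOutput {
    pub plural: String,
    pub structs: BTreeMap<String, String>, // fingerprint -> struct name
}

/// Fan-in step: any fingerprint seen in more than one CRD is promoted to
/// the shared_types module; the rest stay in their ${CRD_PLURAL}.rs file.
pub fn partition_shared(
    outputs: &[CrdOutput],
) -> (Vec<String>, BTreeMap<String, Vec<String>>) {
    // Count how many CRDs produce each structural fingerprint
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for o in outputs {
        for fp in o.structs.keys() {
            *counts.entry(fp.as_str()).or_default() += 1;
        }
    }
    let mut shared = Vec::new();
    let mut per_crd: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for o in outputs {
        for (fp, name) in &o.structs {
            if counts[fp.as_str()] > 1 {
                // Shape appears in multiple CRDs -> shared_types.rs
                if !shared.contains(name) {
                    shared.push(name.clone());
                }
            } else {
                // Unique to this CRD -> its own module
                per_crd.entry(o.plural.clone()).or_default().push(name.clone());
            }
        }
    }
    (shared, per_crd)
}
```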
This new mode would be opt-in, leaving all the current behavior alone. LMKWYT? 🤔
I think this would be very useful, and a clear upgrade on the original suggestion. Happy to do reviews on sketches.
Your plan sounds sensible to me. A few quick throwaway thoughts on multiple crds:
- this would probably need another input flag that `conflicts_with` both `file` and `crd` (in main), but I see no problem with supporting such a mode, be it for an api group, or a listed set of crds (I usually find myself generating output based on a subset of crds in a group for controllers)
- possibly helpful to have another super-struct in `output.rs` for `Schema` or something that can encapsulate everything you find for one crd (instead of the simple `Output` we currently have), then any algorithm can run on a set of `Schema` objects
Due to time constraints I haven't been able to go at this in earnest and complete a PR. At this point too, we have a workaround for gateway-api-rs. So for now I think it makes more sense to unassign me and let someone else have a go at this if they want. Perhaps if nobody picks it up for a while I can see if I can take another swing at it later.