pkg/list: Sort is too slow
Originally opened by @vikstrous2 in https://github.com/cuelang/cue/issues/745
What version of CUE are you using (cue version)?
$ cue version v0.3.0-beta.4 linux/amd64
Does this issue reproduce with the latest release?
yes
What did you do?
Baseline:
root.cue
package kube
import "list"
_a: [{name: 1}, {name: 3}, {name: 4}, {name: 5}, {name: 6}, {name: 7}, {name: 8}, {name: 9}, {name: 0}]
_b: [_a, _a, _a, _a, _a, _a, _a, _a, _a, _a]
_c: [_b, _b, _b, _b, _b, _b, _b, _b, _b, _b]
_d: [_c, _c, _c, _c, _c, _c, _c, _c, _c, _c]
out: list.FlattenN(_d, 4)
Run:
$ time cue eval root.cue | wc -l
0.34user 0.01system 0:00.20elapsed 172%!C(MISSING)PU (0avgtext+0avgdata 68940maxresident)k
0inputs+0outputs (0major+1293minor)pagefaults 0swaps
18001
Replace the last line with this list.Sort command to see the performance impact of using list.Sort in this config:
out: list.Sort(list.FlattenN(_d, 4), {x: {}, y: {}, less: x.name< y.name})
Run:
$ time cue eval root.cue | wc -l
6.03user 0.07system 0:02.87elapsed 212%!C(MISSING)PU (0avgtext+0avgdata 72392maxresident)k
0inputs+0outputs (0major+2495minor)pagefaults 0swaps
18001
What did you expect to see?
negligible performance difference
What did you see instead?
10x slow down
Original reply by @mpvl in https://github.com/cuelang/cue/issues/745#issuecomment-775727746
Is this something you found in practice or just when playing around? We are aware of slowness in various places, but we tend to prioritize actual use cases.
Original reply by @vikstrous2 in https://github.com/cuelang/cue/issues/745#issuecomment-777467068
The use case is outputting a predictable yaml output when generating kubernetes configs so that we can diff the produced changes on PRs. It turned out that a few simple sorts are actually not enough for this anyway, so I'm using this yq command to clean the output of cue and make it git diffable: yq -s -S --yaml-output '.| sort_by(.metadata.name) | sort_by(.kind) | .[] | (.spec.template.spec.containers |select(.!=null) | .[].env |select(.!=null)) |= sort_by(.name) | (.spec.template.spec.containers |select(.!=null) | .[].volumeMounts |select(.!=null)) |= sort_by(.name) | (.spec.template.spec.volumes |select(.!=null)) |= sort_by(.name)'
Original reply by @mpvl in https://github.com/cuelang/cue/issues/745#issuecomment-780523037
Would it be useful to have a semantic diff? There is an internal diff package in the CUE, but we could expose it as a package and or command. If the latter, you could comment on Issue #8 on how you think this would best look like.
Original reply by @vikstrous2 in https://github.com/cuelang/cue/issues/745#issuecomment-816727642
I'll comment on #8 because that's closer to my use case.
Update: We've been running the above yq command on the output of CUE for a while now, but there are ~500 files to sort and now it takes longer to run yq so many times than it takes to generate the output in the first place. If we had associative lists and those were output in a sorted order, that would solve this issue. I tried to make my own associative lists with a for comprehension. I see that maps are output in a sorted order, but for comprehensions (li: [for x in obj {x}]) are not, so I'm still running into the issue with sorting being slow because I still need to sort the output of the for comprehension.
To be fair, I haven't tried applying the associative list + sort pattern everywhere because I'm worried that I'll run into the same performance issue after refactoring my config.
@vikstrous2 do you have an example where the output is not stable? (https://github.com/cue-lang/cue/wiki/Creating-test-or-performance-reproducers)
CUE uses a topological sort due to conjunctions and such. In my experience it is stable with both JSON and Yaml output of kubernetes manifests, though not lexicographically sorted.
You might also try v0.4.0 but this has been the behavior in the 0.3 series as well (iirc)
Sorry for the confusion. The issue is not that the output is not stable between runs with the same config. The issue is that refactors cause fields to be reordered even though it's rendering equivalent configs. I realized now that actually maps are output in the same order the are defined.
The following two configs should produce the same output because the are equivalent:
b: "b"
a: "a"
a: "a"
b: "b"
The fact that they produce different output makes it infeasible to review the diff in the generated output when refactoring. (when working with large configs where there might be 1000 changes that are just noise)
@vikstrous2 thanks for the additional detail.
As noted in #474 we are looking to tighten up this behaviour in the spec, so that should help.
I realized now that actually maps are output in the same order the are defined.
Indeed, that should be the case above, despite the fact this is not guaranteed in the spec. So given this is the case, are you running into a case where this is not happening? i.e. why are you finding that sort is required? Without a sort the output might well differ from the sorted result, but if you accept that diff then from that point on the diff should be "true", no?
Note that I'm not pushing back on the fact that we should fix the speed of sort; that we should fix it is a given (and thank you for the reproducer above). Rather, I'm trying to see if there is a practical workaround for now.
It sounds like you have CUE as the source of truth - do you even need to review the diff of the yaml output? I ask that question, again not to undermine the need for a semantic diff package, but to understand more about your use case.
For ease of reproduction, here is a testscript txtar reproducer
exec time -f '%U user' cue eval root1.cue -o out1.yaml
exec time -f '%U user' cue eval root2.cue -o out2.yaml
-- root1.cue --
package kube
import "list"
_a: [{name: 1}, {name: 3}, {name: 4}, {name: 5}, {name: 6}, {name: 7}, {name: 8}, {name: 9}, {name: 0}]
_b: [_a, _a, _a, _a, _a, _a, _a, _a, _a, _a]
_c: [_b, _b, _b, _b, _b, _b, _b, _b, _b, _b]
_d: [_c, _c, _c, _c, _c, _c, _c, _c, _c, _c]
out: list.FlattenN(_d, 4)
-- root2.cue --
package kube
import "list"
_a: [{name: 1}, {name: 3}, {name: 4}, {name: 5}, {name: 6}, {name: 7}, {name: 8}, {name: 9}, {name: 0}]
_b: [_a, _a, _a, _a, _a, _a, _a, _a, _a, _a]
_c: [_b, _b, _b, _b, _b, _b, _b, _b, _b, _b]
_d: [_c, _c, _c, _c, _c, _c, _c, _c, _c, _c]
out: list.Sort(list.FlattenN(_d, 4), {x: {}, y: {}, less: x.name < y.name})
which with 0f53054d gives:
> exec time -f '%U user' cue eval root1.cue -o out1.yaml
[stderr]
0.70 user
> exec time -f '%U user' cue eval root2.cue -o out2.yaml
[stderr]
8.87 user
Reasonable questions. Thanks for asking them.
So given this is the case, are you running into a case where this is not happening?
No, I'm not running into cases where the output order is unexpected. The order just changes between refactors because fields are often moved around. Sometimes there's no way to do a refactor without moving fields around and it's hard / tedious to split it up into two refactors where one of them keeps the order.
Without a sort the output might well differ from the sorted result, but if you accept that diff then from that point on the diff should be "true", no?
Yes, we can accept the diff, of course. But we'd have to do it blindly unless we go through it and manually compare every changed line to make sure nothing unexpected sneaks through the review. This is hard if the diff is 1000 lines, but only 1 meaningful change is expected.
Note that I'm not pushing back on the fact that we should fix the speed of sort
I actually think that if the output of maps was sorted and that also led to for comprehensions being sorted, that might be a better solution than having to manually sort the output. I don't know if this is feasible or if it's a complete solution (there might be a lot of for comprehensions that interact with each other and might break sorting anyway).
do you even need to review the diff of the yaml output?
Definitely. It acts as a test that can be verified easily. A lot of engineers are not proficient at writing CUE and it's easy to accidentally make mistakes that can be caught when reviewing the generated output. We don't render every possible config (we generate different things for different environments), but we render one example config and check that in so it's part of the PR review.
I commented on the semantic diff issue, but I'm thinking that it doesn't provide significantly more value than diffing well sorted and normalized yaml files.
I actually think that if the output of maps was sorted and that also led to for comprehensions being sorted, that might be a better solution than having to manually sort the output.
This is definitely an option, but I don't think it's necessarily the right default.
A lot of engineers are not proficient at writing CUE and it's easy to accidentally make mistakes that can be caught when reviewing the generated output.
A very good point.
I commented on the semantic diff issue, but I'm thinking that it doesn't provide significantly more value than diffing well sorted and normalized yaml files.
Generally speaking I think diff will be more useful, because it is a semantic diff across different encodings. But I can see how in your case a more specific diff would be more useful.
Thanks very much for the additional colour.
There is some relation here to https://github.com/cue-lang/cue/issues/1180, in so far as we probably need to provide more import/export options.
list.Sort should now be significantly faster. There are still significant possible optimizations if the values for T (or x and y) are idempotent. But I hope for now this will be of help.
Dropping the now-defunct v0.4.x milestone from this issue, but leaving the zGarden label such that we come round to considering what milestone this should sit in.