evaluator: parsing antler CUE configs can exhaust system memory
What version of CUE are you using (cue version)?
$ cue version
cue version v0.5.0
go version go1.23.0
-buildmode exe
-compiler gc
DefaultGODEBUG asynctimerchan=1,gotypesalias=0,httplaxcontentlength=1,httpmuxgo121=1,httpservecontentkeepheaders=1,netedns0=0,panicnil=1,tls10server=1,tls3des=1,tlskyber=0,tlsrsakex=1,tlsunsafeekm=1,winreadlinkvolume=0,winsymlink=0,x509keypairleaf=0,x509negativeserial=1
CGO_ENABLED 1
GOARCH amd64
GOOS linux
GOAMD64 v1
Does this issue reproduce with the latest stable release?
Yes. It's the same or possibly worse in v0.10.0, with or without CUE_EXPERIMENT=evalv3.
What did you do?
I created a CUE package for Antler in this sce-tests repo. It's an Antler test config with 216 tests, which uses both large lists (generated programmatically with Go templates) and CUE list comprehensions, and that likely results in a large CUE graph.
The biggest culprit, it seems, is the Run list for my FCT tests. This creates a list of 1200 elements (using a Go template), which is used with a list comprehension to generate StreamClients. When Antler goes to unify the schema with the config using the CUE API, the process memory reported by top rises very quickly, and can completely exhaust the system memory, depending on the hardware. If I comment out this list, it's still slow and uses a lot of system memory compared to what I'd hope for, but it's at least much faster.
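To give an idea of the shape involved, here is a heavily simplified sketch. Run and StreamClients are the real field names from my config; the Wait/Length fields and the values are made up for illustration:

```cue
package sce

// A large literal list of flow parameters, normally generated by a Go
// template; the real Run list has 1200 elements.
Run: [
	{Wait: 0.12, Length: 49152},
	{Wait: 0.87, Length: 8192},
	{Wait: 1.43, Length: 262144},
	// ... and so on, up to 1200 entries
]

// A list comprehension that turns each Run entry into a StreamClient-like
// struct; unifying the result with the schema is where memory use takes off.
StreamClients: [
	for r in Run {
		Wait:   r.Wait
		Length: r.Length
	}
]
```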
To reproduce it, one can install Antler, pull the sce-tests repo, and run antler vet to parse the config. My hope is that this isn't necessary, and that just based on the description you can identify the category of performance problem referred to in the Performance umbrella issue, so I have a sense of whether or when this may be improved.
Also, I might be able to work around this by avoiding large lists, but the lists give users the flexibility to provide their own statistical distributions of wait times and flow lengths, so they can simply get long. On top of that, this project will eventually at least triple in size with more tests, so I'll have to solve this somehow, and am just looking for advice. Would this be any better in v0.11.0-alpha.1, or with any other config options?
What did you expect to see?
The config to parse reasonably quickly.
What did you see instead?
Excessive memory allocations.
A Linux laptop with 8G of RAM and 8G of swap runs out of memory entirely when parsing the config.
Another box with 16G of RAM and 8G of swap is able to parse the config without running out of memory, but just barely.
Is it possible to reproduce this slowness via the cue command on your sce-tests repo, for example via cue eval or cue export?
Yes, although it takes a few steps:
- Pull the sce-tests repo.
- Add the file fct.cue from the attached fct.cue.gz, which comes from the antler vet command, but is attached here so you don't have to install antler to generate it.
- Edit sce.cue to uncomment the section under "polya fct tests", which I currently have commented out to prevent the memory problem.
- To get the config schema, copy the file config.cue into the same directory and change its package to "sce".
- Run cue export.
When running cue export, everything runs on static CUE files outside of antler, and it shows the same memory problem. Be prepared, though, that a machine with 8G might become unusable and need a hard reboot. A machine with 16G should be able to handle it.
I upgraded my box to 64 GB RAM and did some testing with CUE v0.5.0 and CUE v0.11.0. This is the resident memory reported after the config is completely parsed:
- CUE v0.5.0: 9.6 GB
- CUE v0.11.0: 22.6 GB
- CUE v0.11.0 with CUE_EXPERIMENT=evalv3: 39.2 GB
I appear to be stuck on v0.5.0. Or, are there any other experimental flags I can try?
That's it for now. We still have some performance and memory usage work to be done on evalv3, so that's still our focus for issues like this one.
Just adding to this that I reduced the CPU and memory considerably by removing all the disjunctions I was using in my config schema.
CUE v0.5, with disjunctions:
54.40user 2.69system 0:29.84elapsed 191%CPU (0avgtext+0avgdata 10471328maxresident)k
0inputs+56outputs (0major+2691891minor)pagefaults 0swaps
/bin/time antler vet 54.40s user 2.70s system 191% cpu 29.849 total
CUE v0.5, without disjunctions:
6.76user 0.41system 0:04.37elapsed 164%CPU (0avgtext+0avgdata 1487272maxresident)k
0inputs+56outputs (0major+386054minor)pagefaults 0swaps
/bin/time /tmp/antler vet 6.77s user 0.42s system 164% cpu 4.374 total
CUE v0.11, without disjunctions, without CUE_EXPERIMENT=evalv3:
27.54user 1.20system 0:16.08elapsed 178%CPU (0avgtext+0avgdata 4745528maxresident)k
0inputs+56outputs (0major+1255867minor)pagefaults 0swaps
/bin/time /tmp/antler vet 27.55s user 1.21s system 178% cpu 16.090 total
CUE v0.11, without disjunctions, with CUE_EXPERIMENT=evalv3 (however this got a "field not allowed" error on a line number that doesn't make sense yet, so I'll try to sort this out later):
12.57user 1.32system 0:05.37elapsed 258%CPU (0avgtext+0avgdata 5138912maxresident)k
0inputs+56outputs (0major+1305055minor)pagefaults 0swaps
CUE_EXPERIMENT=evalv3 /bin/time /tmp/antler vet 12.58s user 1.33s system 258% cpu 5.375 total
At least this shows that in my case, the way I'm using disjunctions (to enforce that only one field is set in a struct, where those structs are themselves used inside of recursive structs) accounts for a pretty big portion of the resource consumption. I can make a workaround for this, but I also look forward to things returning to a v0.5 level of performance, or better, one day. 🤞
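For reference, the disjunction pattern I mean is roughly the following (a simplified sketch with made-up field names, not my actual schema): each branch allows exactly one field, and the definition refers to itself recursively, so the evaluator has to consider the disjunction at every level of the tree.

```cue
package sce

// Exactly one of Serial, Parallel or Sleep may be set, enforced by a
// disjunction of closed (definition) structs.
#Node: {Serial: [...#Node]} |
	{Parallel: [...#Node]} |
	{Sleep: string}

// Example value to unify with the schema.
tree: #Node & {
	Serial: [
		{Sleep: "1s"},
		{Parallel: [{Sleep: "2s"}, {Sleep: "3s"}]},
	]
}
```

(One possible workaround would be to make all of these optional fields in a single struct and check the "exactly one" constraint in antler itself, outside of CUE.)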
Don't spend energy reducing that one "field not allowed" error, as we are aware of regressions in that space: https://github.com/cue-lang/cue/issues/3601
We are going to continue with the performance work once these known regressions are fixed. Follow https://github.com/cue-lang/cue/issues/2850 for updates :)
Ok, that's a time saver, thanks.