cue: json encoding can use O(n^2) space
What version of CUE are you using (cue version)?
$ cue version
cue version v0.0.0-20230628162133-7be6224cbc4f
go version devel go1.21-f90b4cd655 Fri May 26 03:21:41 2023 +0000
-buildmode exe
-compiler gc
DefaultGODEBUG panicnil=1
CGO_ENABLED 1
GOARCH amd64
GOOS linux
GOAMD64 v1
vcs git
vcs.revision 7be6224cbc4f99e9caee308a9218f11654dbb621
vcs.time 2023-06-28T16:21:33Z
vcs.modified false
Does this issue reproduce with the latest stable release?
Yes
What did you do?
I ran this benchmark:
package main

import (
	"strings"
	"testing"

	"cuelang.org/go/cue/cuecontext"
)

func BenchmarkJSONMarshal(b *testing.B) {
	b.ReportAllocs()
	// Build a deeply nested list, [[[...0...]]], n levels deep (~20KB of CUE).
	n := 10000
	c := strings.Repeat("[", n) + "0" + strings.Repeat("]", n)
	ctx := cuecontext.New()
	val := ctx.CompileString(c)
	if err := val.Err(); err != nil {
		b.Fatal(err)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		data, err := val.MarshalJSON()
		if err != nil {
			b.Fatal(err)
		}
		_ = data
	}
}
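For anyone reproducing this, the benchmark can be run with the standard Go tooling, for example:

	go test -bench=BenchmarkJSONMarshal -run='^$'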
What did you expect to see?
Memory usage roughly proportional to the size of the data processed (20KB).
What did you see instead?
cpu: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
BenchmarkJSONMarshal-8 2 846455788 ns/op 330375888 B/op 180538 allocs/op
The encoder is allocating about 315MiB for each JSON marshal operation, which takes roughly 846ms. The reason is that the internal CUE marshaling logic does not write to a single buffer; instead it allocates a fresh buffer for each nested value, which is discarded once its contents have been appended to the outer buffer. Each level of nesting therefore copies all the bytes beneath it, so a value nested n levels deep allocates on the order of 1 + 2 + ... + n = O(n²) bytes in total. See cue.marshalList for an example of this kind of behaviour.
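For illustration, here is a minimal self-contained sketch of that pattern; the value type below is hypothetical and only stands in for CUE's internal values, with the real logic living in cue.marshalList and friends:

	package main

	import (
		"bytes"
		"encoding/json"
	)

	// value is a stand-in for a CUE value.
	type value struct {
		elems []value // nil means the scalar leaf 0
	}

	// MarshalJSON mirrors the problematic pattern: every nested value is
	// encoded into its own freshly allocated buffer, which the parent
	// copies into its buffer and then discards. The innermost bytes are
	// therefore copied once per level of nesting.
	func (v value) MarshalJSON() ([]byte, error) {
		if v.elems == nil {
			return json.Marshal(0)
		}
		var buf bytes.Buffer
		buf.WriteByte('[')
		for i, e := range v.elems {
			if i > 0 {
				buf.WriteByte(',')
			}
			b, err := e.MarshalJSON() // fresh buffer for each nested value
			if err != nil {
				return nil, err
			}
			buf.Write(b) // copied into the parent, then discarded
		}
		buf.WriteByte(']')
		return buf.Bytes(), nil
	}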
Although the example above is somewhat pathological in nature, all CUE values with nested structure will suffer from this issue to some extent.
Here's a graph illustrating allocated bytes vs n for the json and yaml encoders:
Here it is again with a log y-axis, which shows the curves tending towards the expected n² behaviour.
On one hand we're limited by the encoding/json API, since it doesn't allow reusing buffers in any way. On the other hand, most of this is our own fault: we encode structs and lists via recursive calls to the MarshalJSON method, which forces a new buffer to be allocated at every level of nesting. So we could do a lot better internally, even while continuing to satisfy the encoding/json API at the boundary.
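One plausible direction for an internal fix, sketched against the hypothetical value type above (error handling elided, since real CUE values can fail to encode): thread a single buffer through the recursion append-style, and keep MarshalJSON as a thin wrapper so the encoding/json contract is still honoured.

	// appendJSON writes the encoding of v directly onto dst instead of
	// allocating a fresh buffer per nested value, so every byte is
	// written exactly once and total allocation stays proportional to
	// the size of the output.
	func (v value) appendJSON(dst []byte) []byte {
		if v.elems == nil {
			return append(dst, '0')
		}
		dst = append(dst, '[')
		for i, e := range v.elems {
			if i > 0 {
				dst = append(dst, ',')
			}
			dst = e.appendJSON(dst)
		}
		return append(dst, ']')
	}

	// MarshalJSON would then reduce to: return v.appendJSON(nil), nil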
I also think it would be worthwhile to have a benchmark with relatively large cases for each value kind (e.g. long string, long list, big struct, deeply nested list, deeply nested struct) and to collect memory use numbers for each of our supported encodings: not just JSON and YAML, but also encodings we'll likely support in the future.
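A table-driven benchmark in the same file as the one above could cover those shapes for the JSON encoder; the sizes and exact CUE shapes below are illustrative and may need tuning (particularly the nesting depths), fmt would need adding to the imports, and the same table could later be pointed at the YAML encoder:

	func BenchmarkMarshalShapes(b *testing.B) {
		const n = 10000
		// Big flat struct: f0: 0 ... f9999: 0.
		var sb strings.Builder
		for i := 0; i < n; i++ {
			fmt.Fprintf(&sb, "f%d: 0\n", i)
		}
		cases := []struct {
			name string
			src  string
		}{
			{"longString", `"` + strings.Repeat("x", n) + `"`},
			{"longList", "[" + strings.Repeat("0,", n-1) + "0]"},
			{"bigStruct", sb.String()},
			{"deepList", strings.Repeat("[", n) + "0" + strings.Repeat("]", n)},
			{"deepStruct", strings.Repeat("a: {", n) + "b: 0" + strings.Repeat("}", n)},
		}
		ctx := cuecontext.New()
		for _, tc := range cases {
			val := ctx.CompileString(tc.src)
			if err := val.Err(); err != nil {
				b.Fatal(err)
			}
			b.Run(tc.name, func(b *testing.B) {
				b.ReportAllocs()
				for i := 0; i < b.N; i++ {
					if _, err := val.MarshalJSON(); err != nil {
						b.Fatal(err)
					}
				}
			})
		}
	}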