RFC: de/serialization improvements
More than once I've been left wanting for better encoding/decoding of Value's in CUE's go API, so I wanted to open this issue to get some feedback on some improvements.
cue-specific Unmarshaller / Marshaller interfaces
I've already submitted f719cc4 to gerrithub for review to add a Unmarshaller interface to cue package, one to add a Marshaller interface that can be utilized by context.Encode will be there shortly as well. While the existing value.Decode method does a very good job, there's many limitations I've hit that can't be resolved without some more control over the process:
- structs that have interface types (or slices or maps of them, of course) require custom handling when unmarshalling, while a
"json:,ignore"tag works around it flat out failing you now can't simply rely on being able to call value.Decode and have a complete object graph. - when you do opt to just go the
"json:,ignore"route, other tools likecue get gono longer pay attention to the field.
new (or modified?) struct tag for additional un/marshalling features I'll link related issues when I find them, but there's two features in particular that I've personally wanted and have seen mentioned elsewhere.
While value.Decode does identify struct fields that are cue.Value's and simply sets them, allowing users to handle data that may be incomplete or deal with some edge cases like disjunctions (though the latter is mostly addressed by supporting custom Unmarshallers), this functionality could be obviously improved in two ways:
- Allow tagging a field to receive the cue.Value that was unmarshalled. This lets users have access to everything about the original struct, maximum control if there's data that can't easily be shaped or if they don't want to lose things like the underlying constraints for later validation, whatever.
- Allow tagging a field to receive fields of a struct that weren't placed elsewhere. This could either be a
map[string]anyor amap[string]cue.Value. There are many use cases where one may have open definitions, and at present there's not a good deal with them (you can implementjson.Unmarshaller, but if you have an open definition then who knows if you can expect other values to be complete or not and break when the API tries to first marshal the value to JSON).
While I'm at it, I'm sure more than one person has wished they could have CUE or JSON pay attention to a field in a struct but not the other, or to have the names vary, or to ignore a field for the JSON representation or use it with CUE, etc. Might as well have a dedicated struct tag for configuring un/marshalling in general, I feel.
There is already the cue tag, but I can't think of a good way to share it that's not extremely risky to parse and get things wrong. Suggestions welcome here.
This issue is about two feature requests - Marshaller+Unmarshaller interfaces, and then better struct tags to add features to the marshalling and unmarshalling.
Just so I understand better - do you think either of these is necessary, or both in combination?
It should be noted that the existing API and semantics (mapping fields, struct field tags, struct embedding, etc) are heavily based on Go's encoding/json package, and now encoding/json/v2 is becoming its replacement: https://github.com/golang/go/issues/71497
My 2c would be that we should consider rewriting our cue.Value marshalling and unmarshalling in terms of the proposed v2 semantics and code. It already includes many features which we'd find useful, and which are also related to your requests:
- options to set type-specific marshallers and unmarshalers
- options to either reject, discard, or add unknown members to a marked map or raw value field
- the ability to mark a field as "inline" to promote it when encoding/decoding without needing to embed a Go struct
- the ability to "skip" a marshal or unmarshal entirely via special sentinel errors
I'm not necessarily suggesting that this is all-or-nothing, but the existing design is rather limited given that it's based on encoding/json v1 from over a decade ago. encoding/json v2 has had lots of design work and careful thought put into it, so I'd prefer that we borrow some of its ideas for a new or refined API rather than try to come up with our own piecemeal improvements over the existing API.
@mvdan While there's a lot of components of the json/v2 work that are wholly unnecessary, given we aren't dealing with mapping to/from a serialized text form, I agree that there's a lot to take from the idea.
From what I'm seeing, however, the design of the Unmarshaler/Marshaler interfaces are remaining the same, and the Encoder/Decoder structs aren't of particular value to CUE given we're working with an in-memory AST representation rather than writing to/from text buffers. If I'm not mistaken, that pretty much leaves the updated options in the struct tags, and options to be passed to Encode/Decode (which can be implemented in an API compatible way by making both take a vararg Option set) as well as some sentinel error types, is that correct?
I'm fully onboard with the idea, if my assumptions here are correct.
My understanding is similar. We already have some form of encode options, but we have no decode options. Whatever interfaces we add for custom marshal/unmarshal on types, those must leave room for options to be passed along, because otherwise we'll be stuck later on with a hard limitation.
json/v2 uses an opaque options type, where constructors such as https://pkg.go.dev/github.com/go-json-experiment/json#DiscardUnknownMembers are exposed, and then a getter is available at https://pkg.go.dev/github.com/go-json-experiment/json#GetOption. I'd personally follow a similar model, because a simple struct with exported fields gives too much flexibility to users. We can likely reuse the existing EncodeOption, although note that json v2 has "joined" options functionality for the sake of not having to process a variadic list of options at every step of the way.
If/when we want to start actually supporting decode options, we can create a new entrypoint API accepting those, given that Decode currently takes none.
And then yes, as you say, the other json/v2 improvements we could borrow are the improved struct tags and the sentinel errors. However, I should note that the best way to do this would be to scrap the existing encode/decode we borrowed from json v1, and start again with the latest json v2 code, which already supports those. Trying to backport the features on top of a rather old and modified copy of json v1 is likely to cause multiple bugs and issues.
I agree that it would be wise to leave the existing code path (for the time being, at least), but I would propose we keep Decode as the one and only entrypoint. Since we only have to care about API compatibility for Go, adding varargs to value.Decode will continue to work without any options being provided - and in that case we can simply use the existing code path (we could add a UseV2 option as well, and note in the API docs for any new functionality it will silently be ignored without it until we're sure it's stable enough to be the default).
We can also replace DecodeOption with a shared type, as the existing NillIsAny option wouldn't be harmed being a no-op when passed to Encode(). For what it's worth, given the ability to 'join' options together easily, I think this is actually a fantastic idea since one could construct ahead of time a single set of options and then reuse them anywhere they convert to/from cue. This also matches the design of json/v2, they just indicate in the docs whether options affect one, or both.
The only thing I'm having a hard time with is the hardest problem at all in programming, naming things. But that said, here's how I envision the public API, everything else is internal implementation.
type ArshalOption interface {
// This is a placeholder signature, point is the function is unexported so
// t cannot be implemented outside the containing package
provideArshalValue() any
}
type EncodeOption = ArshalOpton
func NilsIsAny() ArshalOption {}
func (c *Context) Encode(x any, options ...ArshalOption) Value {}
func (v Value) Decode(x any, options ...ArshalOption) error {}
Encode already takes a vararg of EncodeOption's, we're just changing it to take the new ArshalOption type instead and adding a type alias to the old EncodeOption to preserve the API (do we really need to do this? I can't think of many cases people would, at present, store the one and only option that's there right now). Decode only takes a single argument right now, so adding the vararg options will not break the API.
Internals get a bit trickier based on how far down the rabbit hole you want to go, I've got some ideas on improving the design of the Option API compared to what json/v2 is proposing. Either way, (almost) all of the code for Decode() is nicely wrapped up in the existing decoder struct, and Encode wouldn't be hurt doing the same - we'd just switch whether we create v1 or v2 by the presence of an option that enables v2 for the present.
With that being said, the Unmarshaler/Marshaler interfaces I submitted patchsets for and their behavior would remain unchanged - anything additional I'm seeing in json/v2 is around control over the default behavior without custom Marshal/Unmarshal implementations as well as the Encoder stuff which isn't relevant to us. As such, I would still propose asking you to review those patches since they're low-hanging fruit that would be immediately useful as a stop-gap (the only behavior change is looking for an additional interface before trying the existing ones).
Beyond that, I'd be happy to start working on a patch series for a v2 Encode()/Decode() implementation based on what json/v2 is doing. I know there's also multiple CUE-specific ideas in the issues already that could be brought in (making sure objects conform to a schema when being unmarshalled, etc.)
I have the following change set staged on gerrithub for review to at least implement the new Marshal/Unmarshal interfaces, if you'd like to take a look. If we can come up with some specifics on options that would be useful for a v2 interface, I'm happy to continue working on things - but this is low hanging fruit that resolves some needs of my own and I'm sure plenty of others would find it useful as well in the current state. The second change is optional and I can rebase the third without it, but I figured while I was already in the decode logic some cleanup wouldn't be unwelcome.
1219262: cue: add support for custom CUE unmarshalling via Unmarshaler interface | https://review.gerrithub.io/c/cue-lang/cue/+/1219262 1219278: cue: refactor how Decode handles custom unmarshalers | https://review.gerrithub.io/c/cue-lang/cue/+/1219278 1219266: cue: add support for custom CUE marshaling via Marshaler interface | https://review.gerrithub.io/c/cue-lang/cue/+/1219266