
Allow custom override of the datacodec registry

Open grayside opened this issue 2 years ago • 10 comments

Problem

I would like to unmarshal events sent with a registered data content type (e.g., application/json) using a custom codec. I do not want to disrupt all events using that content type.

Use cases for this include:

  • Use protojson to unmarshal application/json events for specific types with a protobuf-based schema
  • Build additional transformation logic into a custom codec to inject derived fields
  • Allow shipping experimental changes to a codec running in a production service without impacting all events of a given content type

Proposal

Allow the type definition passed to the datacodec to override the event's stated content type, giving the developer of the type granular control over how it's unmarshaled.

datacodec.Decode takes an interface{}, I propose we inspect it to see if it declares an override on how that type expects to be decoded.

  1. Define a new interface:

     type Overriding interface {
         ContentTypeOverride() string
     }

  2. Implement this interface on the provided type:

     func (t CustomDataType) ContentTypeOverride() string {
         return "application/x-example+json"
     }

  3. Register a codec for application/x-example+json.

  4. Modify the Decode function to honor the declared override:

func Decode(ctx context.Context, contentType string, in []byte, out interface{}) error {
        // Type-assert against the new interface; out is an interface{}.
        if o, ok := out.(Overriding); ok {
                contentType = o.ContentTypeOverride()
        }
        if fn, ok := decoder[contentType]; ok {
                return fn(ctx, in, out)
        }
        return fmt.Errorf("[decode] unsupported content type: %q", contentType)
}

Alternatives Considered

Modify CloudEvent's Data Content Type

Developers with the need to override the unmarshaling behavior modify the content type:

e.SetDataContentType("application/x-example+json")

However, this approach has two downsides:

  • The CloudEvent may not yet be in the overridden content type, so the object misrepresents its current state
  • Ownership of this override behavior is better encapsulated with the type definition than with the code using that type

grayside avatar Aug 11 '22 17:08 grayside

Thanks for the well written issue.

I was wondering what you imagine the integration code would look like, ie, what would the sample code be that leverages these overrides? I am trying to understand how you intend to use such a feature (and why it would want to be in the sdk vs handled in custom code with raw data access on the special case).

n3wscott avatar Aug 11 '22 19:08 n3wscott

Making this more specific, suppose I want to unmarshal the data payload of a CloudEvent delivered as application/json to a protobuf.

Type Definition

type ProtobufMessageData struct {
    // ...
}

func (t ProtobufMessageData) ContentTypeOverride() string {
        return "application/jsonpb"
}

Server Developer Experience

This differs very little from the default example.

import (
    "context"
    "log"

    cloudevents "github.com/cloudevents/sdk-go/v2"
    "example.com/mydatatypes"
    _ "example.com/cloudevents-go/format/protojson"
)

func receive(event cloudevents.Event) {
    var data mydatatypes.ProtobufMessageData
    if err := event.DataAs(&data); err != nil {
        log.Printf("event.DataAs: could not parse event as ProtobufMessageData: %v", err)
    }
}
}

func main() {
    // The default client is HTTP.
    c, err := cloudevents.NewClientHTTP()
    if err != nil {
        log.Fatalf("failed to create client, %v", err)
    }
    log.Fatal(c.StartReceiver(context.Background(), receive))
}

grayside avatar Aug 12 '22 01:08 grayside

It seems very odd for the target data struct to be able to override the encoding the cloudevent currently has. When the DataAs method converts the inner data into whatever you need it to be, it takes the data content encoding and the location of the data into account, and then delegates to the registered marshaler for that data content encoding.

How about another approach... it is rare that someone would not want the jsonpb lib to be used on proto messages. I could see it being useful for the data codec to consider both the datacontentencoding and the annotations on the type passed to Decode. We could cache this reflection to make it quick across multiple messages.

And we could expose a marshaler-resolving chain hook that the codec's Encode/Decode can use to locate the correct function to handle the data. The interface to that resolving chain could have three parts:

  • The wrapping event's datacontentencoding
  • The known annotations on the struct. (i.e.: xml, json, proto...)
  • and the struct (or reflected struct) to further inspect the annotations (this last one I would need more research and prototyping to get it right)

The end result would look something like this:

  • No changes to ProtobufMessageData
  • No changes to the samples.
  • A new registration for codecs:

func Decode(ctx context.Context, contentType string, in []byte, out interface{}) error {
        for _, c := range codecChain {
                if fn, ok := c(ctx, contentType, out); ok {
                        return fn(ctx, in, out)
                }
        }
        return fmt.Errorf("[decode] unsupported content type: %q", contentType)
}

Order would matter on the codec chain so maybe we would need a method that would let you insert a codec resolver to the head or the tail of that list.

There could be another method that does a cached inspection of out, looking for Go struct annotations, and returns a simple de-duplicated array of annotation keys, like ["json", "protobuf", "xml"]; the codec chain resolution function can then choose whether it cares about these.

TL;DR: I think the request is valid. You are asking for more control because there are other things you might want to do in the path of decoding, and it would be nice if that was hooked into the sdk so you can bring your own defaults and opinions. I think the provided struct already carries information we are not using that is important to be able to key off of. My counter proposal is to look into keying off that internal information, rather than adding an interface to the struct to control (really, reflect) the fact that these are not just json, they are proto (or whatever the case may be in the future).

Thoughts?

n3wscott avatar Aug 12 '22 16:08 n3wscott

I'm not sure I follow this idea. I believe you are suggesting that we don't want supplied types to have an awareness of the CloudEvent SDK, so instead we modify the codec system to have a concept of codecs self-selecting according to a priority order.

Did I correctly understand?

  • The datacontenttype or struct annotations are used to select which registered codecs to attempt
  • Codecs are applied in a prioritized list
  • The first successful codec cancels attempts at the remaining codecs

For my example, something like the following sequence would apply:

  • application/protojson codec is registered at "high priority"
  • My ProtobufMessageData type has protobuf and json tags
  • A CloudEvent comes in with datacontenttype application/json
  • The codec system asks the protojson codec if it is interested in this data, the codec sees datacontenttype=application/json and that the protobuf tag is on at least one property
  • protojson unmarshals the data
  • codec system returns

grayside avatar Aug 16 '22 16:08 grayside

@grayside Yes, you understand it correctly. And I think we could end up with something that does what you are expecting by default for structs that have the protobuf annotations, and we would add a hook into that codec selection process for the future to allow anyone else to change these rules as they need for their specific use-case.

n3wscott avatar Aug 16 '22 16:08 n3wscott

I've thought this over, and I think your solution has cleaner separation of concerns but is a lot more work to implement and optimize. I've also seen priority registry patterns like this create ordering conflicts in larger codebases. I'm going to canvass a few other patterns in this comment to see if I can find another approach that makes sense, so we have two viable options.

I wonder if there's still a clean way to use a more imperative approach. In the C# SDK, the data type is kept clean, but there's a CloudEventFormatter that wraps that type and binds the equivalent of a codec to it. This is a bit of an inspiration for my original proposal and the ones below:

  1. Override the registry (assuming we send along the type identifier):

     decode.AddDecoderOverride("com.github.pull_request.opened", "application/jsonpb")
    

    Downside: need to take a hard dependency on the SDK wherever this override is applied.

  2. Override the CloudEvent:

     event.AddDecoderOverride("application/jsonpb")
    

    Downside: The need to tweak the CloudEvent for customization can't be entirely hidden.

grayside avatar Aug 18 '22 17:08 grayside

@n3wscott do you have further thoughts on the direction? (If it helps I hope to contribute implementation.)

grayside avatar Aug 24 '22 22:08 grayside

I think we should do the harder thing I suggested. If that sounds like a plan we can get to work. 🙂

n3wscott avatar Aug 25 '22 00:08 n3wscott

Hi @n3wscott, this ended up being more complex than I initially thought, and I am reconsidering when it makes sense to pursue it. Thanks for helping identify a path forward!

grayside avatar Sep 08 '22 21:09 grayside

Hi all.. I've just come across this issue while looking for a solution to a different problem.

I'm trying to perform custom marshalling for CSV data: my data is a map[string]string, but I need to output the values as a CSV record in a defined column order (where the keys in the map are the column names).

For context, I am aggregating multiple events into a single batched CSV file, and encoding a CSV record into an event is an interim step. For now I'm avoiding the DataAs method and marshalling manually, but this isn't a good long-term solution for me.

To add my 2¢ to this issue: how about allowing a context.Context to be passed down into the codec? The codecs already accept a context, but DataAs() is hardcoded to context.Background(). If the caller's context could be propagated somehow, you could override the default application/json codecs with your own implementation that uses a context value to decide what to do, including falling back to the official implementation.

In your case, you could choose to marshal using your custom approach based on a schema being present on the context, or, for your experimentation example, based on sampling or tracing information on the context.

In my case, I can hold column names in the correct order in a []string on context, and use that in my encoder/decoder.

dan-j avatar Nov 17 '22 18:11 dan-j