spec icon indicating copy to clipboard operation
spec copied to clipboard

handling of datacontenttype is inconsistent

Open kevinbader opened this issue 4 years ago • 52 comments

CloudEvents 1.0

Consider this example, straight from the spec:

{
    ...
    "datacontenttype" : "text/xml",
    "data" : "<much wow=\"xml\"/>"
}

Clearly, data is some structure that has been encoded using the XML format and put into the event as a string (binary). Naturally, I'd assume the same behavior for JSON encoding:

{
    ...
    "datacontenttype" : "application/json",
    "data" : "{\"foo\": \"bar\"}"
}

However, that's doesn't seem to be the case; as the example in the HTTP protocol binding spec shows, the JSON object is not sent in its encoded form but rather nested into the event directly:

{
    ...
    "datacontenttype" : "application/json",
    "data" : {
        "foo": "bar
    }
}

Note that removing the optional datacontenttype attribute doesn't change this, as the spec clearly states:

A JSON-format event with no datacontenttype is exactly equivalent to one with datacontenttype="application/json".

To sum it up, it is not possible to put a JSON-encoded data blob into a CloudEvent; and a parser needs to treat application/json different than any other datacontenttype.

HTTP Protocol Binding 1.0

For structured content mode, the spec says:

The chosen event format defines how all attributes, and data, are represented.

Does this mean that datacontenttype must be present and set to the event format? Or does structured mode implicitly change the default of datacontenttype from application/json to whatever event format is in use? What if datacontenttype is present and set to a different encoding - must a parser treat this event as malformed?

JSON Event Format 1.0

As a side note, the JSON Format spec makes this even more confusing:

If the implementation determines that the type of data is Binary, the value MUST be represented as a JSON string expression containing the Base64 encoded binary value, and use the member name data_base64 to store it inside the JSON object.

This basically says that you have to Base64-encode any simple JSON string (which is, of course, binary). Also, if a receiver does not implement the optional (!) JSON Format spec, it won't be able to parse the data_base64 value; consequently, implementing the JSON Format spec as a sender means not implementing the full CloudEvents spec.

kevinbader avatar Jan 22 '20 18:01 kevinbader

@n3wscott @clemensv any comments on this one?

duglin avatar Jan 23 '20 15:01 duglin

Trying to remember the history....

See https://github.com/cloudevents/spec/blob/master/json-format.md#31-handling-of-data

In your example: "data" : "{\"foo\": \"bar\"}" , what is data ? Is it a string that just happens to look like JSON or is it actually JSON? The content type being "app/json" doesn't help here since both treating it as a string is a valid JSON value. However, in this case, per section 3.1 it is a string that just happens to look like JSON.

I agree that when comparing the xml and json examples in the spec there appears to be a bit of an inconsistency, but (if I'm remembering correctly) we treat JSON payloads differently because they're JSON. Clearly, we can't put raw XML into JSON w/o some kind of encoding, but JSON doesn't have that problem, so it makes more sense to put the JSON into data in it's raw form.

@clemensv @cneijenhuis @n3wscott ?

duglin avatar Jan 29 '20 12:01 duglin

However, in this case, per section 3.1 it is a string that just happens to look like JSON.

Exactly, that's my point - datacontenttype is not interpreted like it is for other media types and the spec doesn't allow to transmit a data structure JSON-encoded (at least the receiver couldn't tell whether it's a JSON encoded value or simply a JSON string without trying to decode it).

I agree that when comparing the xml and json examples in the spec there appears to be a bit of an inconsistency, but (if I'm remembering correctly) we treat JSON payloads differently because they're JSON.

The event format could be XML, right? So if content-type is application/cloudevent+xml and within that event datacontenttype is set to application/json, you'd clearly need to stringify (i.e., JSON-encode) your data (and XML-escape it). So you could say the spec actually means to never encode stuff that already uses the event format as its encoding, so JSON data wouldn't be encoded if the event format is also JSON. But then it would be consistent to do the same in case of XML: if data is already XML encoded and the event format is an XML, you'd expect data to not be a string but as XML "in its raw form" instead whenever datacontenttype is application/xml. However, the spec says that data needs to be XML-encoded no matter the event format.

kevinbader avatar Jan 29 '20 15:01 kevinbader

@duglin I think, the important text in the section you referenced is

For any other type, the implementation MUST translate the data value into a JSON value, and use the member name data to store it inside the JSON object.

So, if data is already a JSON value, no translation is needed.

deissnerk avatar Jan 29 '20 15:01 deissnerk

Both cases, A):

"data" : "{\"foo\": \"bar\"}"

and, B):

"data" : {
  "foo": "bar
}

Are equivalent to the spec.

The message needs to be inspected to understand how to parse. How it becomes unmarshaled is up to the consumer of the event; they need to know if it is a string, array or a struct. In the case of B, this shortens the string escaping for the consumer.

~~Arrays are also allowed:~~

"data" : [
  {"foo": "bar}
]

Arrays are not allowed. my bad.

n3wscott avatar Jan 29 '20 15:01 n3wscott

Are equivalent to the spec.

is that true? How would a JSON Object that's just a simple string but looks like JSON be serialized then if not like this: "data" : "{\"foo\": \"bar\"}"

duglin avatar Jan 29 '20 16:01 duglin

@n3wscott In the JSON Schema data is defined as being one of two types, object or string:

"data": {
   "type": ["object", "string"]
}

but array type should not be allowed afaik.

You can however of course:

"data" : {
 "foobar": [ ... ] 
}

tsurdilo avatar Jan 29 '20 16:01 tsurdilo

Are equivalent to the spec.

is that true? How would a JSON Object that's just a simple string but looks like JSON be serialized then if not like this: "data" : "{\"foo\": \"bar\"}"

The spec says you can promote that inner json object or use it as a string. The consumer needs to inspect the string escaping to understand if it should be a string or object based on what it is trying to un-marshal into.

n3wscott avatar Jan 30 '20 16:01 n3wscott

If I see: "data" : "hello" in the event, I would hope that it would result in a single string with value of "hello" being passed on to the app. If so, I would not expect to see anything different if I replaced "hello" with something that looked like JSON, other than a new value for the string.

Where in the spec does it say you can alternate between a string or json? I didn't see it in: https://github.com/cloudevents/spec/blob/master/json-format.md#31-handling-of-data

duglin avatar Jan 30 '20 16:01 duglin

@duglin It cannot alternate. It's either-or: https://github.com/cloudevents/spec/blob/master/spec.json#L12

tsurdilo avatar Jan 30 '20 17:01 tsurdilo

IMO having both data and data_base64 is confusing. especially when you can define datacontenttype.

tsurdilo avatar Jan 30 '20 17:01 tsurdilo

Are equivalent to the spec.

is that true? How would a JSON Object that's just a simple string but looks like JSON be serialized then if not like this: "data" : "{\"foo\": \"bar\"}"

The spec says you can promote that inner json object or use it as a string. The consumer needs to inspect the string escaping to understand if it should be a string or object based on what it is trying to un-marshal into.

soooooo.... funny story... the spec changed at the last minute between 0.3 and 1.0 and it now is a must.

A):

"data" : "{\"foo\": \"bar\"}"

and, B):

"data" : {
  "foo": "bar
}

Are both valid JSON but are not the same. A has been string escaped and is a JSON string. B is a Json object.

n3wscott avatar Jan 30 '20 17:01 n3wscott

As my original post above suggests, I also have some difficulties following the JSON Format spec and would appreciate guidance (I'm currently implementing an Elixir lib for handling CloudEvents).

The core spec says that Binary = "Sequence of bytes. String encoding: Base64 encoding per RFC4648.". The JSON Format spec says that "If the implementation determines that the type of data is Binary, the value MUST be represented as a JSON string expression containing the Base64 encoded binary value, and use the member name data_base64 to store it inside the JSON object.". So how should the implementation actually figure out whether something is considered Binary? It is was binary in the first place, data would already contain the base64-encoded data and the JSON format spec then suggests that data - which is a string and thus also Binary - must be base64-encoded again and put into data_base64. I don't think this was the intention of the spec. Can you please state when base64 encoding must be used and how exactly an implementation should figure out whether something is Binary (as it's not trivial to differentiate between an utf8 string and non-readable binary data)?

kevinbader avatar Jan 31 '20 12:01 kevinbader

@kevinbader +1, I also think that if you allow "data" to be of type string, then datacontenttype+data is enough. There is no need for data_binary imo. wdyt?

tsurdilo avatar Jan 31 '20 14:01 tsurdilo

There was some discussion around this in #520. In 0.3 we still had an attribute called datacontentencoding, but we had to remove it. Instead the data_base64 field was introduced in formats where it was needed.

There is no simple way to determine from a content-type, if the data is binary. Therefore an SDK will check for some well-known text-based formats and do base64 encoding otherwise. For an example of such a check, please have a look at what @alanconway wrote here.

By using either data or data_base64, the receiving side always knows, if a base64 decode step is needed.

deissnerk avatar Jan 31 '20 15:01 deissnerk

@deissnerk thank you for the explanation! Got one more question:

CE also defines the "dataschema" property as "Identifies the schema that data adheres to." Could we then not define our schema as for example: { "type": "string", "contentEncoding": "base64", "contentMediaType": "image/png" } (https://json-schema.org/understanding-json-schema/reference/non_json_data.html) which then would clearly identify contents of data (and data_base64 would be not needed at all)?

tsurdilo avatar Jan 31 '20 16:01 tsurdilo

@tsurdilo The dataschema attribute is defined as a URI. So you can of course use it to identify a schema like the one you posted, but it won't have any effect on the way the an SDK handles the event. Do you want to send an event, that contains base64-encoded data end-to-end? That would be different from data_base64 that is only used to put binary data into the JSON format.

Btw, I think the "contentEncoding":"base64" attribute could be added to #/definitions/data_base64 in spec.json.

deissnerk avatar Feb 04 '20 14:02 deissnerk

@deissnerk added contentEncoding as part of the overall json schema update - pr https://github.com/cloudevents/spec/pull/563 Would you mind reviewing this pr? Thanks.

tsurdilo avatar Feb 25 '20 16:02 tsurdilo

What's the status of this? I think we can close it but @kevinbader do you agree?

duglin avatar Mar 12 '20 15:03 duglin

@duglin well, I still think that this part of the spec could be improved a lot - I'm still not sure how to implement this in the Elixir library for CloudEvents I was going to build. I mean, depending on whether I implement the optional JSON format spec, the parser behaviour for the data field is different in a non-compatible way. But I've laid out the details above already.

I'm aware that fixing this would mean a breaking change to the spec. But given that 1.0 is less than five months old and therefore it likely hasn't seen that much adoption yet, perhaps it would make sense to consider a v2 spec. If this seems reasonable to you I gladly offer you to create an RFC with some ideas on how this could be handled in a more consistent manner.

kevinbader avatar Mar 13 '20 15:03 kevinbader

@kevinbader are you concerned about the sender or receiver of CEs? I think it’s the receiver, if so, where is the ambiguity? If it’s JSON and binary then it uses the data_base64 attribute, otherwise it uses the data attribute and the value is pure JSON not a stringified version on JSON.

duglin avatar Mar 24 '20 02:03 duglin

We are currently looking into using CloudEvents 1.0 in our company. I agree with @duglin that the specification is clear.

Taking the example from @kevinbader:

    "datacontenttype" : "application/json",
    "data" : "{\"foo\": \"bar\"}"

This example above is IMHO invalid, since the value of the data attribute is a string "{\"foo\": \"bar\"}" and not a valid JSON expression.

We human beings can guess that the string happens to be an encoded JSON. If you look at a longer example, you can't be so sure anymore:

{ \"swagger\": \"2.0\", \"info\": { \"description\": \"Repository of Events used within Platform hosted applications\", \"version\": \"1.0\", \"title\": \"Platform Event Registry Microservice\", \"termsOfService\": \"Terms of Service\", \"contact\": { \"name\": \"Platform Core Team\", \"email\": \"[email protected]\" } }, \"host\": \"localhost\", \"basePath\": \"\/\", \"tags\": [ { \"name\": \"event-registry-controller\", \"description\": \"Event Registry Controller\" } ], \"paths\": { \"\/api\/v1\/app\/{appId}\/events\": { \"get\": { \"tags\": [ \"event-registry-controller\" ], \"summary\": \"Get events for an application\", \"description\": \"Retrieve list of events which are produced and consumed by the given application.\\n* Called with: **Platform Admin** role\", \"operationId\": \"getAllEventsByAppIdUsingGET\", \"produces\": [ \"application\/json;charset=UTF-8\" ], \"parameters\": [ { \"name\": \"appId\", \"in\": \"path\", \"description\": \"appId\", \"required\": true, \"type\": \"string\", \"format\": \"uuid\" } ], \"responses\": { \"200\": { \"description\": \"OK\", \"schema\": { \"type\": \"object\", \"properties\": { \"consume\": { \"type\": \"array\", \"items\": { \"type\": \"object\", \"properties\": { \"consumers\": { \"type\": \"array\", \"items\": { \"type\": \"object\", \"required\": [ \"appID\" ], \"properties\": { \"appID\": { \"type\": \"string\", \"format\": \"uuid\" } }, \"title\": \"ApplicationDTO\" } }, \"eventAlias\": { \"type\": \"string\" }, \"eventDescription\": { \"type\": \"string\" }, \"eventId\": { \"type\": \"string\", \"format\": \"uuid\" }, \"eventSampledata\": { \"type\": \"string\" }, \"eventSchema\": { \"type\": \"string\" }, \"eventStatus\": { \"type\": \"string\" }, \"links\": { \"type\": \"array\", \"xml\": { \"name\": \"link\", \"namespace\": \"http:\/\/www.w3.org\/2005\/Atom\", \"attribute\": false, \"wrapped\": false }, \"items\": { \"type\": \"object\", \"properties\": { \"deprecation\": { \"type\": \"string\", \"xml\": { \"name\": \"deprecation\", \"attribute\": true, \"wrapped\": false } }, \"href\": { \"type\": \"string\", \"xml\": { \"name\": \"href\", \"attribute\": true, \"wrapped\": false } }, \"hreflang\": { 

Following the specification of CloudEvents 1.0, we interpret data indicated with application/json directly as JSON without unescaping. This also makes sense on the producer side, because most of our products nowadays use JSON anyhow and we can directly paste or serialize JSON into the data node.

tweing avatar Mar 26 '20 07:03 tweing

@Thoemmeli I agree to your point, but "{\"foo\": \"bar\"}" is a string and therefore also a JSON value. In that sense the example is valid, but the datacontenttype in this case refers to the string and not to the escaped JSON object.

deissnerk avatar Apr 01 '20 15:04 deissnerk

@kevinbader thoughts?

duglin avatar Apr 01 '20 17:04 duglin

@duglin well, I've implemented this according to the spec and the replies here:

https://github.com/kevinbader/cloudevents-ex/blob/master/test/format/v_1_0/decoder/json_test.exs

Not particularly happy with it but I guess that's how it is.

kevinbader avatar Apr 01 '20 20:04 kevinbader

@duglin @deissnerk This is coming up in my work on the Ruby SDK, and I want to bring up a clarification question.

To summarize a conclusion from above:

In the following CE:

const ce1 = new CloudEvent({
  specversion: "1.0",
  id: "C234-1234-1234",
  source: "/mycontext",
  type: "com.example.someevent",
  datacontenttype: "application/json",
  data: "{\"foo\": \"bar\"}"
});

... it sounds like the data should be considered a JSON value of type string. The fact that the string's value happens to look like serialized JSON is irrelevant. It is simply a string. Therefore, if we were to serialize this CE in HTTP Binary mode, it might look like this:

CE-SpecVersion 1.0
CE-Type: com.example.someevent
CE-Source: /mycontext
CE-ID: C234-1234-1234
Content-Type: application/json

"{\"foo\" : \"bar\"}"

The data must be "escaped" in this way, so that a receiver parsing this content with the application/json content type will end up with a JSON string and not an object.

As a corollary, when deserializing an HTTP Binary mode CE with Content-Type: application/json, the HTTP protocol handler must parse the JSON and set the data attribute in memory to the actual JSON value (rather than the string representation of the JSON document). Otherwise, the content's semantics will change when the CE gets re-serialized. And this, of course, all implies that an SDK's HTTP protocol handler (and perhaps other protocol handlers as well) must understand JSON, even if the JSON structured format is not in use.

Taking that as given, consider this implication:

Earlier a comparison was made with application/xml, noting a possible inconsistency. Consider this parallel example:

const ce2 = new CloudEvent({
  specversion: "1.0",
  id: "C234-1234-1234",
  source: "/mycontext",
  type: "com.example.someevent",
  datacontenttype: "application/xml",
  data: "<much wow=\"xml\"/>"
});

If we were to treat this XML data consistently with how we treated the earlier JSON data, we would consider this data as a string node in an XML document, whose contents just happen to look like XML. Hence, serializing this as HTML-Binary might yield something like:

CE-SpecVersion 1.0
CE-Type: com.example.someevent
CE-Source: /mycontext
CE-ID: C234-1234-1234
Content-Type: application/xml

&lt;much wow="xml"/&gt;

However, my understanding of the spec, and my understanding of the current behavior of the SDKs, suggests we are not doing that. (And indeed I'm glad, because that would, in turn, imply that all protocol handlers would also need to understand XML.) Instead, we actually consider the above data as semantically an XML document and not a string. Hence, serializing this as HTML-Binary actually looks like:

CE-SpecVersion 1.0
CE-Type: com.example.someevent
CE-Source: /mycontext
CE-ID: C234-1234-1234
Content-Type: application/xml

<much wow="xml"/>

In other words, our handling of the XML content-type appears to be inconsistent with our handling of the JSON content-type.

So my clarification question is:

  1. Am I correct in my interpretation that the spec intentionally treats data with content-type application/json specially, differently from string data with content-type application/xml (or indeed any other content-type), as illustrated above?

If so, follow-up questions:

  1. Is the reason for this that we (for some reason) consider JSON uniquely special among all content types in the universe, or is the reason simply that the spec currently happens to include a JSON format but not an XML format to define how data with that datacontenttype is rendered? Suppose a future spec version adds an XML format, YAML format, Protobuf format, etc. Would we at that time need to change the behavior of those formats to be like JSON (which would be a breaking change)?
  2. How do we precisely identify which content types are to be treated in this special way? For example, application/json is obvious, but what if the datacontenttype is itself application/cloudevents+json (i.e. a cloudevent whose payload is another cloudevent)? If we do consider JSON special, it seems it might be a good idea for the spec to state that explicitly, and define how it is identified, perhaps with reference to fields in RFC 2046 or similar.

dazuma avatar Jun 24 '21 23:06 dazuma

I don't see any special handling of the JSON format here, @dazuma. I think the nesting of quotes combined with the different representations of the event in Javascript (not JSON) and HTTP binary lead to some confusion. In your example ce1 you use Javascript:

const ce1 = new CloudEvent({
  specversion: "1.0",
  id: "C234-1234-1234",
  source: "/mycontext",
  type: "com.example.someevent",
  datacontenttype: "application/json",
  data: "{\"foo\": \"bar\"}"
});

This statement gives you an object ce1with, among others, a string member data. It depends now on the SDK, what this means. In the go SDK the Event has an attribute DataEncoded that holds the raw byte sequence of the payload. Only if you call DataAs, the SDK attempts to parse DataEncoded according to the datacontenttype. How would your example look like in JSON?

Like this?

{
...
   "data" : "{\"foo\": \"bar\"}",
...
}

The go SDK would store the serialized JSON value of data in a byte sequence, where the first and the last byte would contain a ". This is a JSON specific implementation, but it is part of the unmarshaling of the JSON format.

As, in Javascript, you don't have byte sequences, you would perhaps instead store the data as string and base64 encode it depending on the datacontenttype. If you now express this string as a string literal in Javascript, it looks like this:

dataEncoded : "\"{\"foo\": \"bar\"}\""

or maybe a bit more readable:

dataEncoded : '"{\"foo\": \"bar\"}"'

If your data attribute is decoded, this means that, when serializing it again, it would have to be encoded again according to datacontenttype. This holds true for JSON, XML and every other encoding. In the go SDK there are encoders and decoders for XML, JSON and text.

deissnerk avatar Jun 25 '21 18:06 deissnerk

@deissnerk I think we are in agreement here on the interpretation of this example that I labeled ce1. The type of the data attribute is string, and so when datacontenttype is set to application/json, the value of dataEncoded must, as you point out, begin and end with " to indicate that a string is being encoded.

dataEncoded : '"{\"foo\": \"bar\"}"'

So far I think we agree. My question is actually about the ce2 example. In JSON format, this looks like:

{
...
   "datacontenttype" : "application/xml",
   "data" : "<much wow=\"xml\"/>",
...
}

I constructed ce2 to be a parallel to ce1. In ce1, the content type is application/json; in ce2, it is application/xml. In ce1, the data attribute is a string which looks like an encoded JSON document; in ce2, the data attribute is a string which looks like an encoded XML document. In all aspects, ce1 and ce2 are in parallel; the one case using JSON and the other using XML, but in both cases, the type of data is string.

For ce2, what is the value of dataEncoded?

Is it: dataEncoded : '<much wow="xml"/>' (which would decode to an XML document)

Or is it: dataEncoded : '&lt;much wow="xml"/&gt;' (which would decode to a string)

dazuma avatar Jun 27 '21 07:06 dazuma

Having the same question as @dazuma:

At IANA, there are about 114 application/.*+json media types registered (note: application/cloudevents+json is not even there). Are these JSON documents or should we treat them like the XML examples above?

athalhammer avatar Jun 28 '21 13:06 athalhammer

@dazuma IMHO it should be: dataEncoded : '<much wow="xml"/>'

@athalhammer Currently the go SDK only recognizes application/json and text/json into account., but it's an interesting question. @n3wscott @slinkydeveloper What do you think?

For the pending IANA registration we already have #557

deissnerk avatar Jun 28 '21 16:06 deissnerk