common icon indicating copy to clipboard operation
common copied to clipboard

Defaults in Messages JSON and Messages objects

Open gasparnagy opened this issue 3 years ago • 8 comments

I want to share my thoughts on this topic, as a kind of discussion forum here.

Defaults in Messages JSON

As I worked a bit with the new messages structure, I would like to bring up (again) the problem of defaults in the messages protocol. Currently I have found 3 cases that were not fully consistent with my understanding:

  • Scenario[/Background/...].description
  • Scenario[/Background/...].name
  • TableCell.value

These are currently set as required in the JSON schema, although they are optional in Gherkin and in many cases they are usually not provided (e.g. I have never seen anyone adding a name or description to a Background, but also Scenario description is rarely used). Forcing ourselves to include a dummy empty setting for them in the messages JSON files seems to be unnecessary for me.

As far as I have understood @aslakhellesoy, the original intention was to populate these properties for dynamic languages, but with TS this does not seem to be necessary anyway.

A similar question could be raised for the empty arrays, but those are more correct semantically.

Location.column is currently set as optional (inconsistency with the others)?

Also I feel that if we make "set everything as required" as a rule for the messages JSON, then if later we introduce a new (optional) setting (and make it required by following our rule) then the old JSON message files will became invalid according to the schema, so it will be harder to introduce non-breaking changes.

So my questions are:

  • Shall we make these mentioned string fields optional in the JSON schema?
  • Shall we make the arrays optional in the JSON schema?

Defaults in Messages objects

If we would make these settings as optional in the JSON schema, we could still think about the default value that we use in the populated message objects if the particular field was not specified. As far as I know JSON schema does not provide an option to explicitly specify the default value, so this can only be an agreement that the different implementations follow.

I think here there are 3 questions:

  • What should we use as default value for not-provided optional strings (empty string or null)?
  • What should we use as default value for nor-provided optional arrays (empty array or null)?
  • What should we use as default value for nor-provided optional numbers (0 or null -- for null we need to represent it as nullable number in the objects for static type languages)?

What do you think?

gasparnagy avatar May 06 '21 08:05 gasparnagy

Good question. So my thoughts / observations.

100% agree Description on Background - I've never once seen it in 5 years Scenario description "can" be used. And I've seen it occassionally, but then it's not always picked up nicely by html formatters e.t.c. (Not tested on new one).

IMO almost any string field in Gherkin should be optional, save for things which come after Gherkin keywords. Answers for default

  • Empty string
  • Empty array
  • .... Tricky one. I would probably lean more towards null but I'm not the best person to ask about how much extra work this entails. Ruby (primary language), is quite flexible in this regards. I appreciate static ones might need a lot more skeleton work for this.

luke-hill avatar May 06 '21 08:05 luke-hill

Broadly agree with @gasparnagy regarding defaults.

Specifically my feeling would be:

  • string - don't default to empty string
  • array - don't feel strongly either way

Location.column is currently set as optional (inconsistency with the others)?

I think there was a functional reason for this that @aslakhellesoy mentioned the other week but I don't remember what?

With descriptions in Scenario, Background et al, I think at runtime for building reports we would be checking whether there is a non-blank string (i.e. has content other than whitespace) so I don't think we'd be "saving" a check there by defaulting to an empty string.

TypeScript-specific aside: if we were to drop the idea of defaults on deserialisation, the TypeScript classes generated from the JSON schema could just become interfaces, and we'd no longer have the weight of those classes in the compiled JavaScript (though this saving is probably an order of magnitude less than the one from losing the protobufjs runtime).

davidjgoss avatar May 06 '21 14:05 davidjgoss

Personally I have an aversion to nulls in general: they breed if statements, which in turn breed bugs. But in a protocol like this there will be some places they're appropriate - the backwards compatibility case you've mentioned @gasparnagy being one of those.

What's important to me is the semantics of the protocol - an empty string or array and a null mean different things, and we should be clear about which meaning we want to give. So I think we might need to look at each particular field in the messages where we might need to handle nulls, not just string/array/numbers in general.

From the semantics perspective. I don't see the problem with representing "the user did not specify a description for this scenario" with an empty string. It's not like the user could tell us, in their Gherkin document, I want to have a description for this scenario, and I want it to be blank (i.e. empty string) versus I don't want this scenario to have a description (i.e. null). Either they give us a description or they don't.

Similarly empty arrays seem like a fine way to represent nothingness in all the cases I can think of - e.g. the tags for a scenario.

Numbers, I think are different. Zero is a fairly arbitrary default value, and implies more knoweledge of the answer than might be appropriate. A null value for location.column says "I don't know (or cannot say for this language) what column the thing is located at" wheras a zero imples "I know where it's located, it's located in the first column".

Them's my thoughts, does that help?

mattwynne avatar May 06 '21 19:05 mattwynne

It looks to me that JSON Schema supports default values. See https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.section.9.2

aslakhellesoy avatar May 09 '21 20:05 aslakhellesoy

I think this is easier to reason about if we think about the various roles our components play:

  • Message producers
    • Gherkin Parser (produces source and gherkinDocument messages)
    • Pickle Compiler (produces pickle messages)
    • Cucumber (produces all the other message types)
  • In-Process message consumers
    • Pickle Compiler (consumes gherkinDocument messages from the Gherkin Parser)
    • Cucumber (consumes pickle messages)
  • Out of process message consumers
    • Cucumber-React (JavaScript)
    • Standalone JSON Formatter (Go)
    • Future components yet to be written

If we only had in-process message consumers, we wouldn't need to worry about required fields and default values. There wouldn't be any JSON serialisation or deserialisation. If the producer didn't provide a value for a property, it would either be caught by the compiler (TypeScript), or if the language doesn't prevent assigning null values to fields (Java, not sure about C#) it would likely be caught by a null reference in a test.

It's primarily the out-of process message consumers we need to worry about. They should assume that the messages they consume contain garbage.

Default values

We can reduce the amount of garbage with default values. When we parse a JSON string into an object, any property missing from the JSON would get the default value from the JSON Schema.

This is similar to how the protobuf deserialisation works. After deserialisation, any missing property gets the "null" value by default ([] for arrays, 0 for numbers and "" for strings). The main difference is that the default value is not explicitly declared in the schema, but follows a convention defined by the protobuf standard.

We have a choice how we want to deal with default values. We can either mimic the protobuf behaviour (i.e. conventional default values), or we can rely on the default value from the schema.

I prefer the explicit style since it doesn't require programmers to know about any conventions. It also gives us more fine-grained control at the cost of being a little more verbose in the schema.

Required properties

Declaring a property as required in the schema offers no guarantees to the consumer that the property will have a value in the JSON. It could however give the consumer the confidence that the deserialised object has a value for all required properties.

If we want to avoid consumers littered with conditionals checking if a required property actually has a value after deserialising JSON, we can:

  1. Provide a default value for required properties that were absent from the JSON.
  2. Throw an error if a required property was absent and the schema didn't specify a default.

With this in place, consumers can safely assume that required properties will always have a value after serialisation.

This means that the deserialisers need to know about the schema.

What does this mean in practice?

I think we should go over the schemas and apply some rules to decide whether a particular property should be required, have a default or both. Here is my proposal:

required, no default

Any property where the absence of a value would mean the consumer cannot do the job properly. Examples are id, timestamp and duration properties.

This cannot be used for properties added in future versions.

required, with default

Any property that's optional where we want to avoid a null check in the consumer. I think this goes for most of the properties, for the reasons @mattwynne has described. The default value declared in the schema would typically be [], 0 or "" (protobuf style).

Examples are name, title and description

not required, no default

I think we should avoid this since it requires consumer conditionals.

not required, with default

This makes no sense and should be avoided.

Future properties

Whenever we add new properties to the schema, we have to make sure new consumers (with the new property) can consume messages from old producers.

We have the following options:

  • With default - no need for consumer conditionals
  • Without default - needs consumer conditionals

I think we should evaluate on a case by case basis which option is best

aslakhellesoy avatar May 10 '21 12:05 aslakhellesoy

@aslakhellesoy Thank you for the summary.

I generally agree with you, except one thing. You said, we anyway cannot "guarantee" that producers will include the required properties. I think this is not that simple. The schema is part of the specification. If we have a schema, that declares a property as "required" that means that by the specification of the cucumber messages format, the users MUST provide that value. Producers can decide to not provide the value of course, but that means their input will not conform the expectations for the message, so the result is somewhat unpredictable. (Showing a nice error message is of course nice.) If we publish the JSON schema than both the producers and the consumers can use it to (automatically) verify if the message fulfills that. This is standard for XML message validation, probably there are libraries in the different platforms where such check can be easily included in the deserialization process (ie. check if the JSON message conforms the schema and only start the deserialization if it does).

Based on that, I think we should declare in the schema what we expect and not what we might get. I.e. if a property is not required (e.g. description), than we should not mark it as "required" in the JSON schema, because with that we confuse the ones who want to use the schema as a contract (i.e. the message would fail the schema validation although the implementations would accept it). Because of that, I would NOT have anything as required with default but encourage consumers to reject any message that does not provide these values. So for name, description -> not required with default and do not use category required with default

I agree to the explicit mention of the default values (wherever possible), and I think we have a consensus on using empty string for string properties and empty array for lists as default. The only one I puzzle with is the location.column, where 0 is misleading (like @luke-hill and @mattwynne also mentioned). Technically 0 would not cause a trouble, because line position is "traditionally" specified as a 1-based index, so there is not 0 column, but it is misleading. So my vote here is to use null, but as this is a minor detail anyway, I am also fine with 0.

gasparnagy avatar May 10 '21 12:05 gasparnagy

@gasparnagy having played around a bit recently with this. Do we think this is still an issue, and if so should we move it to the messages repo?

I noticed things like location often are half filled (So not all the props are passed in), and things seem to be working well.

Do we think we need additional work here>?

luke-hill avatar Jan 08 '24 13:01 luke-hill

I haven't checked the JSON schema recently, so I don't know. All I wanted is that if we make a specification about the messages (like a JSON schema) that should be valid for all the JSON messages we send across. So if we don't typically send e.g. description than that field should not be marked as required. Aslak and Matt sees the JSON schema more as a conceptual description of the message structure, primarily for code, where the "required" makes sense. I see that more as a specification of the pure JSON data we send and hence the disagreement. But it's not super important for me, so we can leave it as it is and close this issue. I'm not aware of any tools that would actually do schema validation, so there is no concrete error anyway.

gasparnagy avatar Jan 08 '24 15:01 gasparnagy