spec icon indicating copy to clipboard operation
spec copied to clipboard

XML format: do we have a whitespace policy?

Open jskeet opened this issue 2 years ago • 2 comments

For example, assuming appropriate namespaces, should we automatically preserve leading and trailing whitespace on context attributes?

<event>
  <!-- All the standard attributes etc -->
  <customattribute1 xsi:type="ce:string">    middle     </customattribute1>
  <customattribute2 xsi:type="ce:string">
    middle
  </customattribute2>
</event>

How does this apply to text and base64 data elements?

(Doug: please assign to me or Jem as a starting point.)

jskeet avatar Apr 29 '22 07:04 jskeet

If we try to compare this to json then it would be similar to:

"customattribute1": "    middle     "

and I suspect there's no doubt as to whether the whitespace is part of the desired value or not. While it might annoy those who like to pretty-print things and assume spacing is ignored, from a spec perspective it seems like we have no choice but to take a stand of "we must not lose data" and "we can not assume which data doesn't matter to the user". However, that's not to suggest that we couldn't add additional statements around the processing of certain attributes. For example, I wonder if it would be ok to:

  • say: unless otherwise stated, all xml nodes under a CE attribute are part of its contents and MUST be preserved
  • and then for appropriate attributes (e.g. timestamp) say leading and trailing whitespace MUST be ignored for parsing purposes

and perhaps we even drop the 2nd bullet for spec defined ones because if we don't think the "out" applies to any of our attributes, but leave it as an option for extensions.

duglin avatar Apr 29 '22 12:04 duglin

I think we should be able to drop comment nodes:

<id><!-- this is autogenerated -->012345</id>

should be fine and just assign an ID of "012345" for example. Comments are not part of the context attribute data model, so I wouldn't expect them to be magically preserved here. But yes, I think whitespace should be - and I would suggest we don't trim timestamp attributes etc either... we just say that whitespace must be preserved, and whatever comes out of that is what happens. I'll try to find sometime to write tests for this in the next few weeks...

jskeet avatar Apr 29 '22 13:04 jskeet

where did we land on this one? I seem to recall that we did chat about it on a call but I can't remember who took an action from it... or if there was one. I suspect @jskeet may own it though :-)

duglin avatar Feb 16 '23 16:02 duglin

Yes, I own it, and no, I haven't done anything yet (beyond validating that we do handle this in HTTP headers - a string attribute could definitely have spaces at the start and end, and they're escaped).

Every time I try to come up with something pragmatic, it feels wrong in the end. But it's still mine to progress unless anyone else wants to take it, basically.

jskeet avatar Feb 16 '23 16:02 jskeet

This issue is stale because it has been open for 30 days with no activity. Mark as fresh by updating e.g., adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Mar 19 '23 01:03 github-actions[bot]

/remove-lifecycle stale

jskeet avatar Mar 19 '23 06:03 jskeet

Okay, I've been digging into significant vs insignificant whitespace a bit more, and I'd misunderstood XML fundamentals a little. In the examples given above, all the whitespace is naturally significant - it's only in a case where there's just whitespace between elements that it's reasonable for it to be removed.

So, the question I'd like to discuss in the meeting is how we should handle different attributes. I can see three cases for each element value:

  • Case 1: Single line, no leading/trailing whitespace
  • Case 2: Single line, leading and/or trailing whitespace
  • Case 3: A single non-blank line, but with leading and trailing whitespace-only lines or line breaks
  • Case 4: Multiple non-blank lines

Example (non-CE specific, just to give context):

<root>
  <case1>Simple</case1>
  <case2>     Leading and trailing    </case2>
  <case3>
    Single text line with leading and trailing line breaks
  </case3>
  <case4>
    Line 1
    Line 2
  </case4>
</root>

For each CE attribute (or at least each CE attribute type) which would work out how we want to handle each case - both in terms of "theoretical" validity, and how strict we think SDKs are likely to be in reality.

Proposal 1: don't even think about it in the SDKs, and make the spec represent what that means

Just parse each value in the simplest way, without any trimming:

  • String attributes will be valid for cases 1 and 2, but invalid in cases 3 and 4 (as string attributes cannot include line breaks) - some SDKs may not spot this, at a guess
  • Integer, Binary, Timestamp and Boolean attributes will be valid for case 1, but invalid for cases 2-4 assuming stringent parsers
  • URI and URI-Reference attributes may be valid for cases 1-3 - it depends on the URI parser's behavior. (The .NET URI parser allows it.) Case 4 would be invalid, and I'd expect most URI parsers to naturally fail on it.

Proposal 2: Explicitly prohibit line breaks

Basically this is like proposal 1, but would also make case 3 invalid for URI attributes.

Proposal 3: trim intelligently

For attribute types where leading and trailing whitespace can never be useful - i.e. everything except string attributes - we could say that cases 2 and 3 are equivalent to case 1. (In spec terms we'd say that leading and trailing whitespace is ignored; in SDKs we'd trim before parsing the value.) Case 4 would be deemed invalid. (We could say that line breaks which aren't part of leading/trailing whitespace are prohibited.)

For string attributes it makes sense for case 2 to preserve the leading/trailing whitespace - but for case 3 we'd end up trimming all leading/trailing whitespace. (In other words, if you want leading/trailing whitespace, you have to represent it in a single line.) Case 4 would still be invalid.

Any other proposals?

Happy to edit in anything that comes up in discussion.

Jon's take

Anything claiming "intelligent" handling makes me nervous. I'd therefore prefer not to try to implement proposal 3.

Proposal 1 is nice and simple, but it seems weird that URI attributes might be valid in multi-line elements, but nothing else.

Proposal 2 seems suitably well-defined and easy to implement, and gets my vote.

jskeet avatar Mar 21 '23 09:03 jskeet

@JemDay Would love to get your take on this :)

jskeet avatar Mar 21 '23 09:03 jskeet

I'd lean toward option #1...

  • I prefer a "say what you see" approach as opposed to putting 'intelligence' into an SDK and have it potentially mis-interpret something.
  • Keeping it simple means we're not having to treat 'textual data' any differently from context attribute values.
  • As an aside I imagine this could equally apply to JSON Format.

JemDay avatar Mar 21 '23 15:03 JemDay

As an aside I imagine this could equally apply to JSON Format.

Somewhat less so, as JSON can't contain line breaks. (Obviously JSON string values can contain escaped line breaks, but that's slightly different, I'd argue.) But yes, in terms of whether "someuri": " https://hello " is valid or not...

jskeet avatar Mar 21 '23 16:03 jskeet

There may be a bigger question as to how 'strict' the SDKs are meant to be ... I'd assume that they shouldn't allow the production of a context-attribute value containing characters outside the supported charset; but should they reject, ignore, or accept values outside that defined range ?

JemDay avatar Mar 21 '23 16:03 JemDay

should they reject, ignore, or accept values outside that defined range ?

Personally I've always been in favour of being strict. If one SDK allows an invalid value, it's likely to then propagate the value to others, etc - leading SDKs which are strict to be "blamed" for not accepting bad events passed on.

Basically I fundamentally disagree with Postel's law: if everything is strict with both what they produce and consume, we never get into "well it works with my parser, therefore yours must be wrong" territory.

jskeet avatar Mar 21 '23 16:03 jskeet

As an aside I imagine this could equally apply to JSON Format.

Somewhat less so, as JSON can't contain line breaks. (Obviously JSON string values can contain escaped line breaks, but that's slightly different, I'd argue.) But yes, in terms of whether "someuri": " https://hello " is valid or not...

True .. But just because you can escape a CR in JSON text doesn't mean it's valid content from a CE point-of-view - which I don't believe it is.

At least from an XML (and proto) perspective we can tell the expected type of the value being exchanged so technically those malformed URIs could be rejected .. I don't think you can do that with the JSON format as the type information isn't present.

JemDay avatar Mar 21 '23 16:03 JemDay

True .. But just because you can escape a CR in JSON text doesn't mean it's valid content from a CE point-of-view - which I don't believe it is.

Indeed it isn't, and I'd hope that SDKs would prevent that.

At least from an XML (and proto) perspective we can tell the expected type of the value being exchanged so technically those malformed URIs could be rejected .. I don't think you can do that with the JSON format as the type information isn't present.

Correct, unless you have extra context.

It's not even clear to me whether URIs with leading/trailing whitespace are actually malformed. I'd need to check the RFCs very carefully... URIs can be surprisingly lenient IME.

jskeet avatar Mar 21 '23 16:03 jskeet

as string attributes cannot include line breaks

I'm not sure I agree with this for XML (cases 3 & 4). Why wouldn't someone expect the line breaks to be preserved? What if they were in the middle of the text? I think generating an error on cases 3&4 will really annoy people.

I like: don't trim at all, but xml-comments can be removed when part of the attribute contents (except data of course).

duglin avatar Mar 23 '23 15:03 duglin

Why wouldn't someone expect the line breaks to be preserved?

Because they're not valid in CE string attributes:

Sequence of allowable Unicode characters. The following characters are disallowed: the "control characters" in the ranges U+0000-U+001F and U+007F-U+009F (both ranges inclusive), since most have no agreed-on meaning, and some, such as U+000A (newline), are not usable in contexts such as HTTP headers.

jskeet avatar Mar 23 '23 15:03 jskeet

whoa totally forgot about that - what a bummer

duglin avatar Mar 23 '23 15:03 duglin

I think a statement on expectations for SDK "strictness" might be warranted :-(

JemDay avatar Mar 23 '23 15:03 JemDay

We didn't note it at the time, but my recollection of the meeting is that we decided to go with proposal 2. I'm writing a PR for that.

jskeet avatar Mar 24 '23 14:03 jskeet