quick-xml

Recognize and process some special XML attributes

Open Mingun opened this issue 1 year ago • 9 comments

All names starting with xml (case-insensitive) are reserved by the XML standard, and some of them have special meaning. quick-xml could process some of them:

  • xml:lang -- https://www.w3.org/TR/xml11/#sec-lang-tag (meta information about natural language of texts, stacked like namespace definitions)
  • xml:space -- https://www.w3.org/TR/xml11/#sec-white-space (related: #285)
  • xsi:nil -- map <element xsi:nil="true"/> to None if deserialized to Option
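As a rough illustration of what "stacked like namespace definitions" could mean for xml:lang, here is a minimal sketch using only the standard library. `LangStack` and its methods are hypothetical names made up for this example and are not part of quick-xml; the caller is assumed to feed it element start/end events from whatever parser is in use.

```rust
/// Hypothetical helper (not a quick-xml type): tracks the effective
/// `xml:lang` value while walking a document.
#[derive(Debug, Default)]
struct LangStack {
    // One entry per open element: Some(lang) if the element declared
    // `xml:lang` itself, None if it inherits from its ancestors.
    frames: Vec<Option<String>>,
}

impl LangStack {
    /// Call when an element opens, passing its own `xml:lang` value, if any.
    fn push(&mut self, declared: Option<&str>) {
        self.frames.push(declared.map(str::to_owned));
    }

    /// Call when the element closes.
    fn pop(&mut self) {
        self.frames.pop();
    }

    /// Language in effect: the innermost declaration wins.
    /// An empty value (`xml:lang=""`) cancels an inherited language.
    fn current(&self) -> Option<&str> {
        self.frames
            .iter()
            .rev()
            .find_map(|frame| frame.as_deref())
            .filter(|lang| !lang.is_empty())
    }
}
```

An element without its own xml:lang pushes `None`, so lookups fall through to the nearest ancestor that declared one, which mirrors how namespace bindings are scoped.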

Mingun, Aug 25 '22 11:08

It's only the concrete prefixes xml and xmlns that are globally defined: https://www.w3.org/TR/xml-names/#xmlReserved

funkyfuture, Sep 08 '22 18:09

The NamespaceResolver doesn't seem to have any entries by default. This seems incorrect by my reading of https://www.w3.org/TR/xml-names11/#xmlReserved. Shouldn't the xml namespace be definitionally mapped to http://www.w3.org/XML/1998/namespace?

It also looks like the xml namespace should not be overridable.

wt, Jan 24 '23 23:01

FWIW, xmlns also has a similar definitional mapping. Given how xmlns is handled now in quick-xml, it probably doesn't need to be handled the same way. Still, I am wondering whether handling both the same way would make it more consistent to deal with reserved namespaces like these.

wt, Jan 25 '23 00:01

> Shouldn't the xml namespace be definitionally mapped to http://www.w3.org/XML/1998/namespace?

Yes, it should.

> It also looks like the xml namespace should not be overrideable.

It seems that overriding it is technically possible, but such an XML document is incorrect (it still seems to be well-formed and valid, but it is not namespace-well-formed, because the override violates the namespace constraints).

Mingun, Jan 25 '23 06:01

Well-formed docs have to have a root element conforming to the description at https://www.w3.org/TR/xml11/#NT-element. This section says the following:

> This specification does not constrain the application semantics, use, or (beyond syntax) names of the element types and attributes, except that names beginning with a match to (('X'|'x')('M'|'m')('L'|'l')) are reserved for standardization in this or future versions of this specification.

This seems to indicate that attr names starting with xml are reserved. AFAICT, this includes the namespace part of the attr name. Would this mean that having an attr like xmlfjdkslafjdsl=3 makes the document not well-formed? Given that the doc has to be well-formed to be valid, wouldn't that also make the doc invalid?
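For reference, the quoted pattern (('X'|'x')('M'|'m')('L'|'l')) amounts to a case-insensitive prefix test. A minimal sketch follows; `is_reserved_name` is a hypothetical helper written for this comment, not a quick-xml function, and it only detects reserved names without deciding whether they affect well-formedness.

```rust
/// Hypothetical helper: returns true if `name` begins with "xml"
/// in any combination of letter case, as in the quoted rule.
fn is_reserved_name(name: &str) -> bool {
    name.as_bytes()
        // `get(..3)` yields None for names shorter than three bytes.
        .get(..3)
        .map_or(false, |prefix| prefix.eq_ignore_ascii_case(b"xml"))
}
```

Because the check runs on the raw name, it matches both an unprefixed attribute such as xmlfjdkslafjdsl and a qualified name whose prefix starts with xml, such as xml:lang.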

Either way, I think this can be broken into two pieces.

  1. Initialize the namespace resolver so that xml is already resolvable.
  2. If overriding the xml namespace should be blocked, do that also.

I think that we have agreement that 1 is worth doing now. I have an idea for it and will open a PR, which I will link to this issue.

For 2, my opinion is that we should probably not allow overriding the xml namespace by default, but we could have a flag that allows it. What do you think of that?
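To make the two pieces concrete, here is a minimal sketch under several assumptions: `Resolver`, `BindError`, and the `allow_reserved_override` flag are hypothetical names for illustration only and do not reflect quick-xml's actual NamespaceResolver or the eventual PR. Scoping is ignored (one flat map) to keep the example short. The checks follow the reserved-prefix rules in the Namespaces in XML spec linked above.

```rust
use std::collections::HashMap;

const XML_NS: &str = "http://www.w3.org/XML/1998/namespace";
const XMLNS_NS: &str = "http://www.w3.org/2000/xmlns/";

#[derive(Debug, PartialEq)]
enum BindError {
    /// `xml` was bound to something other than XML_NS.
    XmlPrefixRebound,
    /// `xmlns` must never be declared.
    XmlnsDeclared,
    /// Another prefix was bound to a reserved namespace name.
    ReservedNamespace,
}

/// Hypothetical resolver, NOT quick-xml's NamespaceResolver.
struct Resolver {
    bindings: HashMap<String, String>,
    // Assumed opt-out flag: when true, the reserved-prefix checks are skipped.
    allow_reserved_override: bool,
}

impl Resolver {
    /// Piece 1: `xml` and `xmlns` are resolvable from the start.
    fn new(allow_reserved_override: bool) -> Self {
        let mut bindings = HashMap::new();
        bindings.insert("xml".to_owned(), XML_NS.to_owned());
        bindings.insert("xmlns".to_owned(), XMLNS_NS.to_owned());
        Self { bindings, allow_reserved_override }
    }

    /// Piece 2: reject declarations that violate the reserved bindings.
    fn bind(&mut self, prefix: &str, uri: &str) -> Result<(), BindError> {
        if !self.allow_reserved_override {
            match prefix {
                "xml" if uri != XML_NS => return Err(BindError::XmlPrefixRebound),
                "xmlns" => return Err(BindError::XmlnsDeclared),
                // Redeclaring `xml` with its own namespace name is allowed.
                "xml" => {}
                _ if uri == XML_NS || uri == XMLNS_NS => {
                    return Err(BindError::ReservedNamespace)
                }
                _ => {}
            }
        }
        self.bindings.insert(prefix.to_owned(), uri.to_owned());
        Ok(())
    }

    fn resolve(&self, prefix: &str) -> Option<&str> {
        self.bindings.get(prefix).map(String::as_str)
    }
}
```

With `Resolver::new(false)`, binding xml to any other URI returns `Err(BindError::XmlPrefixRebound)`, while `Resolver::new(true)` accepts it, which is roughly the default-strict behaviour with an escape hatch proposed for 2.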

wt, Jan 25 '23 06:01

  1. Agreed, please make a PR
  2. I think we should check other popular XML libraries and handle this in a similar way

Mingun, Jan 25 '23 07:01

I linked a PR to resolve the reserved namespaces.

wt, Jan 25 '23 08:01

Would this include things like xsi:type as well?

To include it as an attribute in an element, I used #[serde(rename = "@xsi:type")], which causes it to serialize correctly, e.g. <MyElement xsi:type="MyType">

However, deserialization of the same blob fails with:

"missing field `@xsi:type`"

rrichardson, Feb 15 '23 22:02

Looking more closely at the issue and the PR, I think my issue is unrelated. I'll add a ticket.

rrichardson, Feb 15 '23 22:02