Consider a recommendation for how to resolve ambiguity converting a Structure Tree to XML
This issue is a recommendation that we formalise, or at least document, an approach taken by the LaTeX team when converting between MathML in its XML form and MathML as it's stored in the Structure Tree.
Background
XML has a concept of "attributes with a namespace" and "attributes with no namespace" - not a default namespace, literally "no namespace".
Examples of "attributes with no namespace"
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" display="block">
Example of "attributes with a namespace" - requires a prefix on the attribute.
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" m:display="block">
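To make the distinction concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`, which reports attribute names in `{uri}local` form when they are in a namespace and as a bare local name when they are in no namespace. The three samples are the examples above, self-closed so that they parse; the helper name `attribute_names` is just for illustration.

```python
import xml.etree.ElementTree as ET

MATHML = "http://www.w3.org/1998/Math/MathML"

# The three examples above, self-closed so each is a well-formed document.
samples = [
    f'<math xmlns="{MATHML}" display="block"/>',
    f'<m:math xmlns:m="{MATHML}" display="block"/>',
    f'<m:math xmlns:m="{MATHML}" m:display="block"/>',
]


def attribute_names(xml_text):
    """Return the root element's attribute names as ElementTree reports them."""
    return sorted(ET.fromstring(xml_text).attrib)


names = [attribute_names(s) for s in samples]
for sample, found in zip(samples, names):
    print(sample, "->", found)
```

In the first two samples the attribute name is plain `display`, even though the element itself is in the MathML namespace in both; only the prefixed `m:display` is reported with the namespace URI attached.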
PDF 2.0 does not have this concept; every attribute is assigned to a namespace owner.
In https://github.com/pdf-association/pdf-issues/issues/286 this was discussed and it was agreed that PDF namespaces and XML namespaces are different - we added this note:
NOTE The attribute owner, defined through the O and NS entries in the attribute object, define an owner for each attribute, but do not provide information on transformation of those attributes into other formats. When considering formats such as HTML and MathML, attributes would be transformed to meet the syntactic requirements of those formats.
Proposal
I'm essentially suggesting that we revisit the resolution from https://github.com/pdf-association/pdf-issues/issues/286, and extend that note, or add another.
Conversion between XML and the PDF Structure Tree happens. The note is telling people not to make any assumptions about how to process namespaces when doing this - that's OK, but is still avoiding the problem of what people are expected to do when converting between these two formats. It's not specifically a MathML issue, but it's the most obvious namespace where this applies.
Fortunately @davidcarlisle proposed what I think is a rather elegant solution in the 2025-03-06 PDF/UA TWG meeting, specifically:
if a PDF attribute has the same namespace object as the element it's attached to, treat the attribute as having "no namespace" when converting to XML. Otherwise, treat the attribute as being in that namespace when converting to XML.
This is very neat because it solves the ambiguity described in https://github.com/pdf-association/pdf-issues/issues/286 but requires no changes to the PDF spec - it's just guidance for PDF processors. But it's not documented anywhere in the PDF spec, and I really think that needs to change (although I believe it might be part of an upcoming LaTeX best-practice guide).
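As a rough illustration of how little a processor needs in order to apply the rule, here is a minimal sketch. It assumes namespaces are compared by their URI strings (the rule as stated compares namespace objects), and the function name `xml_attribute_name` and the `{uri}local` output form are choices made for this example, not anything from the PDF spec.

```python
MATHML = "http://www.w3.org/1998/Math/MathML"
XLINK = "http://www.w3.org/1999/xlink"


def xml_attribute_name(element_ns: str, attribute_ns: str, local_name: str) -> str:
    """Apply the rule to one attribute of a structure element.

    element_ns   -- namespace URI of the structure element (assumed representation)
    attribute_ns -- namespace URI given by the attribute's NS entry
    Returns the XML attribute name: a bare local name for "no namespace",
    or "{uri}local" for an attribute in a namespace.
    """
    if attribute_ns == element_ns:
        # Same namespace as the element: emit as an attribute in no namespace.
        return local_name
    # Different namespace: keep the attribute in that namespace.
    return f"{{{attribute_ns}}}{local_name}"


print(xml_attribute_name(MATHML, MATHML, "display"))
print(xml_attribute_name(MATHML, XLINK, "href"))
```

Note that nothing in the function depends on knowing what MathML is; the decision is made purely by comparing the two namespaces.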
The consequence of not doing this is that anyone extracting MathML from PDF to XML will quite possibly generate incorrect MathML. The same statement applies to HTML and any other content originating from an XML syntax that we might reasonably expect to see stored in PDF.
The only other way around this is namespace-specific knowledge, which I don't think is reasonable. If someone adds some ChemML content to a PDF, I shouldn't have to know ChemML to correctly namespace the attributes when deriving XML. This is not a problem when deriving HTML, but only because HTML does have this namespace-specific knowledge - and HTML is limited to MathML, SVG and HTML.
Proposed Action
The addition of another note, or an addendum to the existing note, reading something like this:
PDF processors extracting attributes to XML may consider attributes with an identical NS entry to the structure element to be attributes without a namespace, and attributes with a different NS entry to be attributes in the specified namespace.
I have made this as non-normative as I can; it's guidance, nothing more. On wordsmithing: I'm not sure if "attributes without a namespace" would be better phrased as "attributes in the empty namespace" - the former is a bit closer to the W3C spec language, the latter is pretty common too.
I'd be in favour of this.
Part of this mapping (restricted to the MathML namespace) is in the Best Practice Guide being developed by the PDFA LaTeX LWG; it is described in more detail in the LaTeX3 project discussion at
https://github.com/latex3/tagging-project/discussions/789
This is important: without it, MathML structure elements are left completely undefined (other than having /NS set to the MathML namespace), because the MathML specification says nothing about PDF Structure Elements. Without a mapping from PDF Structure Elements to XML, there is no specification of MathML Structure Elements at all.
Note that PDF does have a natural encoding for attributes in no namespace; it's just not used in the mapping described there. They could be modelled by keys in the dictionary that represents the structure element, not as PDF Structure attributes at all. We do do this for some specific attributes, notably id: id attributes in the XML are modelled by the ID key in the dictionary, not by an id attribute with /NSO and the MathML Namespace.
Having the details spelled out in a PDF Association rather than LaTeX Project document would be helpful.
I typically use the idiom "in no namespace" for elements or attributes not in a namespace. Probably influenced by the phrasing in the XSLT spec, e.g.
In all other cases, a lexical QName with no prefix represents an expanded QName in no namespace (that is, an xs:QName value in which both the prefix and the namespace URI are absent).
from https://www.w3.org/TR/xslt-30/#dt-lexical-qname
Agreed that for some special attributes (id, class, alt, lang) there is a natural way to do this, but adding keys directly to the StructElem is asking for trouble when the spec is revised in X years and StructElem gains some new keys!
And of course this is strictly a PDF 2.x issue, which is probably obvious but worth stating.
I think "can" instead of "may", as the statement is factual, not permissive.
@mrbhardy this item should probably be addressed in the Reuse TWG rather than the PDF TWG.
Agreed that for some special attributes (id, class, alt, lang) there is a natural way to do this, but adding keys directly to the StructElem is asking for trouble when the spec is revised in X years and StructElem gains some new keys!
Sure, agreed with that, which is why we went with Structure attributes except for a few special cases. (Although PDF Structure attributes are more or less completely different to XML attributes, apart from the name.) However, when mapping to XML (for example, to try to make use of the standard Schema entry in a namespace dictionary) you need a way to map any Structure Element to XML, even if it's using features not used in html/mathml. So the mapping in the tagging-project page describes mappings for unknown keys and user properties and other things we never actually generate.
The PDF 2 spec ought to have specified this (but doesn't) so this proposal would be a small step in the right direction.
In the proposed note I think you have to include the special cases. In the title of this issue you have worded it as a mapping from PDF Structure Tree to XML, but equally important is the reverse mapping: PDF generators need to know how to encode a MathML formula as a fragment of the Structure Tree. We map <mi id="blub">x</mi> to a structure element with PDF /ID key with value blub, so it would be unfortunate if the new note suggested that we should generate a Structure element with attribute id with /NSO and namespace being the MathML namespace.
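To illustrate the reverse direction with the id special case, here is a rough sketch. The dictionary layout (`S`, `NS`, `ID`, `attributes`) is a stand-in invented for this example rather than real PDF object syntax, only the id special case is handled, and element content is ignored; the actual mapping is the one described in the tagging-project discussion.

```python
import xml.etree.ElementTree as ET

MATHML = "http://www.w3.org/1998/Math/MathML"


def split_clark(name):
    """Split '{uri}local' into (uri, local); uri is None for "no namespace"."""
    if name.startswith("{"):
        uri, local = name[1:].split("}", 1)
        return uri, local
    return None, name


def to_struct_elem(xml_text):
    """Map one XML element to a hypothetical structure-element dictionary."""
    root = ET.fromstring(xml_text)
    elem_ns, tag = split_clark(root.tag)
    struct = {"S": tag, "NS": elem_ns, "attributes": []}
    for name, value in root.attrib.items():
        attr_ns, local = split_clark(name)
        if attr_ns is None and local == "id":
            # Special case: id becomes the ID key, not a structure attribute.
            struct["ID"] = value
            continue
        # Attributes in no namespace take the element's namespace as owner;
        # attributes in a namespace keep their own.
        struct["attributes"].append(
            {"NS": attr_ns or elem_ns, "name": local, "value": value}
        )
    return struct


struct = to_struct_elem(f'<mi xmlns="{MATHML}" id="blub" mathvariant="normal">x</mi>')
print(struct)
```

Here id ends up as the `ID` entry, while mathvariant (added to the example only to show an ordinary attribute) becomes a structure attribute owned by the MathML namespace, which is the inverse of the extraction rule.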