XML
Node 'indexing' from root gives strange `mod 2` result (alternating populated/unpopulated lines)
While trying to reproduce some examples (from the test folder), I stumbled onto an indexing oddity.
Below, $xml.root[1], $xml.root[3], $xml.root[5], and $xml.root[7] return text.
Conversely, $xml.root[0], $xml.root[2], $xml.root[4], $xml.root[6], and $xml.root[8] return blank lines.
Is this canonical XML behavior? Thx.
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.say;' ~/exemel_text.xml
<?xml version="1.0"?><root>
<file>text1</file>
<file>text2</file>
<file>text3</file>
<file>text4</file>
</root>
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.say;' ~/exemel_text.xml
<root>
<file>text1</file>
<file>text2</file>
<file>text3</file>
<file>text4</file>
</root>
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root[0].say;' ~/exemel_text.xml
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[0].say;' ~/exemel_text.xml
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[1].say;' ~/exemel_text.xml
<file>text1</file>
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[2].say;' ~/exemel_text.xml
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[3].say;' ~/exemel_text.xml
<file>text2</file>
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[4].say;' ~/exemel_text.xml
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[5].say;' ~/exemel_text.xml
<file>text3</file>
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[6].say;' ~/exemel_text.xml
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[7].say;' ~/exemel_text.xml
<file>text4</file>
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[8].say;' ~/exemel_text.xml
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[9].say;' ~/exemel_text.xml
(Any)
~$
Rakudo 2023.05 / MacOS;
XML:ver<0.3.3>:auth<zef:raku-community-modules>
@supernovus @jonathanstowe
Maybe related to #20 ?
raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); for ^$xml.root.nodes.elems { say "$_:"; say $xml.root[$_].^name.indent(4); say $xml.root[$_].gist.raku; }' exemel_text.xml
0:
XML::Text
"\n "
1:
XML::Element
"<file>text1</file>"
2:
XML::Text
"\n "
3:
XML::Element
"<file>text2</file>"
4:
XML::Text
"\n "
5:
XML::Element
"<file>text3</file>"
6:
XML::Text
"\n "
7:
XML::Element
"<file>text4</file>"
8:
XML::Text
"\n"
Compare:
cat exemel_text_nospace.xml
<?xml version="1.0"?><root><file>text1</file><file>text2</file><file>text3</file><file>text4</file></root>
raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); for ^$xml.root.nodes.elems { say "$_:"; say $xml.root[$_].^name.indent(4); say $xml.root[$_].gist.raku.indent(4); }' exemel_text_nospace.xml
0:
XML::Element
"<file>text1</file>"
1:
XML::Element
"<file>text2</file>"
2:
XML::Element
"<file>text3</file>"
3:
XML::Element
"<file>text4</file>"
I believe this is required by the XML specification.
I would favor closing this as "not a bug".
Could someone point out the XML requirement in the specification document below? Thx.
https://www.w3.org/TR/2008/REC-xml-20081126/
https://www.w3.org/TR/2008/REC-xml-20081126/#sec-white-space
In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, "significant" white space that should be preserved in the delivered version is common, for example in poetry and source code. An XML processor MUST always pass all characters in a document that are not markup through to the application. A validating XML processor MUST also inform the application which of these characters constitute white space appearing in element content.
I haven't checked yet, but I don't think this module validates (in other words, I don't think it has support for DTDs).
The following seems to be intended for validating processors as well, unless we can just say "we're not a validating processor, so we will also accept xml:space on elements where we don't have it declared as an attribute"?
A special attribute named xml:space may be attached to an element to signal an intention that in that element, white space should be preserved by applications. In valid documents, this attribute, like any other, MUST be declared if it is used.
Hi @timo , your comments are beyond my pay-grade. Maybe you and @jonathanstowe can come to some conclusion on this? Otherwise I believe this Issue is in a condition where it is ready to be closed.
Thanks for all your code investigations above!
I concur with @timo's conclusion. To put it at its simplest: the Text nodes must be generated there because something like:
<?xml version="1.0"?>
<root>
<file>text1</file>
outside
<file>text2</file>
outside
<file>text3</file>
outside
<file>text4</file>
outside
</root>
could be perfectly valid (in XSD terms, root can be a complex type with mixed content). Because the parser has no way of knowing what was intended (it is not validating against e.g. an XSD or DTD), the whitespace there should be considered part of the content rather than ignorable.
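As a rough cross-check of the point above, here is a minimal sketch using Python's standard-library xml.dom.minidom rather than the Raku XML module (so it only illustrates the general parser behaviour, not this module's internals). The two input strings and the child_summary helper are made up for illustration. A non-validating parser produces the same node sequence for both inputs, because it cannot tell "layout" whitespace from real character data:

```python
from xml.dom import minidom

# Two hypothetical inputs: one with whitespace only between the elements,
# one with real character data ("outside") in the same positions.
WHITESPACE_ONLY = "<root>\n  <file>text1</file>\n  <file>text2</file>\n</root>"
MIXED_CONTENT = "<root>\n  <file>text1</file>\n  outside\n  <file>text2</file>\n</root>"


def child_summary(xml_text):
    """Return (node kind, content) for each direct child of the root element."""
    root = minidom.parseString(xml_text).documentElement
    summary = []
    for node in root.childNodes:
        if node.nodeType == node.TEXT_NODE:
            # repr() makes the otherwise invisible whitespace visible.
            summary.append(("Text", repr(node.data)))
        elif node.nodeType == node.ELEMENT_NODE:
            summary.append(("Element", node.toxml()))
    return summary


for label, doc in (("whitespace only", WHITESPACE_ONLY), ("mixed", MIXED_CONTENT)):
    print(label)
    for index, (kind, content) in enumerate(child_summary(doc)):
        print(f"  {index}: {kind} {content}")
```

In both cases the children alternate Text, Element, Text, Element, Text; only the content of the Text nodes differs.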
Just for completeness:
Essentially this module is giving you a fairly raw parse tree (with some sugar), so when you index the element (as in $xml.root[1]) you get all the child nodes of any type (i.e. it is using the .nodes accessor of the element object). So if you had something like:
<?xml version="1.0"?>
<root>
<?xml-stylesheet type="text/xsl" href="style.xsl"?>
<file>text1</file>
<!-- comment -->
<file>text2</file>
<file>text3</file>
<file>text4</file>
</root>
And do:
raku -I. -MXML -e 'my $xml=open-xml($*ARGFILES.Str); for $xml.root.nodes.list -> $node { say $node.^name }' exeml_test_all.xml
You get:
XML::Text
XML::PI
XML::Text
XML::Element
XML::Text
XML::Comment
XML::Text
XML::Element
XML::Text
XML::Element
XML::Text
XML::Element
XML::Text
If you know the structure of your XML and know that it isn't "mixed" content (so you are only interested in the Element children), then you would use the .elements accessor of the element:
raku -I. -MXML -e 'my $xml=open-xml($*ARGFILES.Str); say $xml.root.elements[0]' exeml_test_all.xml
<file>text1</file>
This may or may not be DWIMy depending on your requirements, but it definitely is consistent with the XML specifications.
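The same distinction between "all child nodes" and "element children only" can be sketched with Python's standard-library xml.dom.minidom. This is an analogy, not the Raku module's API: minidom has no .elements accessor, so the filter below is written by hand, and the SAMPLE string and TYPE_NAMES table are made up for illustration.

```python
from xml.dom import Node, minidom

# Hypothetical sample with a processing instruction, a comment,
# and whitespace between every child of <root>.
SAMPLE = """<root>
  <?xml-stylesheet type="text/xsl" href="style.xsl"?>
  <file>text1</file>
  <!-- comment -->
  <file>text2</file>
</root>"""

root = minidom.parseString(SAMPLE).documentElement

# Readable labels for the node types that occur in SAMPLE.
TYPE_NAMES = {
    Node.TEXT_NODE: "Text",
    Node.ELEMENT_NODE: "Element",
    Node.COMMENT_NODE: "Comment",
    Node.PROCESSING_INSTRUCTION_NODE: "PI",
}

# Every child node, whatever its type (comparable to indexing .nodes).
all_children = [TYPE_NAMES[n.nodeType] for n in root.childNodes]

# Element children only (comparable in spirit to .elements).
element_children = [n for n in root.childNodes if n.nodeType == Node.ELEMENT_NODE]

print(all_children)
print(element_children[0].toxml())
```

Indexing the filtered list at 0 gives the first file element directly, without having to step over the whitespace, PI, and comment nodes.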
And finally, just to show that this isn't just an idiosyncrasy of this module, this is the parse tree of the original document from xmllint (part of libxml2):
jonathan@menenius:~/devel/raku/3rdparty-modules/XML$ xmllint --debug exeml_test.xml
DOCUMENT
version=1.0
URL=exeml_test.xml
standalone=true
ELEMENT root
TEXT compact
content=
ELEMENT file
TEXT compact
content=text1
TEXT compact
content=
ELEMENT file
TEXT compact
content=text2
TEXT compact
content=
ELEMENT file
TEXT compact
content=text3
TEXT compact
content=
ELEMENT file
TEXT compact
content=text4
TEXT compact
content=
showing the parsed (whitespace-only) Text nodes.
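For one more data point, here is a small sketch with Python's standard-library xml.etree.ElementTree, which models the tree differently (no separate Text node objects) but still keeps the whitespace. The DOC string is a made-up, shortened version of the original file; the exact indentation in it is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical shortened document with two-space indentation.
DOC = "<root>\n  <file>text1</file>\n  <file>text2</file>\n</root>"
root = ET.fromstring(DOC)

# Whitespace before the first child is stored on the parent as .text;
# whitespace following each child is stored on that child as .tail.
print(repr(root.text))
for child in root:
    print(child.tag, repr(child.text), repr(child.tail))
```

Here the inter-element whitespace ends up in root.text and in each child's .tail rather than in sibling nodes, but it is still passed through to the application.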