Node 'indexing' from root gives strange `mod 2` result (alternating populated/unpopulated lines)

Open jubilatious1 opened this issue 1 year ago • 6 comments

While trying to reproduce some examples (from the test folder), I stumbled onto an indexing oddity.

Below, $xml.root[1], $xml.root[3], $xml.root[5], and $xml.root[7] return text.

Conversely, $xml.root[0], $xml.root[2], $xml.root[4], $xml.root[6], and $xml.root[8] return blank lines.

Is this canonical XML behavior? Thx.

~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.say;' ~/exemel_text.xml
<?xml version="1.0"?><root>
  <file>text1</file>
  <file>text2</file>
  <file>text3</file>
  <file>text4</file>
</root>
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.say;' ~/exemel_text.xml
<root>
  <file>text1</file>
  <file>text2</file>
  <file>text3</file>
  <file>text4</file>
</root>
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root[0].say;' ~/exemel_text.xml


~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[0].say;' ~/exemel_text.xml


~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[1].say;' ~/exemel_text.xml
<file>text1</file>
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[2].say;' ~/exemel_text.xml


~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[3].say;' ~/exemel_text.xml
<file>text2</file>
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[4].say;' ~/exemel_text.xml


~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[5].say;' ~/exemel_text.xml
<file>text3</file>
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[6].say;' ~/exemel_text.xml


~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[7].say;' ~/exemel_text.xml
<file>text4</file>
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[8].say;' ~/exemel_text.xml


~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); $xml.root.[9].say;' ~/exemel_text.xml
(Any)
~$

Rakudo 2023.05 / MacOS; XML:ver<0.3.3>:auth<zef:raku-community-modules>

@supernovus @jonathanstowe

jubilatious1 avatar Feb 27 '24 23:02 jubilatious1

Maybe related to #20 ?

jubilatious1 avatar Feb 28 '24 00:02 jubilatious1

raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); for ^$xml.root.nodes.elems { say "$_:"; say $xml.root[$_].^name.indent(4); say $xml.root[$_].gist.raku.indent(4); }' exemel_text.xml
0:
    XML::Text
    "\n  "
1:
    XML::Element
    "<file>text1</file>"
2:
    XML::Text
    "\n  "
3:
    XML::Element
    "<file>text2</file>"
4:
    XML::Text
    "\n  "
5:
    XML::Element
    "<file>text3</file>"
6:
    XML::Text
    "\n  "
7:
    XML::Element
    "<file>text4</file>"
8:
    XML::Text
    "\n"

Compare:

cat exemel_text_nospace.xml 
<?xml version="1.0"?><root><file>text1</file><file>text2</file><file>text3</file><file>text4</file></root>
raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); for ^$xml.root.nodes.elems { say "$_:"; say $xml.root[$_].^name.indent(4); say $xml.root[$_].gist.raku.indent(4); }' exemel_text_nospace.xml
0:
    XML::Element
    "<file>text1</file>"
1:
    XML::Element
    "<file>text2</file>"
2:
    XML::Element
    "<file>text3</file>"
3:
    XML::Element
    "<file>text4</file>"

I believe this is required by the XML specification.

I would favor closing this as "not a bug".

timo avatar Feb 19 '25 21:02 timo

Could someone point out the XML requirement, in the specification document below? Thx.

https://www.w3.org/TR/2008/REC-xml-20081126/

jubilatious1 avatar Feb 20 '25 19:02 jubilatious1

https://www.w3.org/TR/2008/REC-xml-20081126/#sec-white-space

In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, "significant" white space that should be preserved in the delivered version is common, for example in poetry and source code. An XML processor MUST always pass all characters in a document that are not markup through to the application. A validating XML processor MUST also inform the application which of these characters constitute white space appearing in element content.

I haven't checked yet, but I don't think this module validates; in other words, I don't think it has support for DTDs.

The following seems to be intended for validating processors as well, unless we can just say "we're not a validating processor, so we will also accept xml:space on elements where we don't have it declared as an attribute"?

A special attribute named xml:space may be attached to an element to signal an intention that in that element, white space should be preserved by applications. In valid documents, this attribute, like any other, MUST be declared if it is used.

timo avatar Feb 20 '25 19:02 timo

Hi @timo, your comments are beyond my pay-grade. Maybe you and @jonathanstowe can come to some conclusion on this? Otherwise I believe this Issue is ready to be closed.

Thanks for all your code investigations above!

jubilatious1 avatar Feb 20 '25 20:02 jubilatious1

I concur with @timo's conclusion. To put it at its simplest: the Text nodes must be generated there because something like:

<?xml version="1.0"?>
<root>
  <file>text1</file>
  outside
  <file>text2</file>
  outside
  <file>text3</file>
  outside
  <file>text4</file>
  outside
</root>

could be perfectly valid (in XSD terms, root can be a complex type with mixed content), and because the parser has no way of knowing what was intended (it is not validating against e.g. an XSD or DTD), the whitespace there should be considered part of the content rather than ignorable.
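A minimal sketch of the same point, using Python's standard-library xml.dom.minidom as a stand-in for a non-validating parser (this is not the Raku XML module, and the shortened document below is an assumption made for illustration). Without a schema, the whitespace-only text before the first element and the "outside" text between elements arrive as the same kind of node:

```python
from xml.dom import minidom

# Shortened version of the mixed-content document above (illustrative only).
MIXED = """<?xml version="1.0"?>
<root>
  <file>text1</file>
  outside
  <file>text2</file>
  outside
</root>"""

children = minidom.parseString(MIXED).documentElement.childNodes

for index, node in enumerate(children):
    if node.nodeType == node.ELEMENT_NODE:
        print(index, "Element", node.toxml())
    else:
        # Text nodes: whitespace-only and "real" text look identical to the parser.
        print(index, "Text", repr(node.data))
```

Dropping the whitespace-only node at index 0 would require the parser to guess that the later text nodes are the only meaningful ones, which it cannot know without a DTD or schema.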

Just for completeness:

Essentially this module is giving you a fairly raw parse tree (with some sugar), so when you index the element you get all the child nodes of any type (i.e. it uses the .nodes accessor of the element object). So if you had something like:

<?xml version="1.0"?>
<root>
  <?xml-stylesheet type="text/xsl" href="style.xsl"?>
  <file>text1</file>
  <!-- comment -->
  <file>text2</file>
  <file>text3</file>
  <file>text4</file>
</root>

And do:

raku -I. -MXML -e 'my $xml=open-xml($*ARGFILES.Str); for $xml.root.nodes.list -> $node { say $node.^name }' exeml_test_all.xml

You get:

XML::Text
XML::PI
XML::Text
XML::Element
XML::Text
XML::Comment
XML::Text
XML::Element
XML::Text
XML::Element
XML::Text
XML::Element
XML::Text

If you know the structure of your XML and know that it isn't "mixed" content (so you are only interested in the Element children), then you would use the .elements accessor of the element:

raku -I. -MXML -e 'my $xml=open-xml($*ARGFILES.Str); say $xml.root.elements[0]' exeml_test_all.xml
<file>text1</file>

This may or may not be DWIMy depending on your requirements, but it definitely is consistent with the XML specifications.
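The distinction between "all child nodes" and "element children only" can be sketched in Python's standard-library xml.dom.minidom. This is an analogy to .nodes versus .elements, not the Raku module's implementation, and the sample document is an assumption made for illustration:

```python
from xml.dom import minidom

# Illustrative document with whitespace, a comment, and two elements.
SOURCE = """<?xml version="1.0"?>
<root>
  <file>text1</file>
  <!-- comment -->
  <file>text2</file>
</root>"""

root = minidom.parseString(SOURCE).documentElement

# Every child, whatever its type (roughly what indexing the element gives you).
all_nodes = list(root.childNodes)

# Element children only (roughly what an element-only accessor gives you).
elements = [n for n in all_nodes if n.nodeType == n.ELEMENT_NODE]

print(len(all_nodes), "child nodes,", len(elements), "of them elements")
print(elements[0].toxml())
```

Index 0 of the filtered list is the first file element, whereas index 0 of the full child list is the whitespace text node that follows the opening root tag.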

And finally, just to show that this isn't just an idiosyncrasy of this module, this is the parse tree of the original document from xmllint (part of libxml2):

jonathan@menenius:~/devel/raku/3rdparty-modules/XML$ xmllint --debug exeml_test.xml
DOCUMENT
version=1.0
URL=exeml_test.xml
standalone=true
  ELEMENT root
    TEXT compact
      content=   
    ELEMENT file
      TEXT compact
        content=text1
    TEXT compact
      content=   
    ELEMENT file
      TEXT compact
        content=text2
    TEXT compact
      content=   
    ELEMENT file
      TEXT compact
        content=text3
    TEXT compact
      content=   
    ELEMENT file
      TEXT compact
        content=text4
    TEXT compact
      content= 

showing the parsed (whitespace-only) Text nodes.
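The alternating pattern from the original report can also be reproduced with a third parser. The following is a sketch using Python's standard-library xml.dom.minidom (an assumption for illustration, unrelated to the Raku module or libxml2), fed the same document as exemel_text.xml:

```python
from xml.dom import minidom

# Same content as exemel_text.xml from the original report.
SOURCE = """<?xml version="1.0"?><root>
  <file>text1</file>
  <file>text2</file>
  <file>text3</file>
  <file>text4</file>
</root>"""

root = minidom.parseString(SOURCE).documentElement

# Classify each child of root as an element or a text node.
kinds = ["Element" if n.nodeType == n.ELEMENT_NODE else "Text"
         for n in root.childNodes]

for index, (kind, node) in enumerate(zip(kinds, root.childNodes)):
    shown = node.toxml() if kind == "Element" else repr(node.data)
    print(index, kind, shown)
```

Here too the even indices hold whitespace text nodes and the odd indices hold the file elements, matching the `mod 2` behaviour in the report.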

jonathanstowe avatar Feb 22 '25 10:02 jonathanstowe