groovy-common-extensions

a more sensible approach to map coercion from a nodechild

Open danveloper opened this issue 11 years ago • 28 comments

I think this is a more sensible approach to the NodeChild -> Map translation.

Before, we were taking an XML structure with nested elements and embedding them within a specified 'child' tag, so XML that looked like:

<dan id="1234">
   <key value="val" />
   <key value="val2" />
</dan>

Mapped out to:

assert map == [dan: [_children: [ [key: [value: "val" ] ], [key: [value: "val2" ] ] ] ] ]

This change proposes to change this to:

assert map == [dan: [ key: [ [value: "val"], [value: "val2"] ] ] ]

This change also makes conversion to JSON be an accurate representation of the source XML structure.
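To illustrate the JSON point, here is a minimal sketch that serializes the proposed map with `groovy.json.JsonOutput`. The map literal is copied from the example above; nothing here is part of the extension itself.

```groovy
import groovy.json.JsonOutput

// The proposed shape: repeated <key> elements are grouped
// into a list under the element name.
def map = [dan: [key: [[value: 'val'], [value: 'val2']]]]

def json = JsonOutput.toJson(map)
println json
// {"dan":{"key":[{"value":"val"},{"value":"val2"}]}}
```

The repeated elements come out as a JSON array under a single `key` property, which mirrors the nesting of the source XML.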

Thoughts?

danveloper avatar Jul 26 '13 19:07 danveloper

What happens to

<dan cool="yes">
  <cool>maybe</cool>
</dan>

Apart from the guy who came up with the schema getting their hand slammed in a drawer? ;-)

timyates avatar Jul 26 '13 20:07 timyates

That was a great question because it actually revealed the issue with the code where I wasn't handling values that didn't come from attributes. I think the latest commit is a more appropriate approach to this.

It's not as functional as I'd like, as state has to be maintained (and modified) while traversing the graph, but it does work. Can you think of a more direct way of doing that? It might just be the nature of this feature...
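One possible way to avoid the mutable state, sketched on plain data rather than on real NodeChild objects: group the children by name first, then collapse single-element groups. The `children` list of name/value pairs is a hypothetical stand-in for whatever the traversal yields.

```groovy
// Hypothetical input: (name, value) pairs in document order.
def children = [
    ['key',   [value: 'val']],
    ['key',   [value: 'val2']],
    ['other', [id: '1']]
]

// Group by element name, then keep a bare value when a name occurs once
// and a list when it repeats.
def grouped = children.groupBy { it[0] }.collectEntries { name, pairs ->
    def vals = pairs.collect { it[1] }
    [(name): vals.size() == 1 ? vals[0] : vals]
}

assert grouped == [key: [[value: 'val'], [value: 'val2']], other: [id: '1']]
```

This keeps the traversal free of side effects, at the cost of losing the relative order of differently named siblings.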

danveloper avatar Jul 27 '13 04:07 danveloper

It seems I have no idea how we can collaborate on a pull request, so I've opened https://github.com/timyates/groovy-common-extensions/pull/8 to show how we could support both the old way and the new. But is it getting too complex now?

timyates avatar Jul 29 '13 08:07 timyates

Actually, ignore that, it fails with

<dan id="1234" attr="attrValue">
    <key value="val">
        <key>Tim</key>
    </key>
    <key value="val2" />
</dan>

timyates avatar Jul 29 '13 08:07 timyates

What should:

   <dan id="1234" attr="attrValue">
     <key value="val" />
     <key value="val2" />
     <values id='1234'>
       <value id='a'>One</value>
       <value id='b'>Two</value>
       <value id='c'>Three</value>
     </values>
   </dan>

Return?

It currently gives:

[ dan: [ id: '1234',
         attr: 'attrValue',
         key: [ [ value: 'val' ],
                [ value: 'val2' ] ],
         values: [ id: '1234',
                   value: [ [ id: 'a' ], [ id: 'b' ], [ id: 'c' ] ] ] ] ]

With no mention of one, two or three

(I think this was broken before as well :frowning: )

timyates avatar Jul 29 '13 08:07 timyates

XML->Maps is horrible no? ;-)

timyates avatar Jul 29 '13 09:07 timyates

Yes, quite. I'm not sure how the text() should be represented in the map data structure.

danveloper avatar Jul 29 '13 12:07 danveloper

It's fine when there's no attributes, but then it sort of falls apart :confused:

timyates avatar Jul 29 '13 13:07 timyates

I'm ok with following the way that Perl handles this when coercing XML to a Reference Hash data structure:

<dan>
  <value attr='123' />
</dan>
$VAR1 = {
          'value' => {
                     'attr' => '123'
                   }
        };

---
<dan>
  <value attr='123'>abc</value>
</dan>
$VAR1 = {
          'value' => {
                     'content' => 'abc',
                     'attr' => '123'
                   }
        };

---
<dan>
  <value content='123'>abc</value>
</dan>
$VAR1 = {
          'value' => {
                     'content' => [
                                    '123',
                                    'abc'
                                  ]
                   }
        };

---

Basically taking the inner value and representing it as 'content'.

danveloper avatar Jul 29 '13 13:07 danveloper

What does it do with:

<dan>
  <value>abc</value>
</dan>

?

timyates avatar Jul 29 '13 13:07 timyates

<dan>
  <value>abc</value>
</dan>

$VAR1 = {
          'value' => 'abc'
        };
---

So, basically, if the element has text, it is stored under the 'content' key (or as the bare value when the element has no attributes); if 'content' is also an attribute name, the key is reused and both values end up in a list.
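A minimal Groovy sketch of that key-reuse rule, assuming a hypothetical helper named `addValue` (not part of any library): the first value is stored bare, and a collision promotes the entry to a list.

```groovy
// Hypothetical helper: store a value under a key, promoting to a list on collision.
def addValue(Map m, String key, value) {
    if (!m.containsKey(key)) {
        m[key] = value                 // first occurrence: bare value
    } else if (m[key] instanceof List) {
        m[key] << value                // already a list: append
    } else {
        m[key] = [m[key], value]       // second occurrence: promote to list
    }
    m
}

def value = [:]
addValue(value, 'content', '123')      // attribute content='123'
addValue(value, 'content', 'abc')      // element text
assert value == [content: ['123', 'abc']]
```

The downside is the one the Perl output already shows: once the values are merged, nothing records which came from the attribute and which from the text.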

danveloper avatar Jul 29 '13 13:07 danveloper

Or, there's the Perl XML::Hash::LX method which gives:

use strict;
use XML::Hash::LX;
use Data::Dumper;

my $xml=<<'EOF';
<dan>
  <value content='123'>abc</value>
</dan>
EOF

my $hash = xml2hash $xml, attr => '.', text => '~';
print Dumper($hash);

as

$VAR1 = {
          'dan' => {
                   'value' => {
                              '~' => 'abc',
                              '.content' => '123'
                            }
                 }
        };

Or again, there's EasyTree which gives the (imho uglier)

use strict;
use XML::Parser;
use XML::Parser::EasyTree;
use Data::Dumper;

$XML::Parser::EasyTree::Noempty=1;
my $xml=<<'EOF';
<dan>
  <value content='123'>abc</value>
</dan>
EOF
my $p=new XML::Parser(Style=>'EasyTree');
my $tree=$p->parse($xml);
print Dumper($tree);

as

$VAR1 = [
          {
            'content' => [
                           {
                             'content' => [
                                            {
                                              'content' => 'abc',
                                              'type' => 't'
                                            }
                                          ],
                             'name' => 'value',
                             'attrib' => {
                                           'content' => '123'
                                         },
                             'type' => 'e'
                           }
                         ],
            'name' => 'dan',
            'attrib' => {},
            'type' => 'e'
          }
        ];

I guess we should also consider what happens with:

<dan>
    <child>
        Tim
        <node/>
        Dan
    </child>
</dan>

timyates avatar Jul 29 '13 14:07 timyates

PS: Reading this back, I should clarify that I'm not being an arse :wink: I just reckon if we get this right, we'll never have to think this deeply about xml again :laughing:

timyates avatar Jul 29 '13 14:07 timyates

You're not being an arse; I didn't even think about your comments being taken that way until you said that. (Also, you should know by now that I'm not so soft ;-)).

This is a tough problem to solve. Ultimately, the goal is to map to a data structure that accurately represents the source content...

Although verbose, I do kind of like the last Perl solution, because it's pretty explicit about what/where things are, which means that it becomes easier to programmatically reproduce the source content.

That said, I don't like it because it's very verbose and very strict in its definitions. I think that edge-cases will have to be vetted under that model before we can commit to it.

Nested content models become a much bigger deal. I'm not sure we'll be able to accurately represent them without some other meta-data in the Map that describes "layout". And when we go down this road, we're starting to get away from the base idea, which is to represent XML as a Map. We may need to bail on tags nested in content altogether, or represent the content of that tag w/ the embedded XML.

Thoughts?

danveloper avatar Jul 29 '13 14:07 danveloper

I'll have a think...

Came up with this in the meantime:

def xml = '''<dan root='true'>
            |  <values id='1'>
            |    Tim
            |    <node id='2'>Dave</node>
            |    <node id='3'>Alice</node>
            |    <sub/>
            |    Yates
            |  </values>
            |  <last attr='Mr Woods'/>
            |</dan>'''.stripMargin()

def toMap( node, params=[:] ) {
  String attrPrefix = params.attr ?: '.'
  String textHolder = params.text ?: '#text'
  String textConcat = params.conc ?: '+'
  if( node instanceof String ) {
    return node
  }
  else {
      def m = [:]
      node.attributes()?.each { k, v ->
        m << [ (attrPrefix + k): v ]
      }
      node.children().each {
          def r = toMap( it, params )
          if( r instanceof String ) {
              m[ textHolder ] = m[ textHolder ] ? "${m[textHolder]}$textConcat$r" : r
          }
          else {
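              def existing = m[ it.name() ]
              if( existing instanceof List ) {
                  // already a list of same-named siblings: append
                  existing << r
              }
              else if( m.containsKey( it.name() ) ) {
                  // second occurrence: promote the single value to a list
                  m[ it.name() ] = [ existing, r ]
              }
              else {
                  m[ it.name() ] = r
              }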
          }
      }
      m
  }
}

assert toMap( new XmlParser().parseText( xml ) ) == [ '.root' : 'true',
                                                      'values' : [ '.id':'1', '#text':'Tim+Yates', 'node':[ ['.id':'2', '#text':'Dave'],
                                                                                                            ['.id':'3', '#text':'Alice'] ], 'sub':[:] ],
                                                      'last':[ '.attr' : 'Mr Woods' ] ]

timyates avatar Jul 29 '13 15:07 timyates

Of course, XmlParser and XmlSlurper not having a concept of <![CDATA[ ]]> or comments doesn't help...

Think we have to start with a SAX DefaultHandler with an implemented LexicalHandler if we're going to do it right ;-)

timyates avatar Jul 30 '13 10:07 timyates

ie:

import org.xml.sax.*
import org.xml.sax.helpers.DefaultHandler
import org.xml.sax.ext.LexicalHandler
import javax.xml.parsers.SAXParserFactory
import javax.xml.parsers.ParserConfigurationException
import javax.xml.parsers.SAXParser

class ToMapParser extends DefaultHandler implements LexicalHandler {
    private attrsToList( Attributes a ) {
        (0..<a.length).collectEntries {
          [ (a.getLocalName( it ) ?: a.getQName( it )):a.getValue( it ) ]
        }
    }

    void setDocumentLocator(Locator l) {
        println "LOCATOR: SYS ID: ${l.systemId}"
    }

    void startDocument() throws SAXException {
        println "START DOCUMENT"
    }

    void endDocument() throws SAXException {
        println "END DOCUMENT"
    }

    void startElement(String namespaceURI, String sName, String qName, Attributes attrs) throws SAXException {
        println "START ELEMENT $namespaceURI, $sName, $qName, ${attrsToList( attrs )}"
    }

    void endElement(String namespaceURI, String sName, String qName ) throws SAXException {
        println "END ELEMENT $namespaceURI, $sName, $qName"
    }

    void characters(char[] buf, int offset, int len) throws SAXException {
        String s = new String(buf, offset, len);
        println "CHARS $s"
    }

    void ignorableWhitespace( char[] buf, int offset, int len ) throws SAXException {
        String s = new String(buf, offset, len);
        println "IGNORE $s"
    }

    void processingInstruction(String target, String data) throws SAXException {
        println "INSTR $target $data"
    }

    // Lexical Handler stuff
    void comment( char[] ch, int start, int length ) throws SAXException {
        String text = new String( ch, start, length )
        println "COMMENT: $text"
    }

    void startCDATA() throws SAXException {
        println "Start CDATA"
    }

    void endCDATA() throws SAXException {
        println "End CDATA"
    }

    void startEntity( String name ) throws SAXException {
        println "Start ENTITY $name"
    }

    void endEntity( String name ) throws SAXException {
        println "End ENTITY $name"
    }

    void startDTD( String name, String publicId, String systemId ) throws SAXException {
        println "Start DTD"
    }

    void endDTD() throws SAXException {
        println "End DTD"
    }
}

def xml = '''<dan root='true'>
            |  <values id='1'>
            |    Tim
            |    <node id='2'>Dave</node>
            |    <node id='3'>Alice</node>
            |    <![CDATA[ <embedded a='1'>Yes</embedded> ]]>
            |    <sub/>
            |    Yates
            |  </values>
            |  <last attr='Mr Woods'/>
            |</dan>'''.stripMargin()

SAXParserFactory.newInstance().with { factory ->
    validating = true
    newSAXParser().with { parser ->
        parser.XMLReader.with { reader ->
            new ToMapParser().with { handler ->
                parser.setProperty( "http://xml.org/sax/properties/lexical-handler", handler )
                new StringReader( xml ).with { sr ->
                    parse( new InputSource( sr ), handler )
                }
            }
        }
    }
}

timyates avatar Jul 30 '13 10:07 timyates

Maybe something that takes this:

<?xml version='1.0' encoding='UTF-8'?>
<dan root='true'>
  <?php echo $a; ?>
  <values id='1'>
    Tim
    <node id='2'>Dave</node>
    <node id='3'>Alice</node>
    <![CDATA[ <embedded a='1'>Yes</embedded> ]]>
    <sub/>
    Yates
  </values>
  <last attr='Mr Woods'/>
</dan>

And takes defaults of # to denote TEXT, € to denote CDATA, ? to denote instructions, and . to prefix attributes to give:

[ dan: [ [ '.root': 'true' ],
         [ '?': [ 'php': 'echo $a;' ] ],
         [ 'values' : [ [ '.id': '1' ],
                        [ '#' : 'Tim' ],
                        [ 'node' : [ [ '.id': '2' ],
                                     [ '#'  : 'Dave' ] ] ],
                        [ 'node' : [ [ '.id': '3' ],
                                     [ '#'  : 'Alice' ] ] ],
                        [ '€' : "<embedded a='1'>Yes</embedded>" ],
                        [ 'sub' : null ],
                        [ '#' : 'Yates' ] ] ],
         [ 'last' : [ [ '.attr' : 'Mr Woods' ] ] ] ] ]

Is workable? Not sure yet if I like the output structure (Lists of Maps), and not sure if this is actually more useful than the original XML, but I can't see how else we can represent the data at the moment...

timyates avatar Jul 30 '13 10:07 timyates

ie: this:

import org.xml.sax.*
import org.xml.sax.helpers.DefaultHandler
import org.xml.sax.ext.LexicalHandler
import javax.xml.parsers.SAXParserFactory
import javax.xml.parsers.ParserConfigurationException
import javax.xml.parsers.SAXParser

class ToMapParser extends DefaultHandler implements LexicalHandler {
    boolean trimCharacters = true
    String  textKey        = '#'
    String  cdataKey       = '¢'
    String  attrKey        = '.'
    String  nsKey          = '!'
    String  instKey        = '?'
    String  commentKey     = '//'
    Map     ns             = [:]
    boolean validating     = true
    boolean namespaceAware = true

    private boolean inCdata = false
    private Map     root    = [:]
    private def     current = root
    private Stack   parent  = []

    private attrsToList( Attributes a ) {
        (0..<a.length).collect {
          [ ".${(a.getLocalName( it ) ?: a.getQName( it ))}":a.getValue( it ) ]
        }
    }

    void setDocumentLocator( Locator l ) { }
    void startDocument() throws SAXException { }

    void endDocument() throws SAXException {
        ns.each { key, uri ->
            root[ root.keySet()[0] ].add( 0, [ (nsKey): [ (key): uri ] ] )
        }
    }

    void startElement(String namespaceURI, String sName, String qName, Attributes attrs) throws SAXException {
        def tokens = qName.split( ':', 2 )
        if( namespaceAware && tokens.length > 1 ) {
            ns[ tokens[ 0 ] ] = namespaceURI
        }
        else if( namespaceAware && namespaceURI ) {
            ns[ '' ] = namespaceURI
        }
        def child = []
        child.addAll attrsToList( attrs )
        if( !root ) {
            root[ qName ] = child
            current = child
            parent << root
        }
        else {
          current << [ (qName) : child ]
          parent << current
          current = child
        }
    }

    void endElement(String namespaceURI, String sName, String qName ) throws SAXException { current = parent.pop() }

    void characters(char[] buf, int offset, int len) throws SAXException {
        new String( buf, offset, len ).with { s ->
            if( trimCharacters ) s = s.trim()
            if( inCdata ) {
                current << [ (cdataKey): s ]
            }
            else if( s ) {
                current << [ (textKey): s ]
            }
        }
    }

    void processingInstruction(String target, String data) throws SAXException { current << [ (instKey): [ (target): data ] ] }
    void startCDATA() throws SAXException { inCdata = true }
    void endCDATA() throws SAXException { inCdata = false }

    void comment( char[] ch, int start, int length ) throws SAXException {
        new String( ch, start, length ).with { s ->
            if( trimCharacters ) s = s.trim()
            current << [ (commentKey): s ]
        }
    }

    void ignorableWhitespace( char[] buf, int offset, int len ) throws SAXException { throw new UnsupportedOperationException( "Ignorable Whitespace?" ) }
    void startEntity( String name ) throws SAXException { throw new UnsupportedOperationException( "Don't do entities, yet" ) }
    void endEntity( String name ) throws SAXException { throw new UnsupportedOperationException( "Don't do entities, yet" ) }
    void startDTD( String name, String publicId, String systemId ) throws SAXException { throw new UnsupportedOperationException( "Don't do DTDs, yet" ) }
    void endDTD() throws SAXException { throw new UnsupportedOperationException( "Don't do DTDs, yet" ) }

    Map parseXml( String xml ) {
        SAXParserFactory.newInstance().with { factory ->
            factory.validating = this.validating
            factory.namespaceAware = this.namespaceAware
            newSAXParser().with { parser ->
                parser.XMLReader.with { reader ->
                    parser.setProperty( "http://xml.org/sax/properties/lexical-handler", this )
                    new StringReader( xml ).with { sr ->
                        parse( new InputSource( sr ), this )
                        root
                    }
                }
            }
        }
    }
}

def xml = '''<?xml version='1.0' encoding='UTF-8'?>
            |<dan root='true' xmlns="http://ns.com/doc" xmlns:tim="http://ns.com/tim">
            |  <?php echo $a; ?>
            |  <values id='1'>
            |    Tim
            |    <!-- A couple of nodes -->
            |    <tim:node id='2'>Dave</tim:node>
            |    <node id='3'>Alice</node>
            |    <![CDATA[ <embedded a='1'>Yes</embedded> ]]>
            |    <sub/>
            |    Yates
            |  </values>
            |  <last attr='Mr Woods'/>
            |</dan>'''.stripMargin()

new ToMapParser().parseXml( xml )

Returns the map:

['dan':[ ['!':['tim':'http://ns.com/tim']],
         ['!':['':'http://ns.com/doc']],
         [.root:'true'],
         ['?':['php':'echo $a; ']],
         ['values':[ [.id:'1'],
                     ['#':'Tim'],
                     ['//':'A couple of nodes'],
                     ['tim:node':[ [.id:'2'],
                                   ['#':'Dave']]],
                     ['node':[ [.id:'3'],
                               ['#':'Alice']]],
                     ['¢':'<embedded a=\'1\'>Yes</embedded>'],
                     ['sub':[]],
                     ['#':'Yates']]],
         ['last':[[.attr:'Mr Woods']]]]]
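To get a feel for how usable this shape is, here is a small sketch of querying it. The `valuesFor` closure is a hypothetical helper, and the data is the `values` entry from the map above, retyped as a Groovy literal.

```groovy
// The content of <values> as a list of single-entry maps, in document order.
def values = [
    ['.id': '1'],
    ['#': 'Tim'],
    ['//': 'A couple of nodes'],
    ['tim:node': [['.id': '2'], ['#': 'Dave']]],
    ['node': [['.id': '3'], ['#': 'Alice']]],
    ['sub': []],
    ['#': 'Yates']
]

// Hypothetical helper: collect every value stored under a key, in document order.
def valuesFor = { List entries, String key ->
    entries.findAll { it.containsKey(key) }.collect { it[key] }
}

assert valuesFor(values, '#') == ['Tim', 'Yates']
assert valuesFor(values, '.id') == ['1']
assert valuesFor(valuesFor(values, 'node')[0], '#') == ['Alice']
```

Every lookup is a linear scan, which is the price of keeping document order and allowing repeated keys.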

Hideous or genius? You decide ;-)

timyates avatar Jul 30 '13 12:07 timyates

Of course, the natural progression for this is:

import org.xml.sax.*
import org.xml.sax.helpers.DefaultHandler
import org.xml.sax.ext.LexicalHandler
import javax.xml.parsers.SAXParserFactory
import javax.xml.parsers.ParserConfigurationException
import javax.xml.parsers.SAXParser
import groovy.transform.*

class Node {
    String name
    ArrayList children
    public Node( String name, List content ) { this.name = name ; this.children = content }
    def    add( int idx, a )                 { children.add( idx, a ) }
    List   children()                        { children }
    String name()                            { name }
    def    namespaces()                      { children.grep( Namespace ).with { r -> r.size() == 1 ? r.head() : r } }
    def    instructions()                    { children.grep( Instruction ).with { r -> r.size() == 1 ? r.head() : r } }
    def    text()                            { children.grep( Text ).with { r -> r.size() == 1 ? r.head().text() : r*.text() } }
    def    cdata()                           { children.grep( Cdata ).with { r -> r.size() == 1 ? r.head() : r } }
    def    comments()                        { children.grep( Comment ).with { r -> r.size() == 1 ? r.head() : r } }
    def    getProperty( String n )           { children.grep( Node ).findAll { it.@name == n }.with { r -> r.size() == 1 ? r.head() : r } }
    AttributeList attributes()               { new AttributeList( children.grep( Attribute ) ) }
    String toString()                        { "[$name: ${children}]" }
}

class AttributeList {
    List content
    AttributeList( List a ) { this.content = a }
    def getProperty( String n )                      { content.findAll { it.key == n }.value.with { r -> r.size() == 1 ? r.head() : r } }
}

class Attribute {
    @Delegate Map$Entry content
    public Attribute( a, b ) { content = [ (a): b ].entrySet().find() }
    String toString()        { "[ATTRIBUTE ${getKey()}: $value]" }
}

class Instruction {
    @Delegate Map$Entry content
    public Instruction( a, b ) { content = [ (a): b ].entrySet().find() }
    String toString()          { "[INSTRUCTION ${getKey()}: $value]" }
}

class Namespace {
    @Delegate Map$Entry content
    public Namespace( a, b ) { content = [ (a): b ].entrySet().find() }
    String toString()        { "[NS '${getKey()}': '$value']" }
}

class Text {
    @Delegate String content
    public Text( String content ) { this.content = content }
    String text()                 { content }
    String toString()             { "[TEXT: $content]" }
}

class Comment {
    @Delegate String content
    public Comment( String content ) { this.content = content }
    String text()                    { content }
    String toString()                { "[COMMENT: $content]" }
}

class Cdata {
    @Delegate String content
    public Cdata( String content ) { this.content = content }
    String text()                  { content }
    String toString()              { "[CDATA: $content]" }
}

class ToMapParser extends DefaultHandler implements LexicalHandler {
    boolean trimCharacters = true
    Map     ns             = [:]
    boolean validating     = true
    boolean namespaceAware = true

    private boolean inCdata = false
    private Node    root    = null
    private def     current = root
    private Stack   parent  = []

    private attrsToList( Attributes a ) {
        (0..<a.length).collect {
            new Attribute( a.getLocalName( it ) ?: a.getQName( it ), a.getValue( it ) )
        }
    }

    void setDocumentLocator( Locator l ) { }
    void startDocument() throws SAXException { }

    void endDocument() throws SAXException {
        ns.each { key, uri ->
            root.add( 0, new Namespace( key, uri ) )
        }
    }

    void startElement(String namespaceURI, String sName, String qName, Attributes attrs) throws SAXException {
        def tokens = qName.split( ':', 2 )
        if( namespaceAware && tokens.length > 1 ) {
            ns[ tokens[ 0 ] ] = namespaceURI
        }
        else if( namespaceAware && namespaceURI ) {
            ns[ '' ] = namespaceURI
        }
        def child = []
        child.addAll attrsToList( attrs )
        if( !root ) {
            root = new Node( qName, child )
            current = child
            parent << root
        }
        else {
          current << new Node( qName, child )
          parent << current
          current = child
        }
    }

    void endElement(String namespaceURI, String sName, String qName ) throws SAXException { current = parent.pop() }

    void characters(char[] buf, int offset, int len) throws SAXException {
        new String( buf, offset, len ).with { s ->
            if( trimCharacters ) s = s.trim()
            if( inCdata ) {
                current << new Cdata( s )
            }
            else if( s ) {
                current << new Text( s )
            }
        }
    }

    void processingInstruction(String target, String data) throws SAXException { current << new Instruction( target, data ) }
    void startCDATA() throws SAXException { inCdata = true }
    void endCDATA() throws SAXException { inCdata = false }

    void comment( char[] ch, int start, int length ) throws SAXException {
        new String( ch, start, length ).with { s ->
            if( trimCharacters ) s = s.trim()
            current << new Comment( s )
        }
    }

    void ignorableWhitespace( char[] buf, int offset, int len ) throws SAXException { throw new UnsupportedOperationException( "Ignorable Whitespace?" ) }
    void startEntity( String name ) throws SAXException { throw new UnsupportedOperationException( "Don't do entities, yet" ) }
    void endEntity( String name ) throws SAXException { throw new UnsupportedOperationException( "Don't do entities, yet" ) }
    void startDTD( String name, String publicId, String systemId ) throws SAXException { throw new UnsupportedOperationException( "Don't do DTDs, yet" ) }
    void endDTD() throws SAXException { throw new UnsupportedOperationException( "Don't do DTDs, yet" ) }

    Node parseXml( String xml ) {
        SAXParserFactory.newInstance().with { factory ->
            factory.validating = this.validating
            factory.namespaceAware = this.namespaceAware
            newSAXParser().with { parser ->
                parser.XMLReader.with { reader ->
                    parser.setProperty( "http://xml.org/sax/properties/lexical-handler", this )
                    new StringReader( xml ).with { sr ->
                        parse( new InputSource( sr ), this )
                        root
                    }
                }
            }
        }
    }
}

def xml = '''<?xml version='1.0' encoding='UTF-8'?>
            |<dan root='true' xmlns="http://ns.com/doc" xmlns:tim="http://ns.com/tim">
            |  <?php echo $a; ?>
            |  <values id='1'>
            |    Tim
            |    <!-- A couple of nodes -->
            |    <tim:node id='2'>Dave</tim:node>
            |    <node id='3'>Alice</node>
            |    <![CDATA[ <embedded a='1'>Yes</embedded> ]]>
            |    <sub/>
            |    Yates
            |  </values>
            |  <last attr='Mr Woods'/>
            |</dan>'''.stripMargin()

Node n = new ToMapParser().parseXml( xml )

assert n.name() == 'dan'
assert n.values.node.attributes().id == '3'
assert n.values.cdata().text() == "<embedded a='1'>Yes</embedded>"
assert n.values.text() == [ 'Tim', 'Yates' ]
assert n.values.node.text() == 'Alice'

But now, we're getting away from the original intent, and rewriting XmlParser :frowning:

timyates avatar Jul 31 '13 11:07 timyates

I thought about this too, and I came to the same conclusion. It would be nice to have an API on top of that very neat data structure that supports comments and CDATA, etc, but then we're off into a whole different world.

However, it is very nice that your solution (n-1 solutions ago) does serialize nicely to JSON, and can be easily recomposed to XML. The downside is that a high fluency in the data structure is required to be able to use it (i.e. what all the tags mean).
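A quick sketch of the JSON half of that round trip, using `groovy.json.JsonOutput` and `groovy.json.JsonSlurper`. The `structure` literal is a cut-down, hand-written sample in the n-1 shape, not real parser output.

```groovy
import groovy.json.JsonOutput
import groovy.json.JsonSlurper

// Cut-down sample in the list-of-single-entry-maps shape.
def structure = [values: [
    ['.id': '1'],
    ['#': 'Tim'],
    [node: [['.id': '2'], ['#': 'Dave']]],
    [sub: []],
    ['#': 'Yates']
]]

def json = JsonOutput.toJson(structure)
def restored = new JsonSlurper().parseText(json)

// Lists keep their order and maps keep their keys, so nothing is lost.
assert restored == structure
```

Because the structure only uses maps, lists and strings, it survives the trip through JSON unchanged.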

Things really get hairy when we get into dealing with nodes that have embedded content. What do you think about deciding that, when those nodes are identified, we just dump the text() for that field as the value of the property? Essentially just saying we're not going to support that.

I like that solution because it gives the "power" back to the API consumer, saying that if there is markup embedded in the document, you can handle that in whatever way you see fit. May be useful for attaching meta-data, which is nicely embedded in XML, to content that is used for display & layout. This would bring us back to the "more sensible approach" to dealing with mixed-form XML content.

Thoughts?

danveloper avatar Jul 31 '13 13:07 danveloper

Things really get hairy when we get into dealing with nodes that have embedded content.

Which nodes do you mean? The CDATA ones, or the <a>text<b>and</b>text</a> ones?

It might be worth thinking about that last code, as I think it actually gives you more information back than both the Xml Slurper and Parser as they currently stand... Obviously, this is XML so my code is probably horribly simplistic and XmlParser probably does a much better job with most of it ;-)

But I see what you mean about the XML->Map->Json->Map->XML path... the (n-1) solution does seem to work (so long --as you say-- as you know what all the secret incantation symbols mean) ;-)

timyates avatar Jul 31 '13 14:07 timyates

If we (and I really mean "you", because you're way over my head at this point :-) ) go the direction of building improvements to the XmlSlurper and XmlParser, then they are probably suitable enhancements for groovy-core, whereas our (and I really mean "your", because my small contribution is but a drop in the bucket) extensions project should be meant to provide extensions that make use of existing APIs and concepts. Am I way off on this thought train?

For the above question, I mean the latter. As you well know, we can already get CDATA out of XML, but the embedded markup makes representing the source content in a Map structure a difficult task. If we just say screw it for embedded content, and we pretend as though that was intended to be wrapped in CDATA, then I think we're much better off, and the extension is much more usable.

And then also, the roundtrip lifecycle becomes more easily accomplished.

danveloper avatar Jul 31 '13 14:07 danveloper

I agree ;-)

And currently, the n-1 solution converts that CDATA block to:

                 ['¢':'<embedded a=\'1\'>Yes</embedded>'],

So, as you say, it just grabs the content as a single chunk.

I guess I'll work on a "bizarre map to xml" converter ;-)
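A rough sketch of what such a converter might look like for the n-1 shape. The `toXml` method is hypothetical; it assumes the key conventions used above ('.' prefix for attributes, '#' text, '¢' CDATA, '//' comments, '?' instructions), skips namespace entries, and does no escaping.

```groovy
// Hypothetical converter: list of single-entry maps -> XML string.
// No escaping of text or attribute values is attempted in this sketch.
String toXml(String name, List content) {
    def attrs = new StringBuilder()
    def body  = new StringBuilder()
    content.each { Map entry ->
        entry.each { key, value ->
            if (key == '#')               body << value
            else if (key == '¢')          body << "<![CDATA[$value]]>"
            else if (key == '//')         body << "<!--$value-->"
            else if (key == '?')          value.each { t, d -> body << "<?$t $d?>" }
            else if (key == '!')          { /* namespaces are ignored here */ }
            else if (key.startsWith('.')) attrs << " ${key.substring(1)}='$value'"
            else                          body << toXml(key, value)   // child element
        }
    }
    // Self-close elements that have no content.
    body.length() ? "<$name$attrs>$body</$name>" : "<$name$attrs/>"
}

def content = [['.id': '1'], ['#': 'Tim'],
               ['node': [['.id': '2'], ['#': 'Dave']]],
               ['sub': []], ['#': 'Yates']]

assert toXml('values', content) ==
    "<values id='1'>Tim<node id='2'>Dave</node><sub/>Yates</values>"
```

Whitespace that was trimmed during parsing is not restored, so the result is equivalent to the source rather than identical to it.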

timyates avatar Jul 31 '13 15:07 timyates

Awesome :-)

So you do think it's ok to preserve the source text() value in the case where markup is identified within a tag, right?

ie.

<user><var>Tim <emphasis role='bold'>Yates</emphasis></var></user>

would turn out to be:

[user: [var: "Tim <emphasis role='bold'>Yates</emphasis>" ]]

Are we on the same page with that, or no?

danveloper avatar Jul 31 '13 15:07 danveloper

So currently Project n-1 returns that as:

['user':[['var':[['#':'Tim'], ['emphasis':[[.role:'bold'], ['#':'Yates']]]]]]]

Interesting idea to basically have a rule that

if we hit a node with text in it, consider it a leaf and treat its contents as CDATA

It would simplify things nicely :-)

I would also propose binning all namespace information, comments and processing instructions... :question:

The tricky bit is actually extracting the Tim <emphasis role='bold'>Yates</emphasis> as a String, as the SAX parser will blunder on firing events for it all...
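One way this might be approached, as a sketch only: let the SAX events fire, but once inside the chosen element, write them back out into a buffer instead of building structure. The `InnerMarkupHandler` class and its `target` property are hypothetical, the target element is named up front rather than detected, and text and attribute values are not re-escaped.

```groovy
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.InputSource
import org.xml.sax.helpers.DefaultHandler

// Hypothetical handler: rebuilds the markup found inside one named element.
class InnerMarkupHandler extends DefaultHandler {
    String target                          // element whose content is kept verbatim
    StringBuilder buffer = new StringBuilder()
    int depth = 0                          // 0 = outside target, 1 = directly inside it

    void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (depth > 0) {
            // Nested markup: write the start tag back out as text.
            buffer << "<$qName"
            (0..<attrs.length).each { i ->
                buffer << " ${attrs.getQName(i)}='${attrs.getValue(i)}'"
            }
            buffer << '>'
            depth++
        } else if (qName == target) {
            depth = 1
        }
    }

    void endElement(String uri, String localName, String qName) {
        if (depth > 1) buffer << "</$qName>"
        if (depth > 0) depth--
    }

    void characters(char[] ch, int start, int length) {
        if (depth > 0) buffer << new String(ch, start, length)
    }
}

def xml = "<user><var>Tim <emphasis role='bold'>Yates</emphasis></var></user>"
def handler = new InnerMarkupHandler(target: 'var')
SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), handler)

def map = [user: ['var': handler.buffer.toString()]]
assert map == [user: ['var': "Tim <emphasis role='bold'>Yates</emphasis>"]]
```

The rebuilt string normalizes attribute quoting to single quotes, so it matches the source here only because the source used them too.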

timyates avatar Jul 31 '13 15:07 timyates

I wanted to write some code to demonstrate this, but I figured it ultimately wasn't necessary.

This same conversation that we're having was discussed on xml.com back in 2006.

http://www.xml.com/pub/a/2006/05/31/converting-between-xml-and-json.html

Looks like they came to the same conclusions that we have :-), but they also provide some "standard" (?) ways of tackling the text, annotation, etc... eerily similar to what you've come up with.

danveloper avatar Aug 05 '13 19:08 danveloper

That's spooky :scream_cat:

Last thing I was trying was to extract the text for structured elements

Will try again as soon as I get a chance

timyates avatar Aug 05 '13 19:08 timyates