groovy-common-extensions
a more sensible approach to map coercion from a nodechild
I think this is a more sensible approach to the NodeChild -> Map translation.
Before, we were taking an XML structure with nested elements and embedding them within a specified 'child' tag, so XML that looked like:
<dan id="1234">
<key value="val" />
<key value="val2" />
</dan>
Mapped out to:
assert map == [dan: [_children: [ [key: [value: "val" ] ], [key: [value: "val2" ] ] ] ] ]
This change proposes to change this to:
assert map == [dan: [ key: [ [value: "val"], [value: "val2"] ] ] ]
This change also makes the JSON conversion an accurate representation of the source XML structure.
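As a rough illustration of the proposed shape (a sketch only, not the code in this pull request): the hypothetical `attrsAndChildren` helper below copies attributes into the map and collects repeated child elements into a list under one key. It assumes Groovy 3 or later, where `XmlParser` lives in `groovy.xml`; on older versions it is in `groovy.util` and needs no import.

```groovy
import groovy.json.JsonOutput
import groovy.xml.XmlParser   // assumption: Groovy 3+; drop this import on older versions

// Hypothetical helper, not the extension's real implementation.
// Attributes become entries; repeated child elements are grouped into a list.
def attrsAndChildren(groovy.util.Node node) {
    def m = [:]
    node.attributes().each { k, v -> m[k.toString()] = v }
    node.children().findAll { it instanceof groovy.util.Node }.each { child ->
        def key = child.name().toString()
        def value = attrsAndChildren(child)
        if (!m.containsKey(key)) {
            m[key] = value                 // first occurrence: store as-is
        } else if (m[key] instanceof List) {
            m[key] << value                // third and later: append
        } else {
            m[key] = [m[key], value]       // second occurrence: promote to a list
        }
    }
    m
}

def xml = '<dan id="1234"><key value="val"/><key value="val2"/></dan>'
def root = new XmlParser().parseText(xml)
def map = [(root.name().toString()): attrsAndChildren(root)]
def json = JsonOutput.toJson(map)

assert map == [dan: [id: '1234', key: [[value: 'val'], [value: 'val2']]]]
println json
```

Note that this sketch ignores element text entirely and lets an attribute and a child element with the same name collide.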
Thoughts?
What happens to
<dan cool="yes">
<cool>maybe</cool>
</dan>
Apart from the guy who came up with the schema getting their hand slammed in a drawer? ;-)
That was a great question because it actually revealed the issue with the code where I wasn't handling values that didn't come from attributes. I think the latest commit is a more appropriate approach to this.
It's not as functional as I'd like, as state has to be maintained (and modified) while traversing the graph, but it does work. Can you think of a more direct way of doing that? It might just be the nature of this feature...
It seems I have no idea how we can collaborate on a pull request, so I've opened https://github.com/timyates/groovy-common-extensions/pull/8 to show how we could support both the old way and the new... But is it getting too complex now?
Actually, ignore that, it fails with
<dan id="1234" attr="attrValue">
<key value="val">
<key>Tim</key>
</key>
<key value="val2" />
</dan>
What should:
<dan id="1234" attr="attrValue">
<key value="val" />
<key value="val2" />
<values id='1234'>
<value id='a'>One</value>
<value id='b'>Two</value>
<value id='c'>Three</value>
</values>
</dan>
Return?
It currently gives:
[ dan: [ id: '1234',
         attr: 'attrValue',
         key: [ [ value: 'val' ],
                [ value: 'val2' ] ],
         values: [ id: '1234',
                   value: [ [ id: 'a' ], [ id: 'b' ], [ id: 'c' ] ] ] ] ]
With no mention of One, Two or Three (I think this was broken before as well :frowning: )
XML->Maps is horrible no? ;-)
Yes, quite. I'm not sure how the text() should be represented in the map data structure.
It's fine when there's no attributes, but then it sort of falls apart :confused:
I'm ok with following the way that Perl handles this when coercing XML to a Reference Hash data structure:
<dan>
<value attr='123' />
</dan>
$VAR1 = {
'value' => {
'attr' => '123'
}
};
---
<dan>
<value attr='123'>abc</value>
</dan>
$VAR1 = {
'value' => {
'content' => 'abc',
'attr' => '123'
}
};
---
<dan>
<value content='123'>abc</value>
</dan>
$VAR1 = {
'value' => {
'content' => [
'123',
'abc'
]
}
};
---
Basically taking the inner value and representing it as 'content'.
What does it do with:
<dan>
<value>abc</value>
</dan>
?
<dan>
<value>abc</value>
</dan>
$VAR1 = {
'value' => 'abc'
};
---
So, basically, if the content is available, that key will be used; if 'content' is also specified as an attribute key, it'll be reused.
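To make that rule concrete on the Groovy side, here is a minimal sketch that mirrors the four Perl outputs above. The `coerce` helper is hypothetical (it is not part of the extension), repeated child names simply overwrite each other to keep it short, and the import assumes Groovy 3 or later.

```groovy
import groovy.xml.XmlParser   // assumption: Groovy 3+; drop this import on older versions

// Hypothetical helper following the 'content' convention quoted above.
def coerce(groovy.util.Node node) {
    def attrs = node.attributes().collectEntries { k, v -> [(k.toString()): v] }
    def text = node.children().findAll { it instanceof String }.join('')
    def kids = node.children().findAll { it instanceof groovy.util.Node }

    // Text-only element: collapse to the bare string
    if (!attrs && !kids) return text

    def m = [:] + attrs
    // Simplification: a repeated child name overwrites the earlier one
    kids.each { m[it.name().toString()] = coerce(it) }
    if (text) {
        // If an attribute is already called 'content', keep both values in a list
        m.content = m.containsKey('content') ? [m.content, text] : text
    }
    m
}

def parse = { String s -> coerce(new XmlParser().parseText(s)) }

assert parse("<dan><value attr='123'/></dan>") == [value: [attr: '123']]
assert parse("<dan><value attr='123'>abc</value></dan>") == [value: [attr: '123', content: 'abc']]
assert parse("<dan><value content='123'>abc</value></dan>") == [value: [content: ['123', 'abc']]]
assert parse("<dan><value>abc</value></dan>") == [value: 'abc']
```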
Or, there's the Perl XML::Hash::LX method which gives:
use strict;
use XML::Hash::LX;
use Data::Dumper;
my $xml=<<'EOF';
<dan>
<value content='123'>abc</value>
</dan>
EOF
my $hash = xml2hash $xml, attr => '.', text => '~';
print Dumper($hash);
as
$VAR1 = {
'dan' => {
'value' => {
'~' => 'abc',
'.content' => '123'
}
}
};
Or again, there's EasyTree which gives the (imho uglier)
use strict;
use XML::Parser;
use XML::Parser::EasyTree;
use Data::Dumper;
$XML::Parser::EasyTree::Noempty=1;
my $xml=<<'EOF';
<dan>
<value content='123'>abc</value>
</dan>
EOF
my $p=new XML::Parser(Style=>'EasyTree');
my $tree=$p->parse($xml);
print Dumper($tree);
as
$VAR1 = [
{
'content' => [
{
'content' => [
{
'content' => 'abc',
'type' => 't'
}
],
'name' => 'value',
'attrib' => {
'content' => '123'
},
'type' => 'e'
}
],
'name' => 'dan',
'attrib' => {},
'type' => 'e'
}
];
I guess we should also consider what happens with:
<dan>
<child>
Tim
<node/>
Dan
</child>
</dan>
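For reference, a small sketch of what `XmlParser` itself hands back for that mixed-content case: the text runs and the element come through as one ordered list of children, which is exactly the ordering a plain Map cannot keep. The import assumes Groovy 3 or later, and `trimWhitespace` is set explicitly so the result does not depend on the version's default.

```groovy
import groovy.xml.XmlParser   // assumption: Groovy 3+; drop this import on older versions

def parser = new XmlParser()
parser.trimWhitespace = true   // set explicitly rather than relying on the default

def child = parser.parseText('<dan><child>\n Tim\n <node/>\n Dan\n</child></dan>').child[0]

// Strings are text runs; anything else is an element node
def kinds = child.children().collect { it instanceof String ? it : '<' + it.name() + '/>' }

assert kinds == ['Tim', '<node/>', 'Dan']
```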
PS: Reading this back, I should clarify that I'm not being an arse :wink: I just reckon if we get this right, we'll never have to think this deeply about xml again :laughing:
You're not being an arse; I didn't even think about your comments being taken that way until you've just said that. (Also, you should know by now that I'm not so soft ;-)).
This is a tough problem to solve. Ultimately, the goal is to map to a data structure that accurately represents the source content...
Although verbose, I do kind of like the last Perl solution, because it's pretty explicit about what/where things are, which means that it becomes easier to programmatically reproduce the source content.
That said, I don't like it because it's very verbose and very strict in its definitions. I think that edge-cases will have to be vetted under that model before we can commit to it.
Nested content models become a much bigger deal. I'm not sure we'll be able to accurately represent them without some other meta-data in the Map that describes "layout". And when we go down this road, we're starting to get away from the base idea, which is to represent XML as a Map. We may need to bail on tags nested in content altogether, or represent the content of that tag w/ the embedded XML.
Thoughts?
I'll have a think...
Came up with this in the meantime:
def xml = '''<dan root='true'>
| <values id='1'>
| Tim
| <node id='2'>Dave</node>
| <node id='3'>Alice</node>
| <sub/>
| Yates
| </values>
| <last attr='Mr Woods'/>
|</dan>'''.stripMargin()
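def toMap( node, params=[:] ) {
    String attrPrefix = params.attr ?: '.'
    String textHolder = params.text ?: '#text'
    String textConcat = params.conc ?: '+'
    // Text children arrive as plain Strings
    if( node instanceof String ) {
        return node
    }
    def m = [:]
    node.attributes()?.each { k, v ->
        m << [ (attrPrefix + k): v ]
    }
    node.children().each {
        def r = toMap( it, params )
        if( r instanceof String ) {
            // Join separate runs of text with the concatenation marker
            m[ textHolder ] = m[ textHolder ] ? "${m[textHolder]}$textConcat$r".toString() : r
        }
        else if( m[ it.name() ] instanceof List ) {
            // Third and later repeats: append to the existing list
            m[ it.name() ] << r
        }
        else if( m.containsKey( it.name() ) ) {
            // Second occurrence: promote the single value to a list
            m[ it.name() ] = [ m[ it.name() ], r ]
        }
        else {
            m[ it.name() ] = r
        }
    }
    m
}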
assert toMap( new XmlParser().parseText( xml ) ) == [ '.root' : 'true',
'values' : [ '.id':'1', '#text':'Tim+Yates', 'node':[ ['.id':'2', '#text':'Dave'],
['.id':'3', '#text':'Alice'] ], 'sub':[:] ],
'last':[ '.attr' : 'Mr Woods' ] ]
Of course, XmlParser and XmlSlurper not having a concept of <![CDATA[ ]]> or comments doesn't help... Think we have to start with a SAX DefaultHandler with an implemented LexicalHandler if we're going to do it right ;-)
ie:
import org.xml.sax.*
import org.xml.sax.helpers.DefaultHandler
import org.xml.sax.ext.LexicalHandler
import javax.xml.parsers.SAXParserFactory
import javax.xml.parsers.ParserConfigurationException
import javax.xml.parsers.SAXParser
class ToMapParser extends DefaultHandler implements LexicalHandler {
private attrsToList( Attributes a ) {
(0..<a.length).collectEntries {
[ (a.getLocalName( it ) ?: a.getQName( it )):a.getValue( it ) ]
}
}
void setDocumentLocator(Locator l) {
println "LOCATOR: SYS ID: ${l.systemId}"
}
void startDocument() throws SAXException {
println "START DOCUMENT"
}
void endDocument() throws SAXException {
println "END DOCUMENT"
}
void startElement(String namespaceURI, String sName, String qName, Attributes attrs) throws SAXException {
println "START ELEMENT $namespaceURI, $sName, $qName, ${attrsToList( attrs )}"
}
void endElement(String namespaceURI, String sName, String qName ) throws SAXException {
println "END ELEMENT $namespaceURI, $sName, $qName"
}
void characters(char[] buf, int offset, int len) throws SAXException {
String s = new String(buf, offset, len);
println "CHARS $s"
}
void ignorableWhitespace( char[] buf, int offset, int len ) throws SAXException {
String s = new String(buf, offset, len);
println "IGNORE $s"
}
void processingInstruction(String target, String data) throws SAXException {
println "INSTR $target $data"
}
// Lexical Handler stuff
void comment( char[] ch, int start, int length ) throws SAXException {
String text = new String( ch, start, length )
println "COMMENT: $text"
}
void startCDATA() throws SAXException {
println "Start CDATA"
}
void endCDATA() throws SAXException {
println "End CDATA"
}
void startEntity( String name ) throws SAXException {
println "Start ENTITY $name"
}
void endEntity( String name ) throws SAXException {
println "End ENTITY $name"
}
void startDTD( String name, String publicId, String systemId ) throws SAXException {
println "Start DTD"
}
void endDTD() throws SAXException {
println "End DTD"
}
}
def xml = '''<dan root='true'>
| <values id='1'>
| Tim
| <node id='2'>Dave</node>
| <node id='3'>Alice</node>
| <![CDATA[ <embedded a='1'>Yes</embedded> ]]>
| <sub/>
| Yates
| </values>
| <last attr='Mr Woods'/>
|</dan>'''.stripMargin()
SAXParserFactory.newInstance().with { factory ->
validating = true
newSAXParser().with { parser ->
parser.XMLReader.with { reader ->
new ToMapParser().with { handler ->
parser.setProperty( "http://xml.org/sax/properties/lexical-handler", handler )
new StringReader( xml ).with { sr ->
parse( new InputSource( sr ), handler )
}
}
}
}
}
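For comparison, a tiny sketch of what plain `XmlParser` does with the same kinds of input, as far as I can tell: the comment never reaches the tree, and the CDATA content is merged into the surrounding text with nothing to mark where it started or ended. The import assumes Groovy 3 or later.

```groovy
import groovy.xml.XmlParser   // assumption: Groovy 3+; drop this import on older versions

def node = new XmlParser().parseText('<a>x<!-- hidden --><![CDATA[<b/>]]>y</a>')

// One text child only: the comment is gone and the CDATA boundaries are lost
assert node.children() == ['x<b/>y']
assert node.text() == 'x<b/>y'
```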
Maybe something that takes this:
<?xml version='1.0' encoding='UTF-8'?>
<dan root='true'>
<?php echo $a; ?>
<values id='1'>
Tim
<node id='2'>Dave</node>
<node id='3'>Alice</node>
<![CDATA[ <embedded a='1'>Yes</embedded> ]]>
<sub/>
Yates
</values>
<last attr='Mr Woods'/>
</dan>
And takes defaults of # to denote TEXT, € to denote CDATA, ? to denote instructions, and . to prefix attributes to give:
[ dan: [ [ '.root': 'true' ],
[ '?': [ 'php': 'echo $a;' ] ],
[ 'values' : [ [ '.id': '1' ],
[ '#' : 'Tim' ],
[ 'node' : [ [ '.id': '2' ],
[ '#' : 'Dave' ] ] ],
[ 'node' : [ [ '.id': '3' ],
[ '#' : 'Alice' ] ] ],
[ '€' : "<embedded a='1'>Yes</embedded>" ],
[ 'sub' : null ],
[ '#' : 'Yates' ] ] ] ],
[ 'last' : [ [ '.attr' : 'Mr Woods' ] ] ] ]
Is workable? Not sure yet if I like the output structure (Lists of Maps), and not sure if this is actually more useful than the original XML, but I can't see how else we can represent the data at the moment...
ie: this:
import org.xml.sax.*
import org.xml.sax.helpers.DefaultHandler
import org.xml.sax.ext.LexicalHandler
import javax.xml.parsers.SAXParserFactory
import javax.xml.parsers.ParserConfigurationException
import javax.xml.parsers.SAXParser
class ToMapParser extends DefaultHandler implements LexicalHandler {
boolean trimCharacters = true
String textKey = '#'
String cdataKey = '¢'
String attrKey = '.'
String nsKey = '!'
String instKey = '?'
String commentKey = '//'
Map ns = [:]
boolean validating = true
boolean namespaceAware = true
private boolean inCdata = false
private Map root = [:]
private def current = root
private Stack parent = []
private attrsToList( Attributes a ) {
(0..<a.length).collect {
[ (attrKey + (a.getLocalName( it ) ?: a.getQName( it ))):a.getValue( it ) ]
}
}
void setDocumentLocator( Locator l ) { }
void startDocument() throws SAXException { }
void endDocument() throws SAXException {
ns.each { key, uri ->
root[ root.keySet()[0] ].add( 0, [ (nsKey): [ (key): uri ] ] )
}
}
void startElement(String namespaceURI, String sName, String qName, Attributes attrs) throws SAXException {
def tokens = qName.split( ':', 2 )
if( namespaceAware && tokens.length > 1 ) {
ns[ tokens[ 0 ] ] = namespaceURI
}
else if( namespaceAware && namespaceURI ) {
ns[ '' ] = namespaceURI
}
def child = []
child.addAll attrsToList( attrs )
if( !root ) {
root[ qName ] = child
current = child
parent << root
}
else {
current << [ (qName) : child ]
parent << current
current = child
}
}
void endElement(String namespaceURI, String sName, String qName ) throws SAXException { current = parent.pop() }
void characters(char[] buf, int offset, int len) throws SAXException {
new String( buf, offset, len ).with { s ->
if( trimCharacters ) s = s.trim()
if( inCdata ) {
current << [ (cdataKey): s ]
}
else if( s ) {
current << [ (textKey): s ]
}
}
}
void processingInstruction(String target, String data) throws SAXException { current << [ (instKey): [ (target): data ] ] }
void startCDATA() throws SAXException { inCdata = true }
void endCDATA() throws SAXException { inCdata = false }
void comment( char[] ch, int start, int length ) throws SAXException {
new String( ch, start, length ).with { s ->
if( trimCharacters ) s = s.trim()
current << [ (commentKey): s ]
}
}
void ignorableWhitespace( char[] buf, int offset, int len ) throws SAXException { throw new UnsupportedOperationException( "Ignorable Whitespace?" ) }
void startEntity( String name ) throws SAXException { throw new UnsupportedOperationException( "Don't do entities, yet" ) }
void endEntity( String name ) throws SAXException { throw new UnsupportedOperationException( "Don't do entities, yet" ) }
void startDTD( String name, String publicId, String systemId ) throws SAXException { throw new UnsupportedOperationException( "Don't do DTDs, yet" ) }
void endDTD() throws SAXException { throw new UnsupportedOperationException( "Don't do DTDs, yet" ) }
Map parseXml( String xml ) {
SAXParserFactory.newInstance().with { factory ->
factory.validating = this.validating
factory.namespaceAware = this.namespaceAware
newSAXParser().with { parser ->
parser.XMLReader.with { reader ->
parser.setProperty( "http://xml.org/sax/properties/lexical-handler", this )
new StringReader( xml ).with { sr ->
parse( new InputSource( sr ), this )
root
}
}
}
}
}
}
def xml = '''<?xml version='1.0' encoding='UTF-8'?>
|<dan root='true' xmlns="http://ns.com/doc" xmlns:tim="http://ns.com/tim">
| <?php echo $a; ?>
| <values id='1'>
| Tim
| <!-- A couple of nodes -->
| <tim:node id='2'>Dave</tim:node>
| <node id='3'>Alice</node>
| <![CDATA[ <embedded a='1'>Yes</embedded> ]]>
| <sub/>
| Yates
| </values>
| <last attr='Mr Woods'/>
|</dan>'''.stripMargin()
new ToMapParser().parseXml( xml )
Returns the map:
['dan':[ ['!':['tim':'http://ns.com/tim']],
         ['!':['':'http://ns.com/doc']],
         ['.root':'true'],
         ['?':['php':'echo $a; ']],
         ['values':[ ['.id':'1'],
                     ['#':'Tim'],
                     ['//':'A couple of nodes'],
                     ['tim:node':[ ['.id':'2'],
                                   ['#':'Dave']]],
                     ['node':[ ['.id':'3'],
                               ['#':'Alice']]],
                     ['¢':'<embedded a=\'1\'>Yes</embedded>'],
                     ['sub':[]],
                     ['#':'Yates']]],
         ['last':[['.attr':'Mr Woods']]]]]
Hideous or genius? You decide ;-)
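To give a feel for what consuming that structure looks like, here is a small sketch with three hypothetical closures (`attributes`, `texts` and `children` are names made up for this example) working on a hand-copied subset of the output above.

```groovy
// Subset of the structure shown above, copied by hand for illustration
def dan = [['.root': 'true'],
           ['values': [['.id': '1'],
                       ['#': 'Tim'],
                       ['node': [['.id': '3'], ['#': 'Alice']]],
                       ['#': 'Yates']]],
           ['last': [['.attr': 'Mr Woods']]]]

// Hypothetical helpers: each element's content is a List of single-entry Maps
def attributes = { List content ->
    def attrEntries = content.findAll { it.keySet().every { k -> k.startsWith('.') } }
    // Strip the '.' prefix and merge everything into one map
    attrEntries.collectEntries { e -> e.collectEntries { k, v -> [(k.substring(1)): v] } }
}
def texts = { List content -> content.findAll { it.containsKey('#') }*.get('#') }
def children = { List content, String name -> content.findAll { it.containsKey(name) }*.get(name) }

def values = children(dan, 'values')[0]

assert attributes(dan) == [root: 'true']
assert attributes(values) == [id: '1']
assert texts(values) == ['Tim', 'Yates']
assert texts(children(values, 'node')[0]) == ['Alice']
```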
Of course, the natural progression for this is:
import org.xml.sax.*
import org.xml.sax.helpers.DefaultHandler
import org.xml.sax.ext.LexicalHandler
import javax.xml.parsers.SAXParserFactory
import javax.xml.parsers.ParserConfigurationException
import javax.xml.parsers.SAXParser
import groovy.transform.*
class Node {
String name
ArrayList children
public Node( String name, List content ) { this.name = name ; this.children = content }
def add( int idx, a ) { children.add( idx, a ) }
List children() { children }
String name() { name }
def namespaces() { children.grep( Namespace ).with { r -> r.size() == 1 ? r.head() : r } }
def instructions() { children.grep( Instruction ).with { r -> r.size() == 1 ? r.head() : r } }
def text() { children.grep( Text ).with { r -> r.size() == 1 ? r.head().text() : r*.text() } }
def cdata() { children.grep( Cdata ).with { r -> r.size() == 1 ? r.head() : r } }
def comments() { children.grep( Comment ).with { r -> r.size() == 1 ? r.head() : r } }
def getProperty( String n ) { children.grep( Node ).findAll { it.@name == n }.with { r -> r.size() == 1 ? r.head() : r } }
AttributeList attributes() { new AttributeList( children.grep( Attribute ) ) }
String toString() { "[$name: ${children}]" }
}
class AttributeList {
List content
AttributeList( List a ) { this.content = a }
def getProperty( String n ) { content.findAll { it.key == n }.value.with { r -> r.size() == 1 ? r.head() : r } }
}
class Attribute {
@Delegate Map$Entry content
public Attribute( a, b ) { content = [ (a): b ].entrySet().find() }
String toString() { "[ATTRIBUTE ${getKey()}: $value]" }
}
class Instruction {
@Delegate Map$Entry content
public Instruction( a, b ) { content = [ (a): b ].entrySet().find() }
String toString() { "[INSTRUCTION ${getKey()}: $value]" }
}
class Namespace {
@Delegate Map$Entry content
public Namespace( a, b ) { content = [ (a): b ].entrySet().find() }
String toString() { "[NS '${getKey()}': '$value']" }
}
class Text {
@Delegate String content
public Text( String content ) { this.content = content }
String text() { content }
String toString() { "[TEXT: $content]" }
}
class Comment {
@Delegate String content
public Comment( String content ) { this.content = content }
String text() { content }
String toString() { "[COMMENT: $content]" }
}
class Cdata {
@Delegate String content
public Cdata( String content ) { this.content = content }
String text() { content }
String toString() { "[CDATA: $content]" }
}
class ToMapParser extends DefaultHandler implements LexicalHandler {
boolean trimCharacters = true
Map ns = [:]
boolean validating = true
boolean namespaceAware = true
private boolean inCdata = false
private Node root = null
private def current = root
private Stack parent = []
private attrsToList( Attributes a ) {
(0..<a.length).collect {
new Attribute( a.getLocalName( it ) ?: a.getQName( it ), a.getValue( it ) )
}
}
void setDocumentLocator( Locator l ) { }
void startDocument() throws SAXException { }
void endDocument() throws SAXException {
ns.each { key, uri ->
root.add( 0, new Namespace( key, uri ) )
}
}
void startElement(String namespaceURI, String sName, String qName, Attributes attrs) throws SAXException {
def tokens = qName.split( ':', 2 )
if( namespaceAware && tokens.length > 1 ) {
ns[ tokens[ 0 ] ] = namespaceURI
}
else if( namespaceAware && namespaceURI ) {
ns[ '' ] = namespaceURI
}
def child = []
child.addAll attrsToList( attrs )
if( !root ) {
root = new Node( qName, child )
current = child
parent << root
}
else {
current << new Node( qName, child )
parent << current
current = child
}
}
void endElement(String namespaceURI, String sName, String qName ) throws SAXException { current = parent.pop() }
void characters(char[] buf, int offset, int len) throws SAXException {
new String( buf, offset, len ).with { s ->
if( trimCharacters ) s = s.trim()
if( inCdata ) {
current << new Cdata( s )
}
else if( s ) {
current << new Text( s )
}
}
}
void processingInstruction(String target, String data) throws SAXException { current << new Instruction( target, data ) }
void startCDATA() throws SAXException { inCdata = true }
void endCDATA() throws SAXException { inCdata = false }
void comment( char[] ch, int start, int length ) throws SAXException {
new String( ch, start, length ).with { s ->
if( trimCharacters ) s = s.trim()
current << new Comment( s )
}
}
void ignorableWhitespace( char[] buf, int offset, int len ) throws SAXException { throw new UnsupportedOperationException( "Ignorable Whitespace?" ) }
void startEntity( String name ) throws SAXException { throw new UnsupportedOperationException( "Don't do entities, yet" ) }
void endEntity( String name ) throws SAXException { throw new UnsupportedOperationException( "Don't do entities, yet" ) }
void startDTD( String name, String publicId, String systemId ) throws SAXException { throw new UnsupportedOperationException( "Don't do DTDs, yet" ) }
void endDTD() throws SAXException { throw new UnsupportedOperationException( "Don't do DTDs, yet" ) }
Node parseXml( String xml ) {
SAXParserFactory.newInstance().with { factory ->
factory.validating = this.validating
factory.namespaceAware = this.namespaceAware
newSAXParser().with { parser ->
parser.XMLReader.with { reader ->
parser.setProperty( "http://xml.org/sax/properties/lexical-handler", this )
new StringReader( xml ).with { sr ->
parse( new InputSource( sr ), this )
root
}
}
}
}
}
}
def xml = '''<?xml version='1.0' encoding='UTF-8'?>
|<dan root='true' xmlns="http://ns.com/doc" xmlns:tim="http://ns.com/tim">
| <?php echo $a; ?>
| <values id='1'>
| Tim
| <!-- A couple of nodes -->
| <tim:node id='2'>Dave</tim:node>
| <node id='3'>Alice</node>
| <![CDATA[ <embedded a='1'>Yes</embedded> ]]>
| <sub/>
| Yates
| </values>
| <last attr='Mr Woods'/>
|</dan>'''.stripMargin()
Node n = new ToMapParser().parseXml( xml )
assert n.name() == 'dan'
assert n.values.node.attributes().id == '3'
assert n.values.cdata().text() == "<embedded a='1'>Yes</embedded>"
assert n.values.text() == [ 'Tim', 'Yates' ]
assert n.values.node.text() == 'Alice'
But now, we're getting away from the original intent, and rewriting XmlParser :frowning:
I thought about this too, and I came to the same conclusion. It would be nice to have an API on top of that very neat data structure that supports comments and CDATA, etc, but then we're off into a whole different world.
However, it is very nice that your solution (n-1 solutions ago) does serialize nicely to JSON, and can be easily recomposed to XML. The downside is that a high fluency in the data structure is required to be able to use it (i.e. what all the tags mean).
Things really get hairy when we get into dealing with nodes that have embedded content. What do you think about deciding that, when those nodes are identified, we just dump the text() for that field as the value of the property? Essentially, we'd be saying we're not going to support that.
I like that solution because it gives the "power" back to the API consumer, saying that if there is markup embedded in the document, you can handle that in whatever way you see fit. It may be useful for attaching meta-data, which is nicely embedded in XML, to content that is used for display & layout. This would bring us back to the "more sensible approach" to dealing with mixed-form XML content.
Thoughts?
> Things really get hairy when we get into dealing with nodes that have embedded content.
Which nodes do you mean? The CDATA ones, or the <a>text<b>and</b>text</a> ones?
It might be worth thinking about that last code, as I think it actually gives you more information back than both the Xml Slurper and Parser as they currently stand... Obviously, this is XML so my code is probably horribly simplistic and XmlParser probably does a much better job with most of it ;-)
But I see what you mean about the XML->Map->Json->Map->XML path... the (n-1) solution does seem to work (so long --as you say-- as you know what all the secret incantation symbols mean) ;-)
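A quick sketch of the Map->Json->Map leg of that path, using a hand-written fragment of the n-1 structure and Groovy's bundled `groovy.json` classes. It only checks that the structure survives the trip; it says nothing about getting back to XML.

```groovy
import groovy.json.JsonOutput
import groovy.json.JsonSlurper

// Hand-written fragment in the list-of-maps shape from the n-1 solution
def values = [['.id': '1'],
              ['#': 'Tim'],
              ['node': [['.id': '2'], ['#': 'Dave']]],
              ['sub': []],
              ['#': 'Yates']]

def json = JsonOutput.toJson(values)
def back = new JsonSlurper().parseText(json)

// Keys, nesting and ordering all come back unchanged
assert back == values
```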
If we (and I really mean "you", because you're way over my head at this point :-) ) go the direction of building improvements to the XmlSlurper and XmlParser, then they are probably suitable enhancements for groovy-core, whereas our (and I really mean "your", because my small contribution is but a drop in the bucket) extensions project should be meant to provide extensions that make use of existing APIs and concepts. Am I way off on this thought train?
For the above question, I mean the latter. As you well know, we can already get CDATA out of XML, but the embedded markup makes representing the source content in a Map structure a difficult task. If we just say screw it for embedded content, and we pretend as though that was intended to be wrapped in CDATA, then I think we're much better off, and the extension is much more usable.
And then also, the roundtrip lifecycle becomes more easily accomplished.
I agree ;-)
And currently, the n-1 solution converts that CDATA block to:
['¢':'<embedded a=\'1\'>Yes</embedded>'],
So, as you say, it just grabs the content as a single chunk.
I guess I'll work on a "bizarre map to xml" converter ;-)
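A first rough cut of such a converter might look like the sketch below. The `toXml` method is hypothetical, it assumes the default marker keys from the n-1 solution ('#', '¢', '//', '?', '!' and the '.' attribute prefix), and it does no escaping of text or attribute values.

```groovy
// Hypothetical converter for the list-of-maps structure; no escaping is done.
String toXml(String name, List content) {
    def attrs = new StringBuilder()
    def body = new StringBuilder()
    content.each { Map entry ->
        entry.each { key, value ->
            if (key == '#') body.append(value)                               // text
            else if (key == '¢') body.append("<![CDATA[${value}]]>")        // CDATA
            else if (key == '//') body.append("<!--${value}-->")            // comment
            else if (key == '?') value.each { t, d -> body.append("<?${t} ${d}?>") }
            else if (key == '!') value.each { p, u ->
                // namespace declaration: empty prefix means the default namespace
                attrs.append(p ? " xmlns:${p}='${u}'" : " xmlns='${u}'")
            }
            else if (key.startsWith('.')) attrs.append(" ${key.substring(1)}='${value}'")
            else body.append(toXml(key, value))                             // child element
        }
    }
    body.length() ? "<${name}${attrs}>${body}</${name}>" : "<${name}${attrs}/>"
}

def tree = [['.id': '1'], ['#': 'Tim'],
            ['node': [['.id': '2'], ['#': 'Dave']]],
            ['sub': []], ['#': 'Yates']]

assert toXml('values', tree) == "<values id='1'>Tim<node id='2'>Dave</node><sub/>Yates</values>"
```

Because the parser trimmed the text on the way in, the whitespace between the text runs and the child elements is not restored.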
Awesome :-)
So you do think it's ok to preserve the source text() value in the case where markup is identified within a tag, right?
ie.
<user><var>Tim <emphasis role='bold'>Yates</emphasis></var></user>
would turn out to be:
[user: [var: "Tim <emphasis role='bold'>Yates</emphasis>" ]]
Are we on the same page with that, or no?
So currently Project n-1 returns that as:
['user':[['var':[['#':'Tim'], ['emphasis':[['.role':'bold'], ['#':'Yates']]]]]]]
Interesting idea to basically have a rule that:
> if we hit a node with text in it, consider it a leaf and treat its contents as CDATA
It would simplify things nicely :-)
I would also propose binning all namespace information, comments and processing instructions... :question:
The tricky bit is actually extracting the Tim <emphasis role='bold'>Yates</emphasis> as a String, as the SAX parser will blunder on firing events for it all...
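One possible way round that, sketched under assumptions: let the SAX events fire, but re-serialize them into a buffer while inside the chosen element. The `InnerMarkupHandler` class and its `target` property are made up for this example; the output is a reconstruction rather than the original bytes (attribute quoting is normalised and nothing is re-escaped).

```groovy
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.InputSource
import org.xml.sax.helpers.DefaultHandler

// Hypothetical handler: rebuilds the markup found inside each `target` element.
class InnerMarkupHandler extends DefaultHandler {
    String target                 // qName of the element to treat as a leaf
    List<String> captured = []    // one rebuilt String per target element
    private StringBuilder buf = null
    private int depth = 0

    void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (buf != null) {
            // Inside the target: write the nested start tag back out
            depth++
            buf.append('<').append(qName)
            for (int i = 0; i < attrs.length; i++) {
                buf.append(' ').append(attrs.getQName(i)).append("='").append(attrs.getValue(i)).append("'")
            }
            buf.append('>')
        } else if (qName == target) {
            buf = new StringBuilder()
            depth = 0
        }
    }

    void endElement(String uri, String localName, String qName) {
        if (buf == null) return
        if (depth == 0) {
            // Closing tag of the target itself: store what was collected
            captured << buf.toString()
            buf = null
        } else {
            depth--
            buf.append('</').append(qName).append('>')
        }
    }

    void characters(char[] ch, int start, int length) {
        if (buf != null) buf.append(ch, start, length)
    }
}

def handler = new InnerMarkupHandler(target: 'var')
def xml = "<user><var>Tim <emphasis role='bold'>Yates</emphasis></var></user>"
SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(new StringReader(xml)), handler)

assert handler.captured == ["Tim <emphasis role='bold'>Yates</emphasis>"]
```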
I wanted to write some code to demonstrate this, but I figured it ultimately wasn't necessary.
This same conversation that we're having was discussed on xml.com back in 2006.
http://www.xml.com/pub/a/2006/05/31/converting-between-xml-and-json.html
Looks like they came to the same conclusions that we have :-), but they also provide some "standard" (?) ways of tackling the text, annotation, etc... eerily similar to what you've come up with.
That's spooky :scream_cat:
The last thing I was trying was to extract the text for structured elements.
Will try again as soon as I get a chance.