jdom
Handlers
Instead of parsing the whole XML (from a file, for example) into a Document, one could register different handlers for different XPath queries. Once a matching part of the XML has been parsed, that part can be detached, so the resulting Document is much smaller or even empty.
Example:
- XML:
<rows>
<row>row 1</row>
<row>row 1</row>
...<row>row n</row>
</rows>
- java code:
SAXBuilder builder = new SAXBuilder();
builder.addHandler("/rows/row", new ElementHandler() {
    public void onElement(Element element) {
        // now you can get attributes, values, children of specific element.
        // This method is, for the above example, called n times.
        element.detach(); // signal that this element should not be included in the final Document
    }
});
// builder.addHandler can be called multiple times for different XPath queries.
Document doc = builder.build(new File("example.xml"));
// at this point Document does not have any row Elements.
Using this approach, parsing large files would not consume much RAM, and we could still use all the advantages of DOM trees.
It's true that parsing the entire DOM consumes memory. On the other hand, in instances where memory is a problem, traditional SAX building and processing just the events would be the most efficient (speed and memory).
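For reference, a minimal sketch of the "process just the events" approach, using only the JDK's SAX parser and no JDOM classes. The class name RowEventCounter and the element name row are assumptions taken from the example above; nothing is retained in memory except the text of the row currently being read.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: handles each <row> as its events arrive,
// without ever building a tree for the whole document.
class RowEventCounter extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inRow = false;
    int rowCount = 0;
    int totalTextLength = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default factory is not namespace-aware, so qName carries the element name.
        if ("row".equals(qName)) {
            inRow = true;
            text.setLength(0); // start collecting text for this row only
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inRow) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("row".equals(qName)) {
            // The row is complete here; use it and let it go.
            rowCount++;
            totalTextLength += text.length();
            inRow = false;
        }
    }

    // Parses the given XML text and returns the handler holding the totals.
    static RowEventCounter process(String xml) {
        RowEventCounter handler = new RowEventCounter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        RowEventCounter c = process("<rows><row>row 1</row><row>row 2</row></rows>");
        System.out.println(c.rowCount + " rows, " + c.totalTextLength + " characters of text");
    }
}
```

The trade-off is the one described above: this is cheap in memory, but each row is only available as raw events rather than as a tree that can be navigated.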
There is a compromise available in JDOM 2.x: the StAX-based builder (StAXStreamBuilder) allows you to filter events and to produce a List of content that matches the filter.
Finally, it is always possible to create your own JDOMFactory to pass to the SAXBuilder, and have the JDOMFactory manipulate the JDOM tree as it is constructed, in order to reduce the preserved nodes.
As an additional note, it is fairly hard, if not impossible, to use XPaths as the document is being built. Only a limited set of XPath expressions would be possible (direct paths, etc.). Relative paths, neighbor values, calculations, sizes, values, and the data other XPath queries depend on would not necessarily exist at the moment the decision is made to keep or prune the path.
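To illustrate why direct paths are the feasible subset, here is a minimal sketch using the JDK's StAX reader (not JDOM). The class DirectPathTracker and its method are hypothetical names. At each start-element event, the only information available is the chain of ancestor names, which is enough to compare against a path such as /rows/row but not to evaluate anything that depends on later siblings or on the element's own content.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: counts elements whose absolute path equals targetPath.
class DirectPathTracker {
    static int countMatches(String xml, String targetPath) {
        Deque<String> ancestors = new ArrayDeque<>(); // root first, current element last
        int matches = 0;
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    ancestors.addLast(reader.getLocalName());
                    // Only the names from the root down to this element are known here.
                    String current = "/" + String.join("/", ancestors);
                    if (current.equals(targetPath)) {
                        matches++;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    ancestors.removeLast();
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return matches;
    }

    public static void main(String[] args) {
        String xml = "<rows><row>a</row><row>b</row><other><row>c</row></other></rows>";
        System.out.println(countMatches(xml, "/rows/row"));
    }
}
```

A predicate such as "rows that are followed by a footer" could not be decided at this point, because the footer has not been read yet when the row's start event arrives.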
Your feature request is, in part, supported through other mechanisms, and would be very hard, if not impossible, to support in the way you describe. Am I missing something?
XPath that only supports direct paths would be enough, since, as you said, it is almost impossible to use full XPath features while the tree is still under construction. I don't know the internal structure of JDOM, but I don't think it would be that hard to implement, since dom4j has had this feature for many years. Please look at "How does dom4j handle very large XML documents?" at http://dom4j.sourceforge.net/dom4j-1.6.1/faq.html#large-doc
Let's consider a Java8-version of the build process where it would be nice to be able to do:
SAXBuilder builder = new SAXBuilder();
Document doc = builder.build("path/to/data.xml");
doc.stream("/xpath/expression", Filters.element()).parallel().forEach(....);
(as an aside, this new feature you are suggesting won't be in a 2.0.x version of JDOM, but a 2.1 version, which may as well require Java8 .... right?)
What you are suggesting is: "How about taking that concept of streaming the content back a step, and integrating it with the build?"
Let's separate it from the SAXBuilder API for compatibility reasons, and make a SAXStreamer API, which is similar to the builder, but takes an additional "FlowControl" system that identifies tokens in the stream:
FlowControl<Element> control = new XPathFlowControl<>("/xpath/expression", Filters.element());
SAXStreamer<Element> streamer = new SAXStreamer<>(control);
streamer.stream("path/to/data.xml").forEach(element -> System.out.println(element.getValue()));
hmmm, that could be done as a generic method:
FlowControl<Element> control = new XPathFlowControl<>("/xpath/expression", Filters.element());
SAXStreamer streamer = new SAXStreamer();
streamer.stream("path/to/data.xml", control).forEach(element -> System.out.println(element.getValue()));
Having a callback mechanism for that would be simple to extend:
@FunctionalInterface
public interface ContentHandler<T extends Content> {
    public void handleContent(T content);
}
then you could have the code:
public <T extends Content> void events(FlowControl<T> control, *Source* source, ContentHandler<T> handler) {
    stream(control, source).forEach(content -> handler.handleContent(content));
}
I can see the potential in this, but it will need to be specced out a bunch more.
I think this is it. Just please make sure that I can use multiple FlowControl objects, so that for example:
<rows>
<header> ...</header>
<details><detail>...</detail><detail>...</detail>...</details>
<footer>...</footer>
</rows>
I could handle /rows/header and /rows/details during one building of a tree.
Maybe something like:
ContentHandler<Element> headerHandler = header -> { /* handle the header element */ };
ContentHandler<Element> detailsHandler = details -> { /* handle the details element */ };
streamer.addHandler("/rows/header", headerHandler);
streamer.addHandler("/rows/details", detailsHandler);
streamer.stream(....);
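A minimal sketch of this multi-handler idea, built on the JDK's SAX parser rather than on JDOM. PathDispatcher, addHandler and stream are hypothetical names that only mirror the shape proposed above; they are not an existing JDOM API. To keep the sketch short, each handler receives the text content of the matched element instead of a JDOM Element, and registered paths are assumed not to be nested inside one another.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: several direct paths, each with its own callback,
// all served during a single pass over the document.
class PathDispatcher extends DefaultHandler {
    private final Map<String, Consumer<String>> handlers = new LinkedHashMap<>();
    private final Deque<String> path = new ArrayDeque<>();
    private StringBuilder capture; // non-null only while inside a matched element
    private String capturePath;    // the registered path currently being captured

    void addHandler(String directPath, Consumer<String> handler) {
        handlers.put(directPath, handler);
    }

    private String currentPath() {
        return "/" + String.join("/", path);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.addLast(qName);
        if (capture == null && handlers.containsKey(currentPath())) {
            capturePath = currentPath();
            capture = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (capture != null) {
            capture.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (capture != null && currentPath().equals(capturePath)) {
            // The matched element is complete: hand it over, then drop it.
            handlers.get(capturePath).accept(capture.toString());
            capture = null;
            capturePath = null;
        }
        path.removeLast();
    }

    void stream(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), this);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        PathDispatcher dispatcher = new PathDispatcher();
        dispatcher.addHandler("/rows/header", text -> System.out.println("header: " + text));
        dispatcher.addHandler("/rows/details", text -> System.out.println("details: " + text));
        dispatcher.stream("<rows><header>H</header>"
                + "<details><detail>a</detail><detail>b</detail></details>"
                + "<footer>F</footer></rows>");
    }
}
```

In a real JDOM integration the captured unit would presumably be a detached Element subtree rather than a String, which is the part that would still need to be specced out.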