java-client-api
DOMHandle reuse fields like LSParser
While doing some performance tests in our codebase, I noticed that DOMHandle#receiveContent takes up about 50% of the time of a request to MarkLogic. In particular, the calls to createLSParser and newDocumentBuilder took most of that time.
I attached two flamegraphs which you can open in a browser to navigate (hover flamebars to see % time in that function call).
unpatched.svg shows ~53% of the time in receiveContent; of that, ~14% is createLSParser, ~20% is newDocumentBuilder, and the rest is a legitimate DOMParser#parse call (which I of course cannot get rid of).
patched.svg shows ~20% of the time in receiveContent, and all of that is parsing.
The patched version was run with the code from this PR.
Tested with this code:
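import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.io.DOMHandle;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.SearchHandle;
import com.marklogic.client.io.StringHandle;
import com.marklogic.client.io.marker.StructureWriteHandle;
import com.marklogic.client.query.ExtractedItem;
import com.marklogic.client.query.ExtractedResult;
import com.marklogic.client.query.MatchDocumentSummary;
import com.marklogic.client.query.QueryManager;
import com.marklogic.client.query.RawCombinedQueryDefinition;
import com.marklogic.client.query.StructuredQueryBuilder;

public class Main {
    public static void main(String[] args) {
        final DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8010, new DatabaseClientFactory.DigestAuthContext("admin", "admin"));
        final QueryManager qM = client.newQueryManager();
        // return all matches in a single page
        qM.setPageLength(Integer.MAX_VALUE);
        final StructuredQueryBuilder sqb = qM.newStructuredQueryBuilder();
        // combined query: directory query plus options that extract the full documents
        final StructureWriteHandle structureWriteHandle = new StringHandle(
            "" +
            "<search:search xmlns:search=\"http://marklogic.com/appservices/search\">" +
            sqb.and(sqb.directory(1, "/sfwordings/")).serialize() +
            "  <search:options>" +
            "    <search:extract-document-data selected=\"all\"/>" +
            "  </search:options>" +
            "</search:search>"
        ).withFormat(Format.XML);
        final RawCombinedQueryDefinition def = qM.newRawCombinedQueryDefinition(structureWriteHandle);
        // time 20 identical searches
        long start = System.currentTimeMillis();
        for (int i = 0; i < 20; i++) {
            doSearch(qM, def);
        }
        System.out.println(System.currentTimeMillis() - start);
    }

    private static void doSearch(QueryManager qM, RawCombinedQueryDefinition def) {
        final SearchHandle search = qM.search(def, new SearchHandle(), 1);
        final DOMHandle handle = new DOMHandle();
        final MatchDocumentSummary[] results = search.getMatchResults();
        for (MatchDocumentSummary summary : results) {
            final ExtractedResult extracted = summary.getExtracted();
            if (extracted == null || extracted.isEmpty()) {
                continue;
            }
            for (ExtractedItem item : extracted) {
                // this is the crucial call -> invokes DOMHandle#receiveContent
                item.get(handle).get();
            }
        }
    }
}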
Problem
If you have a large result set, item.get(handle).get(); may be called a lot of times, in my case more than 1000 times. Every call of item.get(handle).get(); invokes DOMHandle#receiveContent, and every call to DOMHandle#receiveContent creates a new LSParser.
LSParser seems to be "cacheable". The constructed LSParser is always configured the same way as long as no custom resolver or factory is configured. In the PR I tried to reuse a default factory, document builder, and LSParser.
This leads to a performance gain of about 10% in my environment (for queries which use a DOMHandle).
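For illustration, the reuse idea could look roughly like the sketch below. This is a minimal standalone example, not the code from the PR: `CachedLsParser` is a hypothetical name, and the factory settings (namespace-aware, non-validating) are assumptions rather than the configuration DOMHandle actually uses. It creates the factory, document builder and LSParser once and then reuses the same parser for every document.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

/** Hypothetical helper (not part of java-client-api): builds the parser once and reuses it. */
final class CachedLsParser {
    private final DOMImplementationLS domImpl;
    private final LSParser parser;

    CachedLsParser() {
        try {
            // Assumed default configuration; the real DOMHandle settings may differ.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(false);
            // The expensive calls (newDocumentBuilder, createLSParser) happen only here.
            domImpl = (DOMImplementationLS) factory.newDocumentBuilder().getDOMImplementation();
            parser = domImpl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create default DOM parser", e);
        }
    }

    /** Parses one document with the cached parser; only a cheap LSInput is created per call. */
    Document parse(InputStream content) {
        LSInput input = domImpl.createLSInput();
        input.setByteStream(content);
        return parser.parse(input);
    }

    /** Convenience overload for in-memory XML. */
    Document parse(String xml) {
        return parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

An LSParser is not safe to share between threads, so an instance like this should stay confined to one thread or one handle. A custom resolver or factory would still need its own, separately built parser.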
If the opportunity arises, this fix would improve the performance of DOM processing.
DOM will probably always be the least efficient way to process XML in Java, so this is probably a lower priority.
Closing based on the note from ehennum that addressing DOM performance is a very low priority. Using JDOM2 is recommended instead.