
DOMHandle reuse fields like LSParser

Open maffelbaffel opened this issue 6 years ago • 1 comments

While running some performance tests in our codebase, I noticed that DOMHandle#receiveContent takes up about 50% of a request to MarkLogic. In particular, the calls to createLSParser and newDocumentBuilder took most of that time.

I attached two flamegraphs which you can open in a browser to navigate (hover flamebars to see % time in that function call).

  • unpatched.svg shows ~53% of the time in receiveContent. Of that, ~14% is createLSParser, ~20% is newDocumentBuilder, and the rest is a legitimate DOMParser#parse call (which I of course cannot get rid of).
  • patched.svg shows ~20% of the time in receiveContent, all of which is parsing.

The patched version was run with the code from this PR.

Tested with this code:

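import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.io.DOMHandle;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.SearchHandle;
import com.marklogic.client.io.StringHandle;
import com.marklogic.client.io.marker.StructureWriteHandle;
import com.marklogic.client.query.ExtractedItem;
import com.marklogic.client.query.ExtractedResult;
import com.marklogic.client.query.MatchDocumentSummary;
import com.marklogic.client.query.QueryManager;
import com.marklogic.client.query.RawCombinedQueryDefinition;
import com.marklogic.client.query.StructuredQueryBuilder;

public class Main {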

    public static void main(String[] args) {
        final DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8010, new DatabaseClientFactory.DigestAuthContext("admin", "admin"));

        final QueryManager qM = client.newQueryManager();
        qM.setPageLength(Integer.MAX_VALUE);
        final StructuredQueryBuilder sqb = qM.newStructuredQueryBuilder();
        final StructureWriteHandle structureWriteHandle = new StringHandle(
            "" +
                "<search:search xmlns:search=\"http://marklogic.com/appservices/search\">" +
                sqb.and(sqb.directory(1, "/sfwordings/")).serialize() +
                "   <search:options>" +
                "       <search:extract-document-data selected=\"all\"/>" +
                "   </search:options>" +
                "</search:search>"
        ).withFormat(Format.XML);
        final RawCombinedQueryDefinition def = qM.newRawCombinedQueryDefinition(structureWriteHandle);

        long start = System.currentTimeMillis();
        for (int i = 0; i < 20; i++) {
            doSearch(qM, def);
        }
        System.out.println(System.currentTimeMillis() - start);
    }

    private static void doSearch(QueryManager qM, RawCombinedQueryDefinition def) {
        final SearchHandle search = qM.search(def, new SearchHandle(), 1);

        final DOMHandle handle = new DOMHandle();
        final MatchDocumentSummary[] results = search.getMatchResults();
        for (MatchDocumentSummary summary : results) {
            final ExtractedResult extracted = summary.getExtracted();

            if (extracted == null || extracted.isEmpty()) {
                continue;
            }

            for (ExtractedItem item : extracted) {
                // this is the crucial call -> invokes DOMHandle#receiveContent
                item.get(handle).get();
            }
        }
    }
}

Problem: If you have a large result set, item.get(handle).get(); may be called a lot of times, in my case more than 1000 times. Every call of item.get(handle).get(); invokes DOMHandle#receiveContent, and every call to DOMHandle#receiveContent creates a new LSParser.

LSParser seems to be "cacheable". The constructed LSParser is always configured the same way, as long as no custom resolver or factory is configured. In the PR I tried to reuse a default factory, document builder, and LSParser.
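The caching idea can be sketched with plain JDK classes. This is only an illustration under stated assumptions, not the code from the PR or from DOMHandle: the class name CachedLSParserSketch and its methods are hypothetical, the parser is created lazily once and reused for every parse, and custom resolvers or factories are not handled.

```java
import java.util.Objects;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

// Hypothetical helper (not part of the MarkLogic client API): builds the
// DOMImplementationLS and LSParser once and reuses them for later parses.
// Not thread-safe; one instance should be used by one thread at a time.
public class CachedLSParserSketch {
    private DOMImplementationLS domImpl;
    private LSParser parser;
    private int parsersCreated; // counts how often the expensive setup ran

    private LSParser getParser() {
        if (parser == null) {
            try {
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Assumption: the JDK's built-in DOM implementation supports
                // the Load/Save (LS) feature, so this cast succeeds.
                domImpl = (DOMImplementationLS) factory.newDocumentBuilder().getDOMImplementation();
                parser = domImpl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
                parsersCreated++;
            } catch (ParserConfigurationException e) {
                throw new IllegalStateException("Could not create DOM parser", e);
            }
        }
        return parser;
    }

    // Parses the given XML text with the cached parser and returns a new Document.
    public Document parse(String xml) {
        Objects.requireNonNull(xml, "xml");
        LSParser p = getParser();
        LSInput input = domImpl.createLSInput();
        input.setStringData(xml);
        return p.parse(input);
    }

    public int parsersCreated() {
        return parsersCreated;
    }

    public static void main(String[] args) {
        CachedLSParserSketch sketch = new CachedLSParserSketch();
        for (int i = 0; i < 1000; i++) {
            sketch.parse("<doc n=\"" + i + "\"/>");
        }
        System.out.println("parsers created: " + sketch.parsersCreated());
    }
}
```

Each parse still returns its own Document; only the factory, builder, and parser setup is shared. A real change would also need to fall back to a fresh parser whenever a custom resolver or factory is set.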

This leads to a performance gain of about 10% in my environment (for queries that use a DOMHandle).

flamegraph.zip

maffelbaffel, Feb 02 '19 15:02

If the opportunity arises, this fix would improve the performance of DOM processing.

DOM will probably always be the least efficient way to process XML in Java, so this is probably a lower priority.

ehennum, Feb 15 '22 19:02

Closing based on the note from ehennum that addressing DOM performance is a very low priority. Using JDOM2 is recommended instead.

rjrudin, Nov 17 '22 15:11