Traverser icon indicating copy to clipboard operation
Traverser copied to clipboard

Post-order traversal on a medium-size and large graph

Open amatiushkin opened this issue 3 years ago • 0 comments

Traverser context builder defines a context strategy (see TraverseContextBuilder.ContextStrategy), which helps to control result during traversal.

In case of post order traversal, the action to set result is invoked once traversal of a current node is complete (hence the name). On a medium and large graphs this final termination could happen very far from start/root node, the path from start to this node could easily include thousands of intermediate nodes.

Current implementation of NORMAL_STRATEGY propagates result from current context to parent, and then from parent to grandparent, which utilize application call stack.

On a JVM with standard configuration (512K), this could generate StackOverflowError once call stack is exceeded. In my case, it took 14-15K nodes in a path to cause the error.

There are several ways to mitigate it.

Increase JVM stack size

Use configuration to increase stack size with java -Xss1M (10M, etc). This is usually enough to enable traversal on a graph with millions nodes.

Create custom context strategy

Another way to avoid stack overflow is to handle result differently by utilizing custom context strategy.

This example is for illustration purposes, it assumes default configuration and expects custom strategy. It gives up ability to call strategy per every result, therefore it is not general enough.

        private static final ContextStrategy CUSTOM_NORMAL_STRATEGY = new ContextStrategy() {
            @Override
            public <U> void setResult(TraverseContextBuilder<?, ?, ?> outer, U result) {
                outer.result = result;
                rootContext(outer).setResult(result);
            }

            @Override
            public <U> U getResult(TraverseContextBuilder<?, ?, ?> outer) {
                return rootContext(outer).getResult();
            }

            private TraverseContextBuilder<?, ?, ?> rootContext(TraverseContextBuilder<?, ?, ?> outer) {
                TraverseContextBuilder<?, ?, ?> parent = outer.parentContext;
                while(parent.contextStrategy == CUSTOM_NORMAL_STRATEGY) { // walk till ROOT
                    parent = parent.parentContext;
                }
                return parent;
            }

            @Override
            public <U> U getVar(TraverseContextBuilder<?, ?, ?> outer, Class<? super U> key) {
                U value;
                return ((value = (U)outer.vars.get(key)) != null || outer.vars.containsKey(key))
                        ? value
                        : outer.parentContext.getVar(key);
            }

            @Override
            public <U> U setVar(TraverseContextBuilder<?, ?, ?> outer, Class<? super U> key, U newValue) {
                return outer.vars.containsKey(key)
                        ? (U)outer.vars.put(key, newValue)
                        : outer.parentContext.setVar(key, newValue);
            }

            @Override
            public String toString() {
                return "CUSTOM_NORMAL_STRATEGY";
            }
        };

Find very unit test to illustrate above in my private gist.

Question to the community

Traverser is designed as a general-purpose library and some limitations and tradeoffs needs to be made.

What is more important for default configuration: ability to traverse large graphs or functionality set? Shall it strive to have both?

PS This is practical question for anyone, who operates with linked and connected data on a scale of wikidata (~100M "entities").

amatiushkin avatar May 31 '22 23:05 amatiushkin