rdf4j GH-3407 sail level query optimizers

GitHub issue resolved: #3407

Briefly describe the changes proposed in this PR:

Add functionality to the EvaluationStrategyFactory in order to inject new optimisers at runtime.

This functionality should consist of:

a functional interface (supplier) that is provided with EvaluationStrategy strategy, TripleSource tripleSource, EvaluationStatistics evaluationStatistics in order to instantiate a QueryOptimizer
methods on the EvaluationStrategyFactory to add instances of the above mentioned interface (with the ability to specify if the new QueryOptimizer should be inserted at the beginning or end of the current pipeline)
getters for the above
getter for the pipeline in EvaluationStrategy

Create a Value optimiser for the MemoryStore to swap out values in the query with the "interned" value from the MemValueFactory.

PR Author Checklist (see the contributor guidelines for more details):

[ ] my pull request is self-contained
[ ] I've added tests for the changes I made
[ ] I've applied code formatting (you can use mvn process-resources to format from the command line)
[ ] I've squashed my commits where necessary
[ ] every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

Nov 09 '21 10:11 hmottestad

@jeenbroekstra you want to take a look?

Some things I'm uncertain about:

naming
how best to handle updating the Value in a Binding in a BindingSet
if I should extract the new optimizer from the MemoryStore into it's own file

Nov 09 '21 10:11 hmottestad

Sounds good, I'll try and take a closer look later this week.

Nov 09 '21 20:11 abrokenjester

@jeenbroekstra maybe it's easier if I name it something referring to "inject", as in that we can inject optimizers into the existing pipeline.

Nov 15 '21 13:11 hmottestad

I feel this is not quite the right direction for the OptimizerPipeline. I love the idea of store related optimizers but I think this is not a responsibility of the EvaluationStrategy. An query should be optimized before it is given to the EvaluationStrategy to evaluate. There should basically be a QueryOptimizerPipeline per store that builds on the Standard one. Separating this would make it easier to customize the behaviour of the optimizer pipelines. Also less shared state in the EvaluationStrategy would be nice, while cheap building the optimizer pipleline would now become something that must be done for each query instead of once for each store/connection. An other option is to inject this in a EvaluationStrategyFactory with a simple constructor call.

Nov 17 '21 20:11 JervenBolleman

How would I optimize the query before passing it to the evaluation factory?

Nov 17 '21 20:11 hmottestad

Also I think that we are currently building a new evaluation strategy and optimizer pipeline for each query.

Nov 17 '21 21:11 hmottestad

I would just have an optimizer pipeline object and encourage it to be called before the query/tuplexpr is passed in. Something like this in SailSourceConnection

// inside evaluateInternal
TripleSource tripleSource = new SailDatasetTripleSource(vf, rdfDataset);
EvaluationStrategy strategy = getEvaluationStrategy(dataset, tripleSource);
...
//		tupleExpr = strategy.optimize(tupleExpr, store.getEvaluationStatistics(), bindings);
// instead do
tupleExpr = new StandardQueryOptimizerPipeline().optimize(tupleExpr);
// or even better
tupleExpr = getOptimizerPipeline().optimize(tupleExpr);

then getOptimizerPipeline() could be something like

protected QueryOptimizerPipeline getOptimizerPipeline(){
return new QueryOptimizerPipeline(){
@Override
	public Iterable<QueryOptimizer> getOptimizers() {
		return Arrays.asList(
				new BindingAssigner(),
				new BindingSetAssignmentInliner(),
				new ConstantOptimizer(strategy),
				new RegexAsStringFunctionOptimizer(tripleSource.getValueFactory()),
				new CompareOptimizer(),
				new ConjunctiveConstraintSplitter(),
				new DisjunctiveConstraintOptimizer(),
				new SameTermFilterOptimizer(),
				new UnionScopeChangeOptimizer(),
				new QueryModelNormalizer(),
				new QueryJoinOptimizer(evaluationStatistics),
				new IterativeEvaluationOptimizer(),
				new FilterOptimizer(),
				new OrderLimitOptimizer(),
				new ParentReferenceCleaner(),
                                new ReplaceValuesWithStoreNativeInstances(getValueFactory())); // your new optimizer step
	}
} 
}

or just a final variable of the connection

What do you think?

Nov 17 '21 21:11 JervenBolleman

It's currently the strategy that is responsible for optimizing the query. I still feel that it's a natural place to add the extension to support sail level optimizers.

Nov 18 '21 10:11 hmottestad

@hmottestad okay, sounds good to me.

Nov 18 '21 10:11 JervenBolleman

I am ok with this as is. @hmottestad would you mind testing out how this codes by updating the Fedx optimization steps with this approach. As there the strategy is not new per query but retrieved from a federation manager.

Nov 20 '21 14:11 JervenBolleman

I am ok with this as is. @hmottestad would you mind testing out how this codes by updating the Fedx optimization steps with this approach. As there the strategy is not new per query but retrieved from a federation manager.

Fedex doesn't use the EvaluationStrategyFactory interface. So I'm not quite sure where to get started there.

Nov 20 '21 18:11 hmottestad

I really like the actual optimizer step by the way, that is super sweet.

I am ok with this as is. @hmottestad would you mind testing out how this codes by updating the Fedx optimization steps with this approach. As there the strategy is not new per query but retrieved from a federation manager.

Fedex doesn't use the EvaluationStrategyFactory interface. So I'm not quite sure where to get started there.

I opened #3483 to discuss the FedX issue of it not using an EvaluationStrategyFactory. Maybe we can fix that and then this issue is cleaner to apply? what do you think? Somehow I feel that there is cleanup to be done in this section and then the code ends up nicer. Maybe if we make the EvaluationStrategyFactory less mutable I think the OptimizerPipelines might have all the required variables at initialization and we don't need the methods getQueryOptimizersPre etc.

Still that can all be done in a different pull request and there is no need to do that work in this one.

Nov 21 '21 16:11 JervenBolleman

The way the optimizer pipeline can be set from the StrictEvaluationStrategyFactory when instantiating a StrictEvaluationStrategy is quite odd. It first instantiates the strategy, then checks if it should override the pipeline. Instead of just passing in the optimizer pipeline.

Issue is that it doesn't have access to the variables that the StandardQueryOptimizerPipeline needs.

Would probably be a lot cleaner if the pipeline was stored in the EvaluationStrategyFactory and then passed into the EvaluationStrategy.

Maybe a QueryOptimizerPipelineFactory that we set in the EvaluationStrategyFactory, which then generates the pipeline when we instantiate the EvaluationStrategy. The QueryOptimizerPipelineFactory could then cache the optimizers it wants to and reuse them, or create new ones if there are optimizers that are dependent on the transactions state.

Then we can extend a StandardQueryOptimizerPipelineFactory to add our own optimizers to a specific Sail.

Nov 21 '21 18:11 hmottestad

Yep, I want to try the pipeline-factory approach.

Nov 27 '21 06:11 hmottestad

Marked this as stale to signal that the PR hasn't been active for a while and that we should consider closing it.

Jul 24 '22 07:07 hmottestad

rdf4j rdf4j copied to clipboard

GH-3407 sail level query optimizers

Briefly describe the changes proposed in this PR:

rdf4j
rdf4j copied to clipboard