rdf4j
GH-3407 sail level query optimizers
GitHub issue resolved: #3407
Briefly describe the changes proposed in this PR:
Add functionality to the EvaluationStrategyFactory in order to inject new optimisers at runtime.
This functionality should consist of:
- a functional interface (supplier) that is provided with EvaluationStrategy strategy, TripleSource tripleSource, EvaluationStatistics evaluationStatistics in order to instantiate a QueryOptimizer
- methods on the EvaluationStrategyFactory to add instances of the above-mentioned interface (with the ability to specify whether the new QueryOptimizer should be inserted at the beginning or end of the current pipeline)
- getters for the above
- getter for the pipeline in EvaluationStrategy
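The pieces listed above could fit together roughly as follows. This is a minimal sketch with stand-in types, not the RDF4J API: SketchOptimizer, SketchOptimizerSupplier, SketchStrategyFactory and all method names are hypothetical, a query is modelled as a plain String, and the strategy, triple source and statistics are reduced to Object placeholders.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for QueryOptimizer; a query is modelled as a plain String.
interface SketchOptimizer {
    String optimize(String query);
}

// Hypothetical supplier: receives the three arguments named in the description
// (reduced to Object placeholders here) and returns a new optimizer.
@FunctionalInterface
interface SketchOptimizerSupplier {
    SketchOptimizer create(Object strategy, Object tripleSource, Object statistics);
}

// Hypothetical stand-in for the factory that collects injected suppliers.
class SketchStrategyFactory {
    private final List<SketchOptimizerSupplier> first = new ArrayList<>();
    private final List<SketchOptimizerSupplier> last = new ArrayList<>();

    // Inject a supplier whose optimizer runs before the standard pipeline.
    void addOptimizerFirst(SketchOptimizerSupplier supplier) { first.add(supplier); }

    // Inject a supplier whose optimizer runs after the standard pipeline.
    void addOptimizerLast(SketchOptimizerSupplier supplier) { last.add(supplier); }

    // Getters return copies so callers cannot mutate the factory's state.
    List<SketchOptimizerSupplier> getFirst() { return new ArrayList<>(first); }
    List<SketchOptimizerSupplier> getLast() { return new ArrayList<>(last); }

    // Assemble: injected-first optimizers, then the standard ones, then injected-last.
    List<SketchOptimizer> buildPipeline(Object strategy, Object tripleSource, Object statistics,
            List<SketchOptimizer> standard) {
        List<SketchOptimizer> pipeline = new ArrayList<>();
        for (SketchOptimizerSupplier s : first) {
            pipeline.add(s.create(strategy, tripleSource, statistics));
        }
        pipeline.addAll(standard);
        for (SketchOptimizerSupplier s : last) {
            pipeline.add(s.create(strategy, tripleSource, statistics));
        }
        return pipeline;
    }

    // Apply every optimizer in pipeline order.
    static String run(List<SketchOptimizer> pipeline, String query) {
        String result = query;
        for (SketchOptimizer o : pipeline) {
            result = o.optimize(result);
        }
        return result;
    }
}
```

Keeping two separate lists is only one way to express "beginning or end"; a single list with an index or a position flag would serve the same purpose.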
Create a Value optimiser for the MemoryStore to swap out values in the query with the "interned" value from the MemValueFactory.
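To illustrate the idea of swapping query values for interned ones, here is a minimal sketch. It does not use the MemValueFactory API: SketchValue, InterningPool and InternedValueSwapper are hypothetical stand-ins, the query is reduced to a flat list of constants, and leaving unknown values untouched is an assumption rather than something stated in this PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an RDF value; two values are equal if their labels match.
final class SketchValue {
    final String label;

    SketchValue(String label) { this.label = label; }

    @Override
    public boolean equals(Object o) {
        return o instanceof SketchValue && ((SketchValue) o).label.equals(label);
    }

    @Override
    public int hashCode() { return label.hashCode(); }
}

// Hypothetical stand-in for a value factory that keeps one canonical instance per value.
class InterningPool {
    private final Map<SketchValue, SketchValue> pool = new HashMap<>();

    // Register a value and return the canonical instance for it.
    SketchValue intern(SketchValue v) {
        SketchValue existing = pool.putIfAbsent(v, v);
        return existing != null ? existing : v;
    }

    // Return the canonical instance if the pool knows the value, else the value itself.
    SketchValue lookup(SketchValue v) {
        return pool.getOrDefault(v, v);
    }
}

// Hypothetical optimizer step: replaces each query constant with its interned instance.
class InternedValueSwapper {
    static List<SketchValue> optimize(List<SketchValue> queryConstants, InterningPool pool) {
        List<SketchValue> result = new ArrayList<>();
        for (SketchValue v : queryConstants) {
            result.add(pool.lookup(v)); // assumption: unknown values stay as they are
        }
        return result;
    }
}
```

After such a step, later comparisons against stored values can in principle use reference equality instead of comparing labels.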
PR Author Checklist (see the contributor guidelines for more details):
- [ ] my pull request is self-contained
- [ ] I've added tests for the changes I made
- [ ] I've applied code formatting (you can use mvn process-resources to format from the command line)
- [ ] I've squashed my commits where necessary
- [ ] every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change
@jeenbroekstra you want to take a look?
Some things I'm uncertain about:
- naming
- how best to handle updating the Value in a Binding in a BindingSet
- if I should extract the new optimizer from the MemoryStore into its own file
Sounds good, I'll try and take a closer look later this week.
@jeenbroekstra maybe it's easier if I name it something referring to "inject", since we can inject optimizers into the existing pipeline.
I feel this is not quite the right direction for the OptimizerPipeline. I love the idea of store-related optimizers, but I don't think this is a responsibility of the EvaluationStrategy. A query should be optimized before it is given to the EvaluationStrategy to evaluate. There should basically be a QueryOptimizerPipeline per store that builds on the standard one. Separating this would make it easier to customize the behaviour of the optimizer pipelines. Less shared state in the EvaluationStrategy would also be nice. And although it is cheap, building the optimizer pipeline would now become something that must be done for each query instead of once per store/connection. Another option is to inject this into an EvaluationStrategyFactory with a simple constructor call.
How would I optimize the query before passing it to the evaluation factory?
Also I think that we are currently building a new evaluation strategy and optimizer pipeline for each query.
I would just have an optimizer pipeline object and encourage it to be called before the query/TupleExpr is passed in.
Something like this in SailSourceConnection
// inside evaluateInternal
TripleSource tripleSource = new SailDatasetTripleSource(vf, rdfDataset);
EvaluationStrategy strategy = getEvaluationStrategy(dataset, tripleSource);
...
// tupleExpr = strategy.optimize(tupleExpr, store.getEvaluationStatistics(), bindings);
// instead do
tupleExpr = new StandardQueryOptimizerPipeline().optimize(tupleExpr);
// or even better
tupleExpr = getOptimizerPipeline().optimize(tupleExpr);
then getOptimizerPipeline() could be something like
protected QueryOptimizerPipeline getOptimizerPipeline() {
    // strategy, tripleSource and evaluationStatistics are assumed to be in scope,
    // e.g. as fields of the connection
    return new QueryOptimizerPipeline() {
        @Override
        public Iterable<QueryOptimizer> getOptimizers() {
            return Arrays.asList(
                    new BindingAssigner(),
                    new BindingSetAssignmentInliner(),
                    new ConstantOptimizer(strategy),
                    new RegexAsStringFunctionOptimizer(tripleSource.getValueFactory()),
                    new CompareOptimizer(),
                    new ConjunctiveConstraintSplitter(),
                    new DisjunctiveConstraintOptimizer(),
                    new SameTermFilterOptimizer(),
                    new UnionScopeChangeOptimizer(),
                    new QueryModelNormalizer(),
                    new QueryJoinOptimizer(evaluationStatistics),
                    new IterativeEvaluationOptimizer(),
                    new FilterOptimizer(),
                    new OrderLimitOptimizer(),
                    new ParentReferenceCleaner(),
                    new ReplaceValuesWithStoreNativeInstances(getValueFactory())); // your new optimizer step
        }
    };
}
or just a final variable of the connection
What do you think?
It's currently the strategy that is responsible for optimizing the query. I still feel that it's a natural place to add the extension to support sail level optimizers.
@hmottestad okay, sounds good to me.
I am ok with this as is. @hmottestad would you mind testing how this works out by updating the FedX optimization steps with this approach? There the strategy is not created anew per query but retrieved from a federation manager.
FedX doesn't use the EvaluationStrategyFactory interface, so I'm not quite sure where to get started there.
I really like the actual optimizer step by the way, that is super sweet.
I opened #3483 to discuss the FedX issue of it not using an EvaluationStrategyFactory. Maybe we can fix that first, and then this change is cleaner to apply? What do you think? I feel there is cleanup to be done in this area that would make the code end up nicer. If we make the EvaluationStrategyFactory less mutable, the optimizer pipelines might have all the required variables at initialization and we would not need methods such as getQueryOptimizersPre.
Still that can all be done in a different pull request and there is no need to do that work in this one.
The way the optimizer pipeline can be set from the StrictEvaluationStrategyFactory when instantiating a StrictEvaluationStrategy is quite odd: it first instantiates the strategy and then checks whether it should override the pipeline, instead of just passing in the optimizer pipeline.
The issue is that it doesn't have access to the variables that the StandardQueryOptimizerPipeline needs.
Would probably be a lot cleaner if the pipeline was stored in the EvaluationStrategyFactory and then passed into the EvaluationStrategy.
Maybe a QueryOptimizerPipelineFactory that we set in the EvaluationStrategyFactory, which then generates the pipeline when we instantiate the EvaluationStrategy. The QueryOptimizerPipelineFactory could then cache and reuse whichever optimizers it wants to, or create new ones for optimizers that depend on the transaction's state.
Then we can extend a StandardQueryOptimizerPipelineFactory to add our own optimizers to a specific Sail.
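A rough sketch of that pipeline-factory idea, under stated assumptions: none of these types exist in RDF4J as written here. PipelineStep, SketchPipelineFactory, StandardSketchPipelineFactory, StoreSketchPipelineFactory and PipelineRunner are hypothetical names, a query is modelled as a String, and the triple source and statistics are Object placeholders.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for QueryOptimizer; a query is modelled as a plain String.
interface PipelineStep {
    String apply(String query);
}

// Hypothetical factory that the strategy factory would call for each new strategy.
interface SketchPipelineFactory {
    List<PipelineStep> createPipeline(Object tripleSource, Object statistics);
}

class StandardSketchPipelineFactory implements SketchPipelineFactory {
    // Stateless step: created once and reused for every pipeline.
    private final PipelineStep normalizer = q -> q.trim();

    @Override
    public List<PipelineStep> createPipeline(Object tripleSource, Object statistics) {
        List<PipelineStep> steps = new ArrayList<>();
        steps.add(normalizer);
        // State-dependent step: built per call because it captures the statistics.
        steps.add(q -> q + " [stats=" + statistics + "]");
        return steps;
    }
}

// A store extends the standard factory and appends its own cached step.
class StoreSketchPipelineFactory extends StandardSketchPipelineFactory {
    private final PipelineStep storeStep = q -> q + " [store-native values]";

    @Override
    public List<PipelineStep> createPipeline(Object tripleSource, Object statistics) {
        List<PipelineStep> steps = super.createPipeline(tripleSource, statistics);
        steps.add(storeStep);
        return steps;
    }
}

final class PipelineRunner {
    // Apply every step in pipeline order.
    static String run(List<PipelineStep> steps, String query) {
        String result = query;
        for (PipelineStep step : steps) {
            result = step.apply(result);
        }
        return result;
    }
}
```

With this shape, the decision about which optimizers are shared and which are rebuilt per strategy lives in one place, and a Sail only overrides createPipeline to add its own steps.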
Yep, I want to try the pipeline-factory approach.
Marked this as stale to signal that the PR hasn't been active for a while and that we should consider closing it.