jena icon indicating copy to clipboard operation
jena copied to clipboard

Improvements to path evaluation

Open afs opened this issue 2 years ago • 2 comments

This issue proposes a roadmap for changes to path evaluation.

4.7.0 is December-ish and is already a significant release because of other, unrelated changes.

There are several strands in progress.

  1. Iterators, not materialized lists, so LIMIT and timeout have an effect.
  2. Better starting points for P_Alt in the case of _ (:p1|:p2) _.
  3. Further P_Alt evaluation - transform or PathLib.
  4. Triple pattern reordering across paths.

Iterators

Item 1 can be done now. We then have a starting point for the rest of the items. It needs to ensure PathLib iterators are closed eventually (caveat: cancellation is not on the iterator execution thread).

The iterators in PathLib don't need to be QueryIterators. They get wrapped in a QueryIterator so cancellation will still happen. Cancellation needs to close (Iter.close) the iterators carefully (caveat: different threads). (functional decomposition) is better is not clear.

Proposal:

  • Do this item now as a separate step

P_Alt starting points

PathLib.determineUngroundedStartingSet could handle the (:p1|:p2) case instead of relying on OpUnion expansion. There are more cases that just (:p1|:p2) that are the same general mechanism of "seeding by property" -- :p+, (:p+|:q) -- to find starting points but cardinality rules start to come into play.

Proposal:

  • do point 2 in PathLib because it applies to other forms.

The test coverage needs to be verified.

Up to here is a target for 4.7.0 as 2 changes.

It gives a common base for further changes. Otherwise we have one big change that all has to be assessed so it will take longer before any change happens.

Further Path and Transform work

The objective is to have one set of active code in the codebase.

This needs an investigation is to run tests to perform both transforms and compare the output to look for their optimization coverage.

The PathTransform (PathVisitor based) one may be a better base because it naturally handles P_Alt, but it may be harder to consider triple pattern reordering across paths and harder for some other cases like ?x (:p1+|...) ?z which involve multiple path operators to be examined. The PathCompiler approach is easier for handling state across different path operators and keeps what is the original path together but isn't yet flexible enough.

Process

Process-wise, discussing architecture on PRs does not work well. The PR UX is focues on keeping code-related comments together but foir longer running PRs when broader points are made, the time-ordering discussion of the overall architecture is lost.

Proposal:

  • Use this github issue for the general and architecture discussions
  • Use the PR for discussion on specific code

See also #1616

afs avatar Nov 19 '22 16:11 afs

Changing the PathEvaluator to Iters would be able to fix queries with one end, like ?class wdt:P279{,13} wd:Q35120 on wikidata (get all classes that have a maximum length path)

SimonBin avatar Nov 21 '22 09:11 SimonBin

Another possible future improvements for TDB 2 wrt Point 2.

Once PR #1655 is merged in (post 4.7.0 of course) the same mechanism we're using on a Graph based index to find unique graph names could also be leveraged on other indices with a longer prefix. Since you already know the graph you care about a distinct prefix search with a 16 byte prefix on the GS?? or GO?? indices (filtering for the desired graph value) might be more efficient than iterating the index and applying distinct afterwards. Would need some experimentation to see how it compares with the current implementation.

Also needs a bit of design thinking about how we signal to PathLib that the Graph implementation being used has a more efficient way of listing the distinct subjects and/or objects:

  • Maybe an extra marker interface?
  • Or just add a new default method(s) to Graph that is called instead of the current helper method(s) so specific implementation can provide a more efficient implementation

rvesse avatar Dec 09 '22 09:12 rvesse