
PipelinedHashIndexAndSolutionSetJoinOp messes up SERVICE + MINUS

Open smalyshev opened this issue 5 years ago • 24 comments

When running this query with LIMIT, MINUS does not seem to apply properly:

SELECT DISTINCT ?item WHERE {
  hint:Query hint:optimizer "None".
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "Search";
                    wikibase:endpoint "www.wikidata.org";
                    mwapi:srsearch "zika haswbstatement:P31=Q13442814".
    ?title wikibase:apiOutput mwapi:title.
  }
  BIND(IRI(CONCAT(STR(wd:), ?title)) AS ?item)
  MINUS { ?item wdt:P921 wd:Q202864. }
} 
# will work fine without LIMIT
LIMIT 10000

The difference is that without LIMIT, the query uses JVMSolutionSetHashJoinOp to perform the join, while with LIMIT it uses PipelinedHashIndexAndSolutionSetJoinOp (per usePipelinedHashJoin in AST2Bop.addSubgroup). That last one seems to not do the MINUS correctly.

smalyshev avatar Nov 16 '18 18:11 smalyshev

Digging a bit more into it, in JVMPipelinedHashJoinUtility/acceptAndOutputSolutions this loop:

    for (IBindingSet solution : solutions) {

        // add solutions to the subquery into the hash index.
        rightSolutions.add(solution);

        /*
         * We remove all mappings that generated at least one
         * result from the distinct set (which will be further
         * processed later on); this is how we discover the set
         * of distinct projections that did not join.
         */
        distinctProjectionBuffer.remove(solution.copy(getJoinVars()));

        nResultsFromSubqueries.increment();
    }
This loop is supposed to remove mappings from `distinctProjectionBuffer` that have matching solutions. In fact, it does not remove anything (`distinctProjectionBuffer` remains the same after the loop is done). This seems to happen because `getJoinVars()` is empty here (maybe this is a bug, because there is obviously a join over "item" happening), so it tries to remove an empty binding rather than the binding for "item" as intended. This in turn causes all keys to be registered as "not matched" in `distinctProjectionsWithoutSubqueryResult`, which means they make it out of the MINUS join unaffected, essentially ignoring the MINUS.

It may be that the real bug is whatever is causing joinVars to be empty in this case, but I am not sure how to find what causes it yet, since I am not sure what the meaning of joinVars is. The other (correct) query also has joinVars empty in JVMSolutionSetHashJoinOp, but still manages to work fine and do the correct join.
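To illustrate, here is a minimal, self-contained sketch (plain Java collections standing in for IBindingSet and the buffer; all names here are hypothetical, not Blazegraph code) of why removing a projection onto an empty joinVars never matches anything in the buffer:

```java
import java.util.*;

public class EmptyProjectionDemo {

    // Hypothetical stand-in for IBindingSet.copy(vars): keep only the named variables.
    static Map<String, String> project(Map<String, String> solution, Set<String> vars) {
        Map<String, String> out = new HashMap<>(solution);
        out.keySet().retainAll(vars);
        return out;
    }

    // Returns how many entries survive in the buffer after one removal pass.
    static int removeMatched(Set<String> projectionVars) {
        Set<Map<String, String>> distinctProjectionBuffer = new HashSet<>();
        distinctProjectionBuffer.add(Map.of("item", "wd:Q1"));
        distinctProjectionBuffer.add(Map.of("item", "wd:Q2"));

        // A subquery solution that joined on ?item = wd:Q1.
        Map<String, String> solution = Map.of("item", "wd:Q1");
        distinctProjectionBuffer.remove(project(solution, projectionVars));
        return distinctProjectionBuffer.size();
    }

    public static void main(String[] args) {
        // Buggy: projecting onto the empty joinVars yields {}, which matches nothing.
        System.out.println(removeMatched(Set.of()));        // 2 -- nothing removed
        // Fixed: projecting onto projectInVars = {item} removes the matched entry.
        System.out.println(removeMatched(Set.of("item")));  // 1
    }
}
```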

smalyshev avatar Nov 16 '18 19:11 smalyshev

Adding Michael. Bryan


thompsonbry avatar Nov 16 '18 19:11 thompsonbry

@mschmidt00 ?

smalyshev avatar Nov 16 '18 19:11 smalyshev

Michael Schmidt. He wrote the pipelined hash join code.

Bryan

thompsonbry avatar Nov 16 '18 19:11 thompsonbry

Yes, @mschmidt00.

beebs-systap avatar Nov 16 '18 20:11 beebs-systap

Any thoughts on this one?

smalyshev avatar Nov 26 '18 20:11 smalyshev

ping?

smalyshev avatar Dec 06 '18 00:12 smalyshev

I asked Michael to take a look. He did take a brief look. Suggest you and he follow up.

It would definitely help to provide plain Blazegraph test cases for bugs whenever possible. That makes it much easier to reproduce and analyze problems; otherwise the time we spend developing a test case reproducer can be very significant.

Bryan


thompsonbry avatar Dec 06 '18 02:12 thompsonbry

I am sorry, I understand it's hard to debug it this way. I can reproduce it with a clean install of WDQS and a small data set - so if you'd like, I can set up a reproduction on Cloud VPS and you and Michael could have a login there. I haven't found a way to reproduce it without SERVICE since that produces a different query plan. I'll try to see whether I can do the same with a federation SERVICE call.

smalyshev avatar Dec 06 '18 07:12 smalyshev

Another technique to capture these problems is to enable the solutions logger in log4j.properties. This shows everything flowing into/out of each operator. That can give you the data you need to write a unit test of a specific operator or a subplan.

Bryan


thompsonbry avatar Dec 06 '18 14:12 thompsonbry

This section should be in the default log4j.properties file. You just uncomment one line to enable this logger.

    # Solutions trace (tab delimited file). Uncomment the next line to enable.
    #log4j.logger.com.bigdata.bop.engine.SolutionsLog=INFO,solutionsLog
    log4j.additivity.com.bigdata.bop.engine.SolutionsLog=false
    log4j.appender.solutionsLog=org.apache.log4j.ConsoleAppender
    #log4j.appender.solutionsLog=org.apache.log4j.FileAppender
    log4j.appender.solutionsLog.Threshold=ALL
    #log4j.appender.solutionsLog.File=solutions.csv
    #log4j.appender.solutionsLog.Append=true
    # I find that it is nicer to have this unbuffered since you can see what
    # is going on and to make sure that I have complete rule evaluation logs
    # on shutdown.
    #log4j.appender.solutionsLog.BufferedIO=false
    log4j.appender.solutionsLog.layout=org.apache.log4j.PatternLayout
    log4j.appender.solutionsLog.layout.ConversionPattern=SOLUTION:\t%m


thompsonbry avatar Dec 06 '18 14:12 thompsonbry

Hey Stas,

it strikes me there's something wrong in the way we're using projectInVars vs. joinVars. When filling the distinctProjectionBuffer, we're using projectInVars to project on the incoming solutions (see some lines above the code you've been citing):

            // Take a distinct projection of the join variables.
            final IBindingSet bsetDistinct = chunk[i].copy(projectInVars);

Later, in the loop you identified, we're doing a projection on the joinVars:

                    distinctProjectionBuffer.remove(solution.copy(getJoinVars()));

While there is a (typically strong) correlation between joinVars and projectInVars they are not necessarily the same (joinVars is a subset of projectInVars). I didn't validate it, but my guess would be that the latter should be changed to:

                    distinctProjectionBuffer.remove(solution.copy(projectInVars));

Could be worth trying that as a fix, but I'd definitely want to carefully validate this, as there might be some unexpected implications/complications/consequences. In any case, it is important that the bindings that produce results via the subquery are removed from the distinctProjectionBuffer, and the fact that this is not happening seems to be the root cause for the MINUS semantics being broken.

Could you please share the values of the following variables at the time of entering the loop you identified (the values I give below reflect my guesses/expectations)? If there are too many solutions, a few sample values would help as well:

  • projectInVars => [?item]
  • joinVars => []
  • solutions => i.e., a couple of the solutions that we iterate over
  • distinctProjectionBuffer => i.e., the status of the buffer at the time of entering the loop

With this data at hand, I may also be able to come up with a SPARQL standard only reproducer for the problem.

Some background re. your comment on joinVars: for hash joins, the join variables are defined as those variables from the LHS and RHS of the join that overlap and are guaranteed to be bound in both parts. The situation for the query above is that "BIND(IRI(CONCAT(STR(wd:), ?title)) AS ?item)" does not give you a strong guarantee that the ?item variable will be bound (in the general case, BIND can fail and produce an unbound variable). I'm pretty sure this is the reason why joinVars are actually empty, and it is expected to be that way.
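The rule above can be sketched in a few lines (this is an illustrative toy, not Blazegraph's actual computation; the method and variable names are hypothetical): join variables are the intersection of the variables that are *definitely* bound on both sides, and a BIND that can fail keeps its variable out of the definitely-bound set.

```java
import java.util.*;

public class JoinVarsSketch {

    // Join variables = variables guaranteed to be bound on BOTH sides of the join.
    static Set<String> joinVars(Set<String> definitelyBoundLeft,
                                Set<String> definitelyBoundRight) {
        Set<String> out = new HashSet<>(definitelyBoundLeft);
        out.retainAll(definitelyBoundRight);
        return out;
    }

    public static void main(String[] args) {
        // ?item comes from a BIND that can fail, so it is only "maybe" bound
        // on the left-hand side and is excluded from the definitely-bound set.
        Set<String> left = Set.of("title");         // ?title is definitely bound
        Set<String> right = Set.of("item");         // the MINUS pattern binds ?item
        System.out.println(joinVars(left, right));  // [] -- empty, as observed
    }
}
```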

mschmidt00 avatar Dec 08 '18 06:12 mschmidt00

Tracing into acceptAndOutputSolutions, I see this:

  • joinVars = []
  • projectInVars = [item]
  • solutions = a bunch of items (99 in all for the first iteration), e.g. Q22241243, Q22252392, Q22330669. Checking against Wikidata, those seem to be answers to the MINUS clause
  • distinctProjectionBuffer - holds 100 items, probably the ones received from SERVICE.

Changing getJoinVars() to projectInVars changes which solutions are removed, but the end result seems to still be the same, as if MINUS did not work. I'll dig into it with logs and see what's going on further.

smalyshev avatar Dec 10 '18 22:12 smalyshev

Tried to debug further and I see something strange. When I watch how the execution proceeds, the first chunk is processed OK. It forms one bucket, since the key formed by JVMHashIndex.makeKey() has no variables (keyVars is empty). However, when the second chunk comes in, it goes straight to the dontRequireSubqueryEvaluation branch, because it reuses the same bucket. But this bucket is not right for this chunk: it contains the solutions relative to the previous chunk, and trying to match it against this chunk results in "no match" for every binding set, which of course means they all get into the result and MINUS does not work.

This is after applying the projectInVars fix (which looks like it also needs to be applied to the HTree class for analytic queries, because it has similar code); that fix makes the first chunk work OK, but subsequent chunks are still not right. keyVars seems to be derived from joinVars in JVMHashJoinUtility, and as we know, joinVars are empty here. So I wonder if there should be some other code that does not reuse the same bucket?

For reference, this is the code that looks up buckets in acceptAndOutputSolutions():

                /**
                 * Find bucket in hash index for that distinct projection
                 * (bucket of solutions with the same join vars from the
                 * subquery - basically a JOIN).
                 */
                final Bucket b = rightSolutions.getBucket(bsetDistinct);

Since rightSolutions has empty keyVars, getBucket always returns the same bucket. On the first chunk, it's empty, so the code proceeds to regular evaluation and everything is fine. On the second chunk, however, there is the bucket made by the first chunk, which looks completely wrong, since that bucket contains no solutions matching the second chunk. This also implies that having just one bucket after the first run is wrong (there should be 100 distinct buckets), but that does not hurt anything yet since those buckets are not used. I think the bug is whatever makes keyVars empty. Would be glad to hear your opinion on this.
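The bucket-reuse effect can be mimicked with a plain HashMap standing in for JVMHashIndex (a toy sketch with hypothetical names, not the actual Blazegraph classes): with empty keyVars, the probe for chunk 2 finds the bucket filled by chunk 1 and wrongly takes the fast path.

```java
import java.util.*;

public class SingleBucketSketch {

    // With empty keyVars, every distinct projection maps to the same key,
    // so the probe for chunk 2 finds the bucket filled by chunk 1.
    static boolean secondChunkTakesFastPath() {
        Map<List<String>, List<Map<String, String>>> index = new HashMap<>();
        List<String> emptyKey = List.of(); // projection onto keyVars = []

        // Chunk 1: subquery results are stored under the empty key.
        index.computeIfAbsent(emptyKey, k -> new ArrayList<>())
             .add(Map.of("item", "wd:Q1"));

        // Chunk 2: probes with different solutions, but the projection onto
        // the empty keyVars is the same empty key, so the bucket is non-null
        // and the chunk skips subquery evaluation.
        return index.get(emptyKey) != null;
    }

    public static void main(String[] args) {
        System.out.println(secondChunkTakesFastPath()); // true
    }
}
```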

smalyshev avatar Dec 11 '18 00:12 smalyshev

Also interesting - there's this code in JVMHashIndex:

        if (keyVars == null) {
       
            /*
             * A ZERO LENGTH joinVars[] means that all solutions will be in the
             * same hash bucket. This can arise due to poor assignment of join
             * variables or simply because there are no available join variables
             * (full cross product join). Such joins are very expensive.
             */
            
            throw new IllegalArgumentException();

        }

which implies keyVars should never be empty? But in fact it is empty: it is [], not null, so this check never fires.

smalyshev avatar Dec 11 '18 00:12 smalyshev

I did a small patch that uses Annotations.PROJECT_IN_VARS instead of HashJoinAnnotations.JOIN_VARS for PipelinedHashIndexAndSolutionSetJoinOp, and it seems to fix the problem (together with the projectInVars one) for non-analytic queries, but it looks like analytic class has the same logic, but different code, which requires different handling. What I did is:

--- a/bigdata-core/bigdata/src/java/com/bigdata/bop/join/JVMHashJoinUtility.java
+++ b/bigdata-core/bigdata/src/java/com/bigdata/bop/join/JVMHashJoinUtility.java
@@ -329,7 +329,7 @@ public class JVMHashJoinUtility implements IHashJoinUtility {
          * Otherwise use the [joinVars].
          */
         final IVariable<?>[] keyVars = filter ? (IVariable<?>[]) op
-                .getProperty(JoinAnnotations.SELECT) : joinVars;
+                .getProperty(JoinAnnotations.SELECT) : op.getKeyVars();
--- a/bigdata-core/bigdata/src/java/com/bigdata/bop/join/PipelinedHashIndexAndSolutionSetJoinOp.java
+++ b/bigdata-core/bigdata/src/java/com/bigdata/bop/join/PipelinedHashIndexAndSolutionSetJoinOp.java
@@ -479,4 +479,7 @@ public class PipelinedHashIndexAndSolutionSetJoinOp extends HashIndexOp {
 
     } // ControllerTask
 
+    public IVariable<?>[] getKeyVars() {
+        return (IVariable<?>[]) getProperty(Annotations.PROJECT_IN_VARS);
+    }
 }
--- a/bigdata-core/bigdata/src/java/com/bigdata/bop/PipelineOp.java
+++ b/bigdata-core/bigdata/src/java/com/bigdata/bop/PipelineOp.java
@@ -34,6 +34,7 @@ import java.util.concurrent.FutureTask;
 import com.bigdata.bop.engine.BOpStats;
 import com.bigdata.bop.engine.IChunkMessage;
 import com.bigdata.bop.engine.QueryEngine;
+import com.bigdata.bop.join.HashJoinAnnotations;
 
 /**
  * Abstract base class for pipeline operators where the data moving along the
@@ -614,4 +615,7 @@ abstract public class PipelineOp extends BOpBase {
      */
     abstract public FutureTask<Void> eval(BOpContext<IBindingSet> context);
 
+    public IVariable<?>[] getKeyVars() {
+        return (IVariable<?>[]) getRequiredProperty(HashJoinAnnotations.JOIN_VARS);
+    }
 }

no idea whether it's correct...

smalyshev avatar Dec 11 '18 01:12 smalyshev

I suspect that the correct behavior when there are no join variables is to map all solutions into the same bucket - same key.


thompsonbry avatar Dec 11 '18 02:12 thompsonbry

But then the bucket is re-used here:

                /**
                 *  Find bucket in hash index for that distinct projection
                 * (bucket of solutions with the same join vars from the
                 *  subquery - basically a JOIN).
                 */
                final Bucket b = rightSolutions.getBucket(bsetDistinct);

                if (b != null ||
                    distinctProjectionsWithoutSubqueryResult.contains(bsetDistinct)) {
                    /*
                     * Either a match in the bucket or subquery was already
                     * computed for this distinct projection but did not produce
                     * any results. Either way, it will take a fast path that
                     * avoids the subquery.
                     */
                    dontRequireSubqueryEvaluation.add(chunk[i]);

                } else {

and these solutions are emitted:

        if (!dontRequireSubqueryEvaluation.isEmpty()) {
            // compute join for the fast path.
            hashJoinAndEmit(dontRequireSubqueryEvaluation.toArray(
                new IBindingSet[0]), stats, out, joinConstraints, askVar);
        }

        if (distinctProjectionBuffer.isEmpty()) {
            // Nothing to do on the slow code path.
            return naccepted;
        }

without running the subquery, but the results in the bucket are from the previous subquery run with the first chunk's data. And since distinctProjectionBuffer would be empty (the previous chunk cleaned it up and the new chunk hasn't added anything), the subquery won't be evaluated for this chunk. Is that right? Am I missing something here?

smalyshev avatar Dec 11 '18 02:12 smalyshev

I think I found simple reproduction for it. Load the data from here: https://gist.github.com/smalyshev/b45a4dd739592361b413f5234de47dd7

Then run this:

SELECT DISTINCT ?item WHERE {
  hint:Query hint:optimizer "None".
  ?x <label> ?title .
  BIND(IRI(CONCAT("http://www.wikidata.org/entity/", ?title)) AS ?item)
  MINUS { 
    ?item <is> <bad> 
  }
} LIMIT 1000

It returns 199 results. Without limit, it returns (correctly) only one result.
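As a cross-check against the reproducer, here is a reference evaluation of MINUS following the SPARQL semantics (plain Java, with a hypothetical toy dataset in the spirit of the gist; not Blazegraph code). Solutions that are compatible with some excluded solution must not survive, which is exactly what the LIMIT query plan fails to enforce:

```java
import java.util.*;

public class MinusSketch {

    // Reference MINUS: keep each left solution unless it is compatible
    // with some right solution.
    static List<Map<String, String>> minus(List<Map<String, String>> left,
                                           List<Map<String, String>> right) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> l : left) {
            boolean excluded = right.stream().anyMatch(r -> compatible(l, r));
            if (!excluded) out.add(l);
        }
        return out;
    }

    // Compatible = the solutions agree on every variable bound in both, and
    // share at least one variable (disjoint domains do not exclude in MINUS).
    static boolean compatible(Map<String, String> a, Map<String, String> b) {
        boolean shared = false;
        for (String var : a.keySet()) {
            if (b.containsKey(var)) {
                shared = true;
                if (!a.get(var).equals(b.get(var))) return false;
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        // Hypothetical mini-dataset: three items, two of which match the MINUS pattern.
        List<Map<String, String>> left = List.of(
            Map.of("item", "Q1"), Map.of("item", "Q2"), Map.of("item", "Q3"));
        List<Map<String, String>> right = List.of(
            Map.of("item", "Q1"), Map.of("item", "Q2"));
        System.out.println(minus(left, right).size()); // 1 -- only Q3 survives
    }
}
```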

smalyshev avatar Dec 11 '18 05:12 smalyshev

Adding Michael.

Bryan

thompsonbry avatar Dec 11 '18 16:12 thompsonbry

On it -- confirmed the reproducer works for me, thanks Stas!

mschmidt00 avatar Dec 13 '18 03:12 mschmidt00

Hi Stas, as I understand it, using the key vars as join variables would not work -- the constraint on join variables is that they must be guaranteed to be bound.

I dived a little deeper and identified the existing problems, which are:

1.) JVMPipelinedHashJoinUtility.java

1a.) distinctProjectionBuffer.remove(solution.copy(getJoinVars())); => getJoinVars() needs to be replaced by projectInVars, as discussed previously

1b.) In the check

    if (b != null ||
        distinctProjectionsWithoutSubqueryResult.contains(bsetDistinct))

the b != null condition is not sufficient; we must also check whether distinctProjectionsWithoutSubqueryResult.contains(bsetDistinct) -- i.e., consider the non-join variables (this is because the set of join variables is empty and the join is performed over the non-join variables).

2.) HTreePipelinedHashJoinUtility.java

2a.) distinctProjectionBuffer.remove(solution.copy(getJoinVars())); => getJoinVars() needs to be replaced by projectInVars, as discussed previously

2b.) The code

    leftSolutionsWithoutMatch.add(leftSolution); // unless proven otherwise

is in the wrong place. The idea of the code was to initialize the array once, in front of the double-nested loop (but by mistake the code ended up in an inner loop, where the solution is added to leftSolutionsWithoutMatch over and over again; one witness is actually sufficient for it to be removed entirely).
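The structure described in 2b.) can be sketched as follows (simplified toy with hypothetical types; strings stand in for binding sets and equality stands in for a join match): each left solution is registered once, then dropped as soon as a single witness match is found.

```java
import java.util.*;

public class WitnessLoopSketch {

    // Register each left solution as "without match" once, before the inner
    // scan; drop it on the first witness match.
    static Set<String> solutionsWithoutMatch(List<String> leftSolutions,
                                             Set<String> rightSolutions) {
        Set<String> leftSolutionsWithoutMatch = new TreeSet<>();
        for (String left : leftSolutions) {
            leftSolutionsWithoutMatch.add(left);   // unless proven otherwise
            for (String right : rightSolutions) {
                if (left.equals(right)) {          // stand-in for a join match
                    leftSolutionsWithoutMatch.remove(left); // one witness suffices
                    break;
                }
            }
        }
        return leftSolutionsWithoutMatch;
    }

    public static void main(String[] args) {
        System.out.println(solutionsWithoutMatch(
            List.of("a", "b", "c"), Set.of("b"))); // [a, c]
    }
}
```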

mschmidt00 avatar Dec 18 '18 07:12 mschmidt00

the b != null condition is not sufficient, we must also check whether distinctProjectionsWithoutSubqueryResult.contains(bsetDistinct) -- i.e., consider the non-join variables (this is because the join variables is empty and the join is performed over the non-join variables).

So, should the condition be converted to && or I misunderstood this part?

The code leftSolutionsWithoutMatch.add(leftSolution); // unless proven otherwise is in the wrong place. The idea of the code was to initialize the array once, in front of the double-nested loop

But it uses the variable leftSolution, which is initialized inside the loop. So what should be used instead there?

smalyshev avatar Dec 19 '18 23:12 smalyshev

@mschmidt00 I've made https://github.com/blazegraph/database/pull/113 which seems to fix the queries - could you please check I understood you correctly?

smalyshev avatar Dec 20 '18 00:12 smalyshev