Improve the changelog and / or create a blog post about recent `repeat` step multi-query improvements

Open porunov opened this issue 2 years ago • 3 comments

This issue tracks documentation improvements for the following feature: #3783. As seen in the related discussion, the changelog entry for this improvement is hard to understand, and people might be confused about what really changed. We need to improve the wording, add more examples, and explain how JanusGraph's batch queries worked before #3783 and how they work now.
Current changelog says:


Batch registration for nested batch compatible steps is changed for repeat step

Previously any batch compatible steps like out, in, values, etc. would receive vertices for batch registration from all repeat parent steps, but only for their starts in case of multi-nested repeat steps (skipping their subsequent iterations registration). With JanusGraph 1.0.0 batches registration for the subsequent iterations of multi-nested repeat steps are used as well.

g.V(startVertexId).emit().
    repeat(__.repeat(__.in("connects")).emit()).
    until(__.loops().is(P.gt(2)))

In the example above multi-nested repeat case would not register vertices returned from the inner emit() step for the next outer iteration which would result in sequential calls of in("connects") for next outer iteration. The behaviour is now changed to register these vertices for the next child repeat step start.

The behaviour can be controlled by query.batch.repeat-step-mode configuration option.
In case the old behaviour is preferable then query.batch.repeat-step-mode should be set to starts_only_of_all_repeat_parents.

However, in cases when transaction cache is small and repeat step traverses more than one level deep, it could result for some vertices to be re-fetched again which would mean a waste of operation when it isn't necessary. In such situations closest_repeat_parent mode might be more preferable than all_repeat_parents.
With closest_repeat_parent mode vertices for batch registration will be received from the start of the closest repeat step as well as the end of the closest repeat step (for the next iteration). Any other parent repeat steps will be ignored.
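
To make the re-fetch concern above more concrete, here is a minimal toy model. It is only a sketch under stated assumptions: `TinyCache`, `loads_of_v2`, the LRU eviction policy, and the vertex names are all hypothetical and are not JanusGraph code. It only shows that pre-fetching every start vertex can be wasted work when the cache is too small to keep them until they are actually traversed.

```python
from collections import OrderedDict
from typing import Dict, List


class TinyCache:
    """Hypothetical LRU stand-in for a transaction cache; counts backend loads per vertex."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = OrderedDict()
        self.loads: Dict[str, int] = {}

    def load(self, vertices: List[str]) -> None:
        """Simulate a backend read for every vertex that is not already cached."""
        for v in vertices:
            if v in self.entries:
                self.entries.move_to_end(v)  # cache hit, refresh recency
                continue
            self.loads[v] = self.loads.get(v, 0) + 1
            self.entries[v] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used entry


def loads_of_v2(capacity: int, prefetch_all_starts: bool) -> int:
    """Return how many times v2 is loaded from the simulated backend."""
    cache = TinyCache(capacity)
    if prefetch_all_starts:
        cache.load(["v1", "v2", "v3"])  # all start vertices registered for the first batch
    else:
        cache.load(["v1"])  # only what the first traverser needs
    cache.load(["a", "b", "c", "d"])  # assumed vertices visited below v1
    cache.load(["v2"])  # the traversal finally reaches v2
    return cache.loads["v2"]


print(loads_of_v2(capacity=4, prefetch_all_starts=True))   # small cache, v2 pre-fetched
print(loads_of_v2(capacity=4, prefetch_all_starts=False))  # small cache, v2 fetched on demand
print(loads_of_v2(capacity=16, prefetch_all_starts=True))  # large cache, v2 pre-fetched
```

In this toy model the pre-fetch only pays off when the cache is large enough to hold the start vertices until they are used, which is the trade-off that the choice between the modes is meant to address.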


The changelog's first sentence is quite confusing and hard to understand. We need to restructure it into something different or simply write a blog post to explain what changed.

In short, the old behavior was a bad version of the new starts_only_of_all_repeat_parents mode. I say a bad version because:

  • Previously, not only the MultiQueriable steps that are start steps of their parent repeat steps received vertices for batch queries: ALL MultiQueriable child steps received vertices for batching from all of their repeat steps. For example, take g.V(v1,v2,v3).repeat(out().out().out().out()).emit(). You may think that only the first out step is registered with the JanusGraphMultiQueryStep which is placed before the .repeat() step. That is true after #3783, but it was not the case before. Before #3783 we registered each of those four out steps with JanusGraphMultiQueryStep. As a result, each of those steps most likely performed badly on its first batch query, because it had to perform unnecessary operations for unnecessary vertices before it had a chance to perform batch requests for the vertices it needed. This could be considered a performance bug. It is fixed in #3783.
  • Next iterations were often not considered. For example, take g.V(v1).emit().repeat(out()).until(loops().is(P.gt(5))). Here we didn't register the out() result of the previous iteration with the out() of the next iteration. Thus, we basically performed single-vertex batches, which are quite inefficient. In #3783 the behavior is changed, and we now register whatever result we have after a repeat iteration with the next iteration.
  • Multi-nested repeat steps are trickier because we might not want to register next iterations for nested repeat steps. Thus, we need modes that control the behavior. Do we want to register with the start of every parent repeat step? Do we want to register with the end of every parent repeat step (i.e. next iterations)? It depends on the situation. For example, take g.V(v1,v2,v3).emit().repeat(__.repeat(__.in("connects")).emit()).until(__.loops().is(P.gt(10))). Here we need to understand that each repeat iteration processes a single Traverser. So, do we really want to register v1, v2, and v3 with in("connects") as the first batch query request? If the transaction cache is small, then by the time we traverse v2 after v1 the cached data may already be gone, and we will perform a multi-query request for v2 again, which makes the first multi-query request redundant. v2 is not used by the first traverser going into in("connects"), and most likely v2 won't even be the second or third traverser going into in("connects"), because there could be several levels going from v1. Nevertheless, in some situations users want to retrieve v2 and v3 together with v1 at the first access, because they know that their transaction cache is big enough and that they will eventually access v2 and v3 during their traversal. The same logic applies to the next iterations as well, i.e. we don't know whether the user wants to request vertices in batch for the next iteration or not. We can say it only for the first parent repeat step, but not for all parent repeat steps. IN SHORT: previously the logic would not take next iterations into account, but would always register vertices from the start of ALL repeat steps which are direct parents to each other. This is now changed, and we can choose whether to use the closest repeat parent step only, all repeat steps for starts, or all repeat steps for both starts and ends.
  • Continuing with multi-nested repeat steps: in the previous point I said ALL repeat steps which are direct parents to each other. This is actually another flaw. The problem is that we didn't take non-repeat steps into account in this situation. For example, take g.V(v1,v2,v3).emit().repeat(union(repeat(out()).emit())).until(loops().is(5)). As you might notice, the inner repeat step is not a direct child of the outer repeat step. Even though union is a start step and the inner repeat step is also a start step, v1, v2, and v3 wouldn't be registered for the first batch, because those repeat steps are not directly nested. In #3783 the behavior is improved, and we can now detect and skip over any multi-query compatible parent start steps. Thus v1, v2, and v3 will be registered for the first batch request (if the all_repeat_parents or starts_only_of_all_repeat_parents mode is used).
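
The "next iterations" point from the list above can be sketched with a small toy model. This is not JanusGraph's implementation: the graph, the `count_backend_calls` helper, and the level-by-level evaluation are assumptions made purely for illustration. The model charges one backend request per batch, regardless of the batch size, and compares a repeat(out()) loop with and without registering each iteration's result for the next iteration.

```python
from typing import Dict, List

# Hypothetical toy graph: vertex -> outgoing neighbours. Not JanusGraph data.
GRAPH: Dict[str, List[str]] = {
    "v1": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
    "c": [],
    "d": [],
    "e": [],
}


def count_backend_calls(start: List[str], loops: int, register_next_iteration: bool) -> int:
    """Count simulated backend requests for repeat(out()) executed `loops` times."""
    calls = 0
    frontier = list(start)
    for iteration in range(loops):
        if not frontier:
            break
        if iteration == 0 or register_next_iteration:
            # The whole frontier was registered up front: one batched request.
            calls += 1
        else:
            # Nothing was registered for this iteration: one request per vertex.
            calls += len(frontier)
        frontier = [n for v in frontier for n in GRAPH[v]]
    return calls


without_registration = count_backend_calls(["v1"], loops=3, register_next_iteration=False)
with_registration = count_backend_calls(["v1"], loops=3, register_next_iteration=True)
print(without_registration, with_registration)
```

In this model the gap between the two counts grows with the fan-out of each level, which is why single-vertex batches on later iterations are described above as inefficient.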

We need to present all of the above in a form that is easier for current users to follow. We also need to explain why those repeat modes exist and how they work.

Here are the current mode descriptions from batch-processing.md:


Multi-nested repeat step modes:

By default, in cases when batch start steps have multiple repeat step parents the batch registration is considering all repeat parent steps.
However, in cases when transaction cache is small and repeat step traverses more than one level deep, it could result for some vertices to be re-fetched again or vertices which don't need to be fetched due to early cycle end could potentially be fetched into the transaction cache. It would mean a waste of operation when it isn't necessary.

Thus, JanusGraph provides a configuration option query.batch.repeat-step-mode to control multi-repeat step behaviour:

  • closest_repeat_parent (default option) - consider the closest repeat step only.
    g.V().repeat(and(repeat(out("knows")).emit())).emit()
    
    In the example above, out("knows") will be receiving vertices for batching from the and step input for the first iterations as well as the out("knows") step output for the next iterations.
  • all_repeat_parents - consider registering vertices from the start and end of each repeat step parent.
    g.V().repeat(and(repeat(out("knows")).emit())).emit()
    
    In the example above, out("knows") will be receiving vertices for batching from the most outer repeat step input (for the first iterations), the most outer repeat step output (which is and output) (for the first iterations),
    the and step input (for the first iterations), and from the out("knows") output (for the next iterations).
  • starts_only_of_all_repeat_parents - consider registering vertices from the start of each repeat step parent.
    g.V().repeat(and(repeat(out("knows")).emit())).emit()
    
    In the example above, out("knows") will be receiving vertices for batching from the most outer repeat step input (for the first iterations), the and step input (for the first iterations), and from the out("knows") output (for the next iterations).
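
One way to summarize the three mode descriptions above is as a small lookup function. This is a hedged sketch and not JanusGraph code: `registration_sources` and the string labels are hypothetical, and the generalization to an arbitrary nesting depth is an assumption extrapolated from the two-level example in the docs.

```python
from typing import List

MODES = ("closest_repeat_parent", "all_repeat_parents", "starts_only_of_all_repeat_parents")


def registration_sources(repeat_parents: List[str], mode: str) -> List[str]:
    """List where a batch-compatible start step receives vertices from.

    `repeat_parents` is ordered from the outermost repeat step to the closest one.
    The labels are illustrative strings, not JanusGraph identifiers.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    last = len(repeat_parents) - 1
    if mode == "closest_repeat_parent":
        closest = repeat_parents[last]
        return [f"start of {closest}", f"end of {closest}"]
    sources: List[str] = []
    for i, parent in enumerate(repeat_parents):
        sources.append(f"start of {parent}")
        if mode == "all_repeat_parents" or i == last:
            # The closest repeat feeds its output into its next iteration in every mode.
            sources.append(f"end of {parent}")
    return sources


parents = ["outer repeat", "inner repeat"]
for mode in MODES:
    print(mode, "->", registration_sources(parents, mode))
```

For the g.V().repeat(and(repeat(out("knows")).emit())).emit() example, "start of inner repeat" corresponds to the and step input and "end of inner repeat" corresponds to the out("knows") output.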

porunov avatar Jun 04 '23 15:06 porunov

Just want to say that I also think that a blog post would be great here. In general, I think that it would be really great if we could create some blog posts to accompany the 1.0.0 release as we have quite some interesting features in it that we want to make users aware of. Downside is of course that someone needs to take the time to write those posts. It also looks like we already have a domain in place for such a blog: https://github.com/JanusGraph/blog.janusgraph.org

FlorianHockmann avatar Jun 05 '23 10:06 FlorianHockmann

Just want to say that I also think that a blog post would be great here. In general, I think that it would be really great if we could create some blog posts to accompany the 1.0.0 release as we have quite some interesting features in it that we want to make users aware of. Downside is of course that someone needs to take the time to write those posts. It also looks like we already have a domain in place for such a blog: https://github.com/JanusGraph/blog.janusgraph.org

I plan to write a blog post about the multi-query improvements in JanusGraph. That said, I’m not sure which platform to use yet. Medium is the only blogging platform I know. It would be great to have all related blog posts under blog.janusgraph.org, but I haven’t researched how to do that, so I’m not sure about it. I guess some kind of UI integration for blog posts could exist.

porunov avatar Jun 05 '23 10:06 porunov

I think that it would be really great if we could create some blog posts to accompany the 1.0.0 release as we have quite some interesting features in it that we want to make users aware of

I was planning to write a blog post to introduce the string vertex id feature. I was planning to publish it on my own medium blog, but if there's an official one set up, I'd like to publish it there too.

li-boxuan avatar Jun 05 '23 16:06 li-boxuan