Improve the changelog and / or create a blog post about recent `repeat` step multi-query improvements
This issue is here to track documentation improvements for the next feature: #3783
As seen from the next discussion the change-log regarding this improvement is hard to understand and people might be confused what really changed. We need to improve wording, add more examples, explain how JanusGraph's batch queries was working before #3783 and how it works now.
Current changelog says:
Batch registration for nested batch compatible steps is changed for repeat step
Previously any batch compatible steps like out, in, values, etc. would receive vertices for batch registration
from all repeat parent steps, but only for their starts in case of multi-nested repeat steps
(skipping their subsequent iterations registration).
With JanusGraph 1.0.0 batches registration for the subsequent iterations of multi-nested repeat steps are used as well.
g.V(startVertexId).emit().
repeat(__.repeat(__.in("connects")).emit()).
until(__.loops().is(P.gt(2)))
In the example above multi-nested repeat case would not register vertices returned from the inner emit() step
for the next outer iteration which would result in sequential calls of in("connects") for next outer iteration. The behaviour is
now changed to register these vertices for the next child repeat step start.
The behaviour can be controlled by query.batch.repeat-step-mode configuration option.
In case the old behaviour is preferable then query.batch.repeat-step-mode should be set to starts_only_of_all_repeat_parents.
However, in cases when transaction cache is small and repeat step traverses more than one level
deep, it could result for some vertices to be re-fetched again which would mean a waste of operation when it isn't necessary.
In such situations closest_repeat_parent mode might be more preferable than all_repeat_parents.
With closest_repeat_parent mode vertices for batch registration will be received from the start of the closest
repeat step as well as the end of the closest repeat step (for the next iteration). Any other parent repeat steps
will be ignored.
The changelog's first sentence is quite confusing and hard to understand. We need to restructure it in something different or simply write a blog post to explain what changed.
In short, the old behavior was a bad version of the new starts_only_of_all_repeat_parents mode.
I say a bad version because:
- Previously not only
MultiQueriablesteps which arestartsteps of theirrepeatparent steps would receive vertices for batch queries, but ALLMultiQueriablechildren steps would receive vertices for batching from all theirrepeatsteps. For example,g.(v1,v2,v3).repeat(out().out().out().out()).emit()- in this example you may think that only the firstoutstep will be registered withJanusGraphMultiQueryStepwhich is placed before.repeat()step. It's true after #3783 , but not the case before. Before #3783 we would register each of those fouroutsteps withJanusGraphMultiQueryStep. So, each of those steps will perform badly for their first batch query most likely because they will have to perform unnecessary operations for unnecessary vertices before they have a chance to perform batch requests for needed vertices. This could be considered as aperformancebug. Nevertheless it's fixed in #3783 . - Next iterations were often not considered. For example,
g.V(v1).emit().repeat(out()).until(loops().is(P.gt(5))). In this situation we don't registerout()result of the previous iteration toout()of the next iterations. Thus, we basically perform single vertex batches which are quite inefficient. In #3783 the behavior is changed and we now register whatever result we have afterrepeatiteration with the next iteration. - multi-nested
repeatsteps are more trickier because we might now want to register next iterations for nestedrepeatsteps. Thus, we have to have some modes which control the behavior. Do we want to register with all parents'repeatstep start? Do we want to register with all parents'repeatstep end (i.e. next iterations)? It all depends in case by case situation. For example,g.V(v1,v2,v3).emit().repeat(__.repeat(__.in("connects")).emit()).until(__.loops().is(P.gt(10)))in this situation we need to understand that eachrepeatiteration performs a singleTraverser. So, do we really want to registerv1,v2, andv3within("connects")as the first batch query request? In case transaction cache is small then it might be that when we traversev2afterv1- the cache is already gone and we will perform multi-query request forv2again, thus making the first multi-query request redundant.v2is not used on the first traverser going intoin("connects")and most likelyv2won't be even second or third traverser going intoin("connects")because there could be several levels going fromv1. Nevertheless, in some situations users want to retrievev2andv3together withv1at the first access because they know that their transaction cache is big enough and they eventually will accessv2andv3during their traversal. The same logic applies to the next iterations as well. I.e. we don't know if the user wants to request vertices in batch for the next iteration or not. We can say it only for the first parentrepeatstep, but not for all parentrepeatsteps. IN SHORT: Previously the logic would not take into account next step iterations, but always would always register vertices from the start of ALL repeat steps which are direct parents to each other. It's not changed and we can now say if we want to use the closestrepeatparent step only, allrepeatsteps for starts or allrepeatsteps both for starts and ends. - Continuing to talk about multi-nested repeat steps, if we check the previous issue, we see that I said ALL repeat steps which are direct parents to each other. This is actually another flow. The problem is that we don't take into account non-repeat steps in this situation. For example,
g.(v1,v2,v3).emit().repeat(union(repeat(out()).emit())).until(loops().is(5))- in this situation, as you might notice, innerrepeatstep is not a direct child of outerrepeatstep. Even sounionis a start step and the innerrepeatstep is also a start step -v1,v2,v3won't be registered for the first batch because thoserepeatsteps are not directly referenced. In #3783 the behavior is improved and we can now detect and skip any multi-query compatible start parent steps. Thusv1,v2, andv3will be registered for the first batch request (ifall_repeat_parentsorstarts_only_of_all_repeat_parentsmodes are used).
We need to explain all of the above information in some other form which is easier to catch up by current users. Also, we need to explain why those repeat modes exist and how they work.
Here is the current modes descriptions from the batch-processing.md:
Multi-nested repeat step modes:
By default, in cases when batch start steps have multiple repeat step parents the batch registration is considering all repeat
parent steps.
However, in cases when transaction cache is small and repeat step traverses more than one level
deep, it could result for some vertices to be re-fetched again or vertices which don't need to be fetched due to early
cycle end could potentially be fetched into the transaction cache. It would mean a waste of operation when it isn't necessary.
Thus, JanusGraph provides a configuration option query.batch.repeat-step-mode to control multi-repeat step behaviour:
-
closest_repeat_parent(default option) - consider the closestrepeatstep only.
In the example above,g.V().repeat(and(repeat(out("knows")).emit())).emit()out("knows")will be receiving vertices for batching from theandstep input for the first iterations as well as theout("knows")step output for the next iterations. -
all_repeat_parents- consider registering vertices from the start and end of eachrepeatstep parent.
In the example above,g.V().repeat(and(repeat(out("knows")).emit())).emit()out("knows")will be receiving vertices for batching from the most outerrepeatstep input (for the first iterations), the most outerrepeatstep output (which isandoutput) (for the first iterations),
theandstep input (for the first iterations), and from theout("knows")output (for the next iterations). -
starts_only_of_all_repeat_parents- consider registering vertices from the start of eachrepeatstep parent.
In the example above,g.V().repeat(and(repeat(out("knows")).emit())).emit()out("knows")will be receiving vertices for batching from the most outerrepeatstep input (for the first iterations), theandstep input (for the first iterations), and from theout("knows")output (for the next iterations).
Just want to say that I also think that a blog post would be great here. In general, I think that it would be really great if we could create some blog posts to accompany the 1.0.0 release as we have quite some interesting features in it that we want to make users aware of. Downside is of course that someone needs to take the time to write those posts. It also looks like we already have a domain in place for such a blog: https://github.com/JanusGraph/blog.janusgraph.org
Just want to say that I also think that a blog post would be great here. In general, I think that it would be really great if we could create some blog posts to accompany the 1.0.0 release as we have quite some interesting features in it that we want to make users aware of. Downside is of course that someone needs to take the time to write those posts. It also looks like we already have a domain in place for such a blog: https://github.com/JanusGraph/blog.janusgraph.org
I have plans to write a blog post related to multi-query improvements in JanusGraph. That said, I’m not sure what platform to use yet. I know only Medium for blog posts. It would be great to have all related blog posts under blog.janusgraph.org, but I didn’t research how to do that, so not sure about it. I guess some blog posts UI integration could exist.
I think that it would be really great if we could create some blog posts to accompany the 1.0.0 release as we have quite some interesting features in it that we want to make users aware of
I was planning to write a blog post to introduce the string vertex id feature. I was planning to publish it on my own medium blog, but if there's an official one set up, I'd like to publish it there too.