[Q&A] How to capture path length?
How to let pygrahistry to preserve the path length from root to each given node? Suppose I want to search 3 hops starting from Niaj. Using hops=3 will return all the nodes that occur in the result, but it won’t include the path lengths. To include path lengths, I could perform the hops step by step in order to add tag columns(hop1, hop2, etc.). However, if a node is the last one before the final step, it will be dropped after performing the next hop. As a result, only nodes that can actually be reached within the given number of hops are captured.
Hi @morristai thanks for the pictures and example to help discuss
-
I'm surprised the results are different tbh, and suspect a bug in the first case, as Tina should not be selected: there is no 3-hop forwards flow from Niaj that leads to Tina, so should have been dropped as in the second query. Will investigate .
-
Currently we use 'name=' to label, so your thinking is right for on-the-fly labeling. Labeling path step length is interesting too, and we are starting to look at adding path enumeration operators in general vs current graph one. Our priority is getting multihop predicates and cypher syntax, and in the process, think we figured out vectorized paths :)
-
However, are you also asking if there is a way to allow hop ranges, likes hops 1..3, instead of just hops=3 (exact)?
Thanks for the quick response!
- However, are you also asking if there is a way to allow hop ranges, likes hops 1..3, instead of just hops=3 (exact)?
Yes, hop ranges would definitely be helpful. Do you have any suggestions on how to implement this for now? (Is this a limitation of PyGraphistry or cuGraph?)
I confused myself in the reply -- current n-hop is already 1..n: so inclusive, not 'exactly n hops'. That is why tina (n=2) was allowed in case 1, and rejected in 2 . So I believe no bug. Your example 2 shows how to force exactly n right now.
- For exactly n, you can lean on python shorthand for the exactly n hops scenario:
[n(), *[e(name=f'hop_{i}') for range(5), n()]
- Longer-term, adding controllable ranges like
2..5(between 2 and 5) and5..5(exactly 5) makes sense, as does an ability to introspect on the hop #. I'll file a ticket to explore. Some of this is a natural part of the increasing cypher compatibility push so timely.
RE:cuGraph, good question, we've been meaning to write this stuff up. GFQL's declarative multi-hop graph pattern/motif matching is purely pygraphistry side. Our engine is a vectorized solver flow over generic cudf / pandas table operators, with a similar insight as the yannakakis 2-pass semijoin reduction db alg. This lets us do bulk vector operations, vs the traditional path enumeration flows that almost all cypher implementations do, while maintaining the declarative nature of cypher over gremlin. We're looking into adding path enumerations, with the goal being making them a 'pay-as-you-go' engine mode vs how ~all cypher engines pay that asymptotic penalty all the time.
Thanks for the clear explanation — TIL.
However, I still hope that PyGraphistry can retain every edge it encounters in the _edges dataframe during traversal. At the moment, I don’t think there’s a way to preserve path length when performing separated range-hops (as it drops every node that ends before final hop).
I'm not seeing the path to Tina that you'd expect to be in the match set -- I only see a 2 hop and 4+ hop paths, while the query is requiring 3:
- 2:
niaj -> bob -> tina - 4:
niaj -> ivan -> uma -> zack -> tina - ...
What would the 3-hop path to tina be that you'd expect to show up here as a match?
Or the query you want is some sort of 'rejection' query, where you want to see all one/two/three-hop queries rooted at Niaj, and a way to know the partial 1..2 paths that ultimately failed to lead to niaj?
today
For now, you can manually wrangle the graph to find such 'incomplete' nodes:
# nodes & edges 1-2 hops out
g1_2 = g.gfql([n({'entityId': 'Niaj'}), e_forward(hops=2, name='hop1..2')])
# nodes & edges 1-3 hops out
g1_3 = g.gfql([n({'entityId': 'Niaj'}), e_forward(hops=3, name='hop1..3')])
# nodes & edges 3 hops out
g3_3 = g.gfql([n({'entityId': 'Niaj'}), e_forward(), e_forward(), e_forward(), n(name='hop3')])
# Find 1..3-hop nodes and show which are in 1..2 but not 3:
enrich1 = g1_3.nodes(g1_3._nodes.assign(
g1_3._nodes.merge(g1_2._nodes, how='left', on=g1_2._node)
))
enrich1.nodes(enrich1._nodes.assign(incomplete=enrich1._nodes['hop1..2'] && !enrich1._nodes['hop3']))
...
future: mark
We're playing with the idea of a mark mode: Use matches to enrich with a label, vs current filter, and especially pipelined with the new let operator @ https://github.com/graphistry/pygraphistry/issues/755 :
Manuel:
# 1..3-hop graph
g1_3 = g.gfql([n({'entityId': 'Niaj'}), e_forward(hops=3')])
# mark 1..2 hops & 3 hops
g1_3b = g1_3.mark([n({'entityId': 'Niaj'}), e_forward(hops=2, name='hop1..2')], name = 'hops1..2')
g1_3c = g1_3b.mark([n({'entityId': 'Niaj'}), e_forward(), e_forward(), e_forward(), n(name='hop3')])
# mark where 1..2 hops but not 3 hops
g1_3d = g1_3c.nodes(g1_3c._nodes.assign(incomplete=enrich1._nodes['hop1..2'] && !enrich1._nodes['hop3']))
Wrangling these enrichments is still a bit annoying, so we're also looking for let() mark support for clearer enrichments:
Streamlined:
g1_3.let({
'g1_3b': ...,
'g1_3c': ...,
'g1_3d': ...
})
Please forgive any misunderstandings. What I’d like to clarify is that e(hops = n) should not remove any data from previous results. For example, if a user wants to collect the distance to any given node reachable within n hops, they can add a custom name to either n or e to capture that information.
In the first example, Tina appears in the result, and I know it takes 2 steps from Niaj thanks to target2. However, in the second example, Tina is missing because the third hop drops her from both the _node and _edge sets. Understanding all nodes within n hops is important, as there are several practical use cases — for instance, analyzing how a compromised endpoint can spread to nearby neighbors and how far it can propagate.
Perhaps we could preserve all information in _edges during traversal, but not in _nodes. This way, users could still exclude “Tina” from the _edges dataframe by using the unique nodes in the _nodes dataframe, if they prefer.
g2 = g.gfql([
n({'entityId': is_in(["Niaj"])}), name='entry'),
e_forward(hops=1, name="hop1"),
n(name='target1'),
e_forward(hops=1, name="hop2"),
n(name='target2'),
])
g2._nodes
| entityId | entry | target1 | target2 | |
|---|---|---|---|---|
| 0 | Niaj | True | False | False |
| 1 | Bob | False | True | False |
| 2 | Ivan | False | True | False |
| 3 | Uma | False | False | True |
| 4 | Vera | False | False | True |
| 5 | Tina | False | False | True |
g2 = g.gfql([
n({'entityId': is_in(["Niaj"])}), name='entry'),
e_forward(hops=1, name="hop1"),
n(name='target1'),
e_forward(hops=1, name="hop2"),
n(name='target2'),
e_forward(hops=1, name="hop3"),
n(name='target3')
])
g2._nodes
| entityId | entry | target1 | target2 | target3 | |
|---|---|---|---|---|---|
| 0 | Niaj | True | False | False | False |
| 1 | Bob | False | True | False | False |
| 2 | Ivan | False | True | False | True |
| 3 | Uma | False | False | True | False |
| 4 | Vera | False | False | True | False |
| 5 | Sam | False | False | False | True |
| 6 | Zack | False | False | False | True |
Ah thank you for the blast radius example. So I think the scenario is two things:
-
want "1..n" hops out, so any of 1, 2, 3, ... n, hops out vs exactly n: <--
e(hops=n)already does this -
want to know the earliest this happens, so later you can do something like color on radius <-- non-obvious right now
For implementing, I see a few ways:
-
long-term path length: when we add path predicate support, we can probably do something like
match p=a-[1..4]->b return p, len(p)and fancier forms that put the label on the edges -
today: label via calling bfs/sssp
from graphistry import let, ref, n, call, between
root_id = "abc"
result = g.gfql(let({
# 1) 1..3 hop neighborhood subgraph rooted at `abc`
'subg': [
n({'id': root_id}, name='root'),
e_forward(hops=3)
],
# 2) On that subgraph, run cuGraph or igraph BFS to label min hops
'labeled': ref('subg', [
call('compute_cugraph', {
'alg': 'bfs',
'out_col': 'min_hops',
'params': {
'start': root_id,
'depth_limit': 3,
'return_predecessors': False,
},
}),
]),
}))
# access labeled result
nodes = result._nodes[result._nodes['labeled']]
nodes[['id', 'min_hops']]
- medium term: we add some sort of field
e_forward(hops=3, label_hops=True)to just automate the flow , giving each node/edge something closer to what you're initially suggesting
Hi @morristai fyi working to add as a native capability in https://github.com/graphistry/pygraphistry/pull/851
In parallel adding a new implementation verifier (alloy model checker) for these kind of additions, so will likely land together
Hi @morristai fyi working to add as a native capability in #851
In parallel adding a new implementation verifier (alloy model checker) for these kind of additions, so will likely land together
I appreciate your response and the effort you put into this—amazing!