cypher-for-gremlin icon indicating copy to clipboard operation
cypher-for-gremlin copied to clipboard

Unable to filter out paths containing self edges using a cypher Match query.

Open pushkarnagpal opened this issue 5 years ago • 5 comments

Trying the following query to get a path using a match query to filter out paths that contain self edges (node with edges to itself to avoid increased length of path:

match p = (base:party{node_id:'245793137123'})<-[trail:ownership*..]-(leaves:party) where none( rel in relationships(p) where startNode(rel) = endNode(rel) ) return count(p)

But this query is unable to filter out the data and I get the entire set.

pushkarnagpal avatar Aug 26 '19 13:08 pushkarnagpal

Hello @pushkarnagpal,

Are you sure you've provided a valid Cypher query?

It does not compile in Cypher for Gremlin and Neo4j gives error about multiple WHERE cases.

image

dwitry avatar Aug 27 '19 07:08 dwitry

Hello @pushkarnagpal,

Are you sure you've provided a valid Cypher query?

It does not compile in Cypher for Gremlin and Neo4j gives error about multiple WHERE cases.

image

Hi @dwitry I've updated the query now. (had accidentally written not instead of none in the query.

match p = (base:party{node_id:'245793137123'})<-[trail:ownership*..]-(leaves:party) where none( rel in relationships(p) where startNode(rel) = endNode(rel) ) return count(p)

Please look into the issue now. Thanks!

pushkarnagpal avatar Aug 27 '19 08:08 pushkarnagpal

Variable length path in loops is a complex case, which is tricky to implement in Gremlin to fully cover all the edge cases. I've invested a lot of time trying to properly implement it (1, 2) but the solution still not universal. Unfortunately can not provide any estimates when this will be improved.

dwitry avatar Aug 27 '19 09:08 dwitry

Variable length path in loops is a complex case, which is tricky to implement in Gremlin to fully cover all the edge cases. I've invested a lot of time trying to properly implement it (1, 2) but the solution still not universal. Unfortunately can not provide any estimates when this will be improved.

Ahh okay. I was working on a fraud detection case.and got stuck here. I saw this query conversion from the cypher in the gremlin-server.log file, and I see a .neq(' cypher.null') in the query and i assume this step is occuring at the none where clause. I think this is the root cause of the bug.

g.V().as('base')
	.hasLabel('party').has('node_id', eq('245793137123'))
	.repeat(__
				.inE('ownership')
				.as('trail')
				.aggregate('  cypher.path.edge.p')
				.outV()
		)
	.emit()
	.until(__
			.path()
			.from('base')
			.count(local)
			.is(gte(21))
		)
	.as('leaves')
	.hasLabel('party')
	.path()
	.from('base')
	.as('p')
	.where(__
			.not(__.V().hasLabel('party').outE('ownership').inV().as('  GENERATED4').where(__.select('  GENERATED4').where(eq('leaves'))))
		)
	.optional(__
		.select(all, 'trail').as('trail')
		)
	.select('p')
	.is(neq('  cypher.null'))
	.count()
	.project('count(p)')
	.by(__.identity())

Hope I could help somehow with this. Thanks!

pushkarnagpal avatar Aug 27 '19 10:08 pushkarnagpal

.neq(' cypher.null') is guard to handle cypher.null which represents null in Cypher for Gremlin.

We may call "root cause" the following snippet:

.repeat(__
            .inE('ownership')
            .as('trail')
            .aggregate('  cypher.path.edge.p')
            .outV()
    )
.emit()
.until(__
        .path()
        .from('base')
        .count(local)
        .is(gte(21))
    )

This is how variable length path <-[trail:ownership*..]- is translated. Basically, it is an imperative way to traverse relationships (21 is limit to avoid infinite loops). If there is a loop, the path will be traversed repeatedly (what happens in the query you've provided).

So we need to find a better implementation of this in Gremlin, but as I've said in the previous message this is a rather complicated task, as there are other edge cases that need to be considered, as well as performance.

dwitry avatar Aug 27 '19 14:08 dwitry