neo4j-graph-algorithms icon indicating copy to clipboard operation
neo4j-graph-algorithms copied to clipboard

Consider adding support for Personalized PageRank (aka Topic Sensitive PageRank)

Open ghost opened this issue 6 years ago • 6 comments

An extension of the PageRank (PR) algorithm exists referred to as both Personalized PageRank (PPR) and Topic Sensitive PageRank.

The PR algorithm ranks all vertices of a graph, where more “important” nodes gains high scores. This score is highly dependent on the degree of the node. The PR algorithm and the random surfer view on this algorithm is described in multiple sources ( start here https://en.wikipedia.org/wiki/PageRank) .

In PPR each vertex has a probability/score representing the likelihood that the random surfer visits that node, in the case where he selects to “teleport”. In PPR the probability of teleportation is also controlled by the damping factor.

The parameter introduced in PPR is often referred to as the restart vector, being a vector of probabilities of length N. N being the number of vertices in the graph.

More effort and research should done in finding the parallell implementation of PPR, most suitable to implement in neo4-graph-algorithms.

The PPR is more personalized than PR, in the sense that the graph walk are biased towards the nodes with high scores in the restart vector. This algorithm has several use cases i.e. recommendation systems, personalized searches, who-follows-who, fraud propagation.

My current application of PPR is fraud propagation, and in my opinion it would make sense to support this in neo4j. Both because you now have support for PR. And because neo4j already is powerful when it comes to fraud detection using “pattern matching” with Cypher. PPR support can take this a step further.

Some thoughts on implementation

  • The PPR should take a parameter referring to the node property holding the restart probability
  • In the case where only a few nodes have restart prob. > 0 some optimisation is necessary. In these sparse scenarios it does not make sense to visit all nodes to fetch the restart prob.
  • Consider adding support for weights on edges, this is more relevant for PPR than PR.
  • Consider adding support for directed edges. Only following outgoing edges.

ghost avatar Aug 11 '17 08:08 ghost

I would be interested in this as well!

adstruble avatar Aug 16 '17 16:08 adstruble

We have a version of this algorithm as described here - https://neo4j.com/docs/graph-algorithms/current/algorithms/page-rank/#_personalized_pagerank

Let us know if there are any changes we should make to it so it helps you solve your use cases.

Cheers, Mark

mneedham avatar Sep 04 '18 09:09 mneedham

@mneedham : Good to see the "algo-team" is making progress on PPR features. Reading the documentation, it seems as the new PPR algorithm supports reset vectors of binary type [1,0,0,0,1] etc, as requested in #472 . Not the more general case described here.

Is that correct ? If so, any plans of extending it to the more general case?

Regards Rune

ghost avatar Sep 04 '18 09:09 ghost

Yeh you're right it's the reset vectors. We can look at extending it to do the probability approach as well. Do you have a preferred way how the procedure should look?

e.g. right now it's like this:

MATCH (siteA:Page {name: "Site A"})
CALL algo.pageRank.stream('Page', 'LINKS', {iterations:20, dampingFactor:0.85, sourceNodes: [siteA]})
YIELD nodeId, score
MATCH (node) WHERE id(node) = nodeId
RETURN node.name AS page,score
ORDER BY score DESC

What should the config parameter look like for the probability based version?

mneedham avatar Sep 25 '18 13:09 mneedham

In the general case, supporting reset vectors where individual nodes can have different probabilities. I think it would make sense to pass a named property reference, referring to a node property of type double/float which contains the actual probability of that node being visited.

ghost avatar Sep 26 '18 12:09 ghost

Cool, that sounds like a good way to implement it and works well with the Cypher projection option too. I think it can probably just be any double value and then we'll sum them all together and come up with the probability

mneedham avatar Sep 28 '18 12:09 mneedham