neo4j-graph-algorithms

Jaccard Similarity doesn't work with concurrency

Open JorenVdV opened this issue 6 years ago • 3 comments

Problem

When running the Jaccard similarity algorithm over a list of node and category entries, all similarities are 0 unless the concurrency is explicitly set to 1.

Environment

Docker image running Neo4j 3.5.3 and graph algorithms 3.5.3.3. Memory is limited to 16G; CPUs are unbound (192 CPUs in the machine, shared with other processes).

Setup

MERGE (french:Cuisine {name:'French'})
MERGE (italian:Cuisine {name:'Italian'})
MERGE (indian:Cuisine {name:'Indian'})
MERGE (lebanese:Cuisine {name:'Lebanese'})
MERGE (portuguese:Cuisine {name:'Portuguese'})

MERGE (zhen:Person {name: "Zhen"})
MERGE (praveena:Person {name: "Praveena"})
MERGE (michael:Person {name: "Michael"})
MERGE (arya:Person {name: "Arya"})
MERGE (karin:Person {name: "Karin"})

MERGE (praveena)-[:LIKES]->(indian)
MERGE (praveena)-[:LIKES]->(portuguese)

MERGE (zhen)-[:LIKES]->(french)
MERGE (zhen)-[:LIKES]->(indian)

MERGE (michael)-[:LIKES]->(french)
MERGE (michael)-[:LIKES]->(italian)
MERGE (michael)-[:LIKES]->(indian)

MERGE (arya)-[:LIKES]->(lebanese)
MERGE (arya)-[:LIKES]->(italian)
MERGE (arya)-[:LIKES]->(portuguese)

MERGE (karin)-[:LIKES]->(lebanese)
MERGE (karin)-[:LIKES]->(italian)

Queries

MATCH (b:Person)-[v:LIKES]->(c:Cuisine)
WITH {item:id(b), categories: collect(id(c))} as vacatureData limit 50000
WITH collect(vacatureData) as data

CALL algo.similarity.jaccard(data, {concurrency:1, similarityCutoff:0.1})
YIELD nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95
RETURN nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95

results in

╒═══════╤═════════════════╤═══════════════════╤══════════════════╤═══════════════════╤══════════════════╤══════════════════╤═══════════════════╤══════════════════╤══════════════════╕
│"nodes"│"similarityPairs"│"min"              │"max"             │"mean"             │"p25"             │"p50"             │"p75"              │"p90"             │"p95"             │
╞═══════╪═════════════════╪═══════════════════╪══════════════════╪═══════════════════╪══════════════════╪══════════════════╪═══════════════════╪══════════════════╪══════════════════╡
│5      │7                │0.19999980926513672│0.6666669845581055│0.37380967821393696│0.2500009536743164│0.2500009536743164│0.33333301544189453│0.6666669845581055│0.6666669845581055│
└───────┴─────────────────┴───────────────────┴──────────────────┴───────────────────┴──────────────────┴──────────────────┴───────────────────┴──────────────────┴──────────────────┘
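As a sanity check on what the correct output should be, the pairwise Jaccard scores for the sample data can be recomputed outside Neo4j. The following is a minimal Python sketch, independent of the library; the names `likes`, `jaccard` and `similar_pairs` are illustrative only, and treating the cutoff as inclusive (`>=`) is an assumption that does not affect this data set.

```python
from itertools import combinations
from statistics import mean

# Same data as the Setup section: person -> set of liked cuisines.
likes = {
    "Praveena": {"Indian", "Portuguese"},
    "Zhen": {"French", "Indian"},
    "Michael": {"French", "Italian", "Indian"},
    "Arya": {"Lebanese", "Italian", "Portuguese"},
    "Karin": {"Lebanese", "Italian"},
}


def jaccard(a, b):
    """Jaccard similarity: |a AND b| / |a OR b| (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(data, cutoff):
    """Return {(p, q): score} for every unordered pair scoring at least cutoff."""
    result = {}
    for (p, a), (q, b) in combinations(data.items(), 2):
        score = jaccard(a, b)
        if score >= cutoff:  # assumption: inclusive cutoff
            result[(p, q)] = score
    return result


pairs = similar_pairs(likes, cutoff=0.1)
scores = sorted(pairs.values())
print(len(pairs), scores[0], scores[-1], mean(scores))
```

Up to floating-point rounding, this gives 7 pairs with a minimum of 0.2, a maximum of 2/3 and a mean of about 0.3738, which agrees with the `concurrency:1` result above.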

removing the concurrency limit

MATCH (b:Person)-[v:LIKES]->(c:Cuisine)
WITH {item:id(b), categories: collect(id(c))} as vacatureData limit 50000
WITH collect(vacatureData) as data

CALL algo.similarity.jaccard(data,  {similarityCutoff:0.1})
YIELD nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95
RETURN nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95

results in

╒═══════╤═════════════════╤═════╤═════╤══════╤═════╤═════╤═════╤═════╤═════╕
│"nodes"│"similarityPairs"│"min"│"max"│"mean"│"p25"│"p50"│"p75"│"p90"│"p95"│
╞═══════╪═════════════════╪═════╪═════╪══════╪═════╪═════╪═════╪═════╪═════╡
│5      │0                │0.0  │0.0  │0.0   │0.0  │0.0  │0.0  │0.0  │0.0  │
└───────┴─────────────────┴─────┴─────┴──────┴─────┴─────┴─────┴─────┴─────┘

Setting the concurrency to any value other than 1 produces the latter result. The same behaviour is observed when running our 300k-node Jaccard computation.

JorenVdV, Apr 16 '19 13:04

Hey,

I'll take a look at it. I've seen this happen sporadically, but I haven't been able to figure out exactly why, as annoyingly it doesn't happen every time.

E.g. I just tested this on a Docker image and it gives the same results with concurrency 1 and concurrency > 1.

Cheers, Mark

mneedham, Apr 17 '19 14:04

Any resolution on this? I'm still not able to use more than one core with algo.similarity.jaccard. I'm running 3.5.8 EE.

d-kilc, Mar 09 '20 22:03

Please check https://github.com/neo4j/graph-data-science. It is the successor to the graph algorithms library and has improved graph algorithms.

tomasonjo, Mar 10 '20 08:03