agensgraph icon indicating copy to clipboard operation
agensgraph copied to clipboard

Finding connected component using a vertex property

Open 4j4y opened this issue 3 years ago • 2 comments

Hello, I would like to find the connected component in a graph that has given vertex. For example:

Sample Dataset

CREATE GRAPH dummy_with_epochs_4;
SET graph_path=dummy_with_epochs_4;
create elabel dates;
create (alice: female { name: 'Alice'}), (barbara: female { name: 'Barbara'}), (chloe: female { name: 'Chloe'}), (diane: female { name: 'Diane'}), (eve: female { name: 'Eve'}), (uhura: female { name: 'Uhura'}), (adam: male { name: 'Adam'}), (bob: male { name: 'Bob'}), (charles: male { name: 'Charles'}), (daniel: male { name: 'Daniel'}), (edgar: male { name: 'Edgar'}), (spock: male { name: 'Spock'});
match (a:female) , (b:male)  where a.name = 'Alice' and b.name =  'Bob' merge (a)-[:dates {epoch:[2]}]-(b);
  match (a:female) , (b:male)  where a.name = 'Barbara' and b.name =  'Bob' merge (a)-[:dates {epoch:[3]}]-(b);
  match (a:female) , (b:male)  where a.name = 'Barbara' and b.name =  'Charles' merge (a)-[:dates {epoch:[1]}]-(b);
  match (a:female) , (b:male)  where a.name = 'Barbara' and b.name =  'Edgar' merge (a)-[:dates {epoch:[2]}]-(b);
  match (a:female) , (b:male)  where a.name = 'Chloe' and b.name =  'Bob' merge (a)-[:dates {epoch:[1]}]-(b);
  match (a:female) , (b:male)  where a.name = 'Chloe' and b.name =  'Daniel' merge (a)-[:dates {epoch:[3]}]-(b);
  match (a:female) , (b:male)  where a.name = 'Chloe' and b.name =  'Edgar' merge (a)-[:dates {epoch:[1]}]-(b);
  match (a:female) , (b:male)  where a.name = 'Diane' and b.name =  'Edgar' merge (a)-[:dates {epoch:[2]}]-(b);
  match (a:female) , (b:male)  where a.name = 'Eve' and b.name =  'Edgar' merge (a)-[:dates {epoch:[3]}]-(b);
  match (a:female) , (b:male)  where a.name = 'Uhura' and b.name =  'Spock' merge (a)-[:dates {epoch:[2]}]-(b);

Edge presence

match (f:female {name: 'Barbara'})-[d:dates]-(m:Male {name:'Bob'}) return count(d);  

Edge insert

match (a:female), (m:Male) WHERE a.name = 'Barbara' and b.name = 'Bob' merge (a)-[d:dates {epoch: [3]}]-(b);   

Edge Upsert

foobar=# MATCH (c:female {name: 'Barbara'})-[d:dates]-(m:Male {name: 'Bob'}) SET d.epoch = d.epoch || 4::jsonb;  

Find the connected component nodes (apart from itself)

foobar=# MATCH (f:female { name: 'Barbara' })-[date:dates*]-(b) where  all(d in date::jsonb where any(epoch in d.properties.epoch where epoch > 3))  return distinct b::jsonb;
        b
-----------------
 {"name": "Bob"}
(1 row)

Our dataset size is 8 million events ingested with a total of 2397596 nodes and 2632684 edges.

Here the problem is when we try to find the connected component for a certain epoch range, it takes lots of time (In the magnitude of minutes in some cases) in iterating over the epoch array.

I tried to create a GIN index on epoch but it doesn't work with greater-than and less-than operators as GIN indexes only works with exists operator.

Is there another way to find the connected component size in the given epoch range? Or someway to optimize the performance of the existing query?

4j4y avatar Oct 19 '20 16:10 4j4y

Any help with this? @jrgemignani @protodef

4j4y avatar Oct 21 '20 07:10 4j4y

(FYI A slight change in the above-mentioned edge property epoch from array to an integer.)

Query plan looks like this:

MATCH (:male { name: 'Barbara' })-[date:dates*]-(b) where  all(dates in date::jsonb where dates.properties.epoch >= 160142400)  return distinct b::jsonb;
 b
---
                                                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------
 Unique  (cost=533.73..574.39 rows=8131 width=32) (actual time=37079.043..37548.732 rows=16 loops=1)
   ->  Sort  (cost=533.73..554.06 rows=8131 width=32) (actual time=37079.042..37373.074 rows=411169 loops=1)
         Sort Key: (ROW(b.id, b.properties, b.ctid)::vertex)
         Sort Method: external merge  Disk: 31672kB
         ->  Nested Loop  (cost=1.27..5.66 rows=8131 width=32) (actual time=0.081..36115.146 rows=411169 loops=1)
               ->  Nested Loop  (cost=1.27..4.72 rows=1 width=8) (actual time=0.068..35025.686 rows=411169 loops=1)
                     ->  Index Scan using male_name_idx on male "<0000000078>"  (cost=0.42..0.45 rows=1 width=8) (actual time=0.024..0.025 rows=1 loops=1)
                           Index Cond: (properties.'name'::text = '"Barbara"'::jsonb)
                     ->  Subquery Scan on date  (cost=0.85..4.26 rows=1 width=8) (actual time=0.043..34980.963 rows=411169 loops=1)
                           Filter: (length([dates IN to_jsonb(date.edges) WHERE (dates.'properties'::text.'epoch'::text >= to_jsonb(160142400))]) = length(to_jsonb(date
.edges)))
                           ->  Nested Loop VLE [1..]  (cost=0.85..3.22 rows=38 width=128) (actual time=0.016..2352.035 rows=411169 loops=1)
                                 ->  Result  (cost=0.42..0.93 rows=2 width=80) (actual time=0.015..0.020 rows=1 loops=1)
                                       ->  Append  (cost=0.42..0.91 rows=2 width=114) (actual time=0.011..0.015 rows=1 loops=1)
                                             ->  Index Scan using dates_start_idx on dates  (cost=0.42..0.45 rows=1 width=114) (actual time=0.011..0.012 rows=1 loops=1)
                                                   Index Cond: ("<0000000078>".id = start)
                                             ->  Index Scan using dates_end_idx on dates dates_1  (cost=0.42..0.45 rows=1 width=114) (actual time=0.002..0.002 rows=0 loops=1)
                                                   Index Cond: ("<0000000078>".id = "end")
                                 ->  Result  (cost=0.42..0.93 rows=2 width=48) (actual time=0.002..0.005 rows=3 loops=411169)
                                       ->  Append  (cost=0.42..0.91 rows=2 width=106) (actual time=0.002..0.004 rows=3 loops=411169)
                                             ->  Index Scan using dates_start_idx on dates dates_2  (cost=0.42..0.45 rows=1 width=106) (actual time=0.001..0.002 rows=1 loops=4111
69)
                                                   Index Cond: ($1 = start)
                                             ->  Index Scan using dates_end_idx on dates dates_3  (cost=0.42..0.45 rows=1 width=106) (actual time=0.001..0.002 rows=2 loops=411169
)
                                                   Index Cond: ($1 = "end")
               ->  Append  (cost=0.00..0.91 rows=3 width=49) (actual time=0.002..0.002 rows=1 loops=411169)
                     ->  Seq Scan on ag_vertex b  (cost=0.00..0.00 rows=1 width=46) (actual time=0.000..0.000 rows=0 loops=411169)
                           Filter: (date._end = id)
                     ->  Index Scan using male_pkey on male b_1  (cost=0.42..0.45 rows=1 width=44) (actual time=0.001..0.001 rows=1 loops=411169)
                           Index Cond: (id = date._end)
                     ->  Index Scan using device_pkey on device b_2  (cost=0.42..0.45 rows=1 width=54) (actual time=0.000..0.000 rows=0 loops=411169)
                           Index Cond: (id = date._end)
 Planning time: 0.688 ms
 Execution time: 37552.191 ms

As you can see most of the time is taken in filtering the edges on the epoch. I am getting total paths as 411169, which are quite high it is taking too much time. Is there a better way to find the minimum number of paths between given vertex and other nodes in the component. Ideally I think it should have been maximum of 15 paths.

4j4y avatar Oct 21 '20 12:10 4j4y