libpysal icon indicating copy to clipboard operation
libpysal copied to clipboard

parallelize clique expansion

Open knaaptime opened this issue 1 year ago • 3 comments

on relatively large graphs (n > ~10k) with colocated points, the clique expansion function can get really slow. After cutting it short a few times, I think this for-loop is where things are getting stuck. If that's the bottleneck, I think it should be pretty straightforward to parallelize using something like joblib

knaaptime avatar Aug 08 '24 19:08 knaaptime

My original implementation was pure pandas, based on join and explode with no loops. Would it make sense to revisit that?

ljwolf avatar Aug 08 '24 21:08 ljwolf

yeah i was wondering about explode there. I presume that would be even faster?

knaaptime avatar Aug 08 '24 21:08 knaaptime

I think so.

We moved away from it because (a) hashing the points to establish unique locations was expensive and (b) our policy on sorting was different, but I think we can still use pandas join and explode given the new function signature.

The earlier implementation, conceptually:

  • Built a series indexed on unique locations with values being a list of all the original indices at that location. This used a groupby on the geometries and agg(list) on the ids
  • Explode the aggregated series
  • join the exploded series back to the adjacency table, allowing join to handle the one to many expansion for edges entering/leaving cliques.
  • separately, built the all-pairs links within each clique using a join.
  • stack the within clique and beyond clique tables.

ljwolf avatar Aug 08 '24 21:08 ljwolf