libpysal parallelize clique expansion

on relatively large graphs (n > ~10k) with colocated points, the clique expansion function can get really slow. After cutting it short a few times, I think this for-loop is where things are getting stuck. If that's the bottleneck, I think it should be pretty straightforward to parallelize using something like joblib

Aug 08 '24 19:08 knaaptime

My original implementation was pure pandas, based on join and explode with no loops. Would it make sense to revisit that?

Aug 08 '24 21:08 ljwolf

yeah i was wondering about explode there. I presume that would be even faster?

Aug 08 '24 21:08 knaaptime

I think so.

We moved away from it because (a) hashing the points to establish unique locations was expensive and (b) our policy on sorting was different, but I think we can still use pandas join and explode given the new function signature.

The earlier implementation, conceptually:

Built a series indexed on unique locations with values being a list of all the original indices at that location. This used a groupby on the geometries and agg(list) on the ids
Explode the aggregated series
join the exploded series back to the adjacency table, allowing join to handle the one to many expansion for edges entering/leaving cliques.
separately, built the all-pairs links within each clique using a join.
stack the within clique and beyond clique tables.

Aug 08 '24 21:08 ljwolf