parallelize clique expansion
On relatively large graphs (n > ~10k) with colocated points, the clique expansion function can get really slow. After interrupting it a few times, I think this for-loop is where things are getting stuck. If that's the bottleneck, it should be pretty straightforward to parallelize with something like joblib.
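A minimal sketch of the shape that parallelization could take, using stdlib `concurrent.futures` in place of joblib (joblib's `Parallel`/`delayed` would look similar, and its process backend sidesteps the GIL for pure-Python work). The `cliques` dict and `expand` function here are hypothetical stand-ins for the real loop body:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Hypothetical: original point ids grouped by unique location,
# standing in for whatever the real for-loop iterates over.
cliques = {"locA": [0, 1, 2], "locB": [3, 4], "locC": [5]}

def expand(ids):
    # Hypothetical per-location work: all-pairs links within one clique.
    return list(combinations(ids, 2))

# Map the per-location work across a pool instead of a serial for-loop.
with ThreadPoolExecutor(max_workers=4) as pool:
    edge_lists = list(pool.map(expand, cliques.values()))

# Flatten the per-clique results back into one edge list.
edges = [e for chunk in edge_lists for e in chunk]
```

Whether this pays off depends on how heavy each iteration is; for cheap per-clique work the scheduling overhead can dominate.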
My original implementation was pure pandas, based on join and explode with no loops. Would it make sense to revisit that?
Yeah, I was wondering about explode there. I presume that would be even faster?
I think so.
We moved away from it because (a) hashing the points to establish unique locations was expensive and (b) our policy on sorting was different, but I think we can still use pandas join and explode given the new function signature.
The earlier implementation, conceptually:
- Build a series indexed on unique locations, whose values are lists of all the original indices at that location. This used a groupby on the geometries and agg(list) on the ids.
- Explode the aggregated series.
- Join the exploded series back to the adjacency table, letting the join handle the one-to-many expansion for edges entering/leaving cliques.
- Separately, build the all-pairs links within each clique using a self-join.
- Stack the within-clique and between-clique tables.
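The steps above could be sketched roughly like this (column names, the toy point table, and the location-level adjacency table are all illustrative, not the real schema):

```python
import pandas as pd

# Hypothetical inputs: points with colocated ids, and adjacency
# between unique locations (cliques).
points = pd.DataFrame({"id": [0, 1, 2, 3],
                       "loc": ["A", "A", "B", "C"]})
adj = pd.DataFrame({"src": ["A", "B"], "dst": ["B", "C"]})

# 1. Series indexed on unique locations; values are lists of original ids.
groups = points.groupby("loc")["id"].agg(list)

# 2. Explode the aggregated series: one row per (location, id).
exploded = groups.explode().reset_index()

# 3. Join back to the adjacency table; the join does the one-to-many
#    expansion for edges entering/leaving each clique.
between = (adj
           .merge(exploded.rename(columns={"loc": "src", "id": "u"}), on="src")
           .merge(exploded.rename(columns={"loc": "dst", "id": "v"}), on="dst"))

# 4. All-pairs links within each clique via a self-join, dropping self-loops.
within = (exploded.merge(exploded, on="loc", suffixes=("_u", "_v"))
          .query("id_u != id_v")
          .rename(columns={"id_u": "u", "id_v": "v"}))

# 5. Stack the within-clique and between-clique edge tables.
result = pd.concat([between[["u", "v"]], within[["u", "v"]]],
                   ignore_index=True)
```

The open question from (a) above still applies: establishing the unique locations (the `groupby` key) is where the geometry hashing cost shows up.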