Support for Fixing the number of clusters
Hi @vtraag - thank you for this amazing repo. It has been immensely useful.
I have a feature request: can you add an additional argument that fixes the number of communities? Say I want 5 communities (even if that results in a partition with a lower modularity score) - is there a way to fix that?
While going through your implementation, I have thought of a possible way of doing this (more like a post-processing step, with the assumption that Leiden will always give more communities than I need).
But I wanted to check with you first before going there.
> I have a feature request: can you add an additional argument that fixes the number of communities? Say I want 5 communities (even if that results in a partition with a lower modularity score) - is there a way to fix that?
In principle, you can do this by optimising an existing partition, where you assign the partition a random initial membership limited to the desired number of communities. For example,
```python
n_comms = 5
partition = la.CPMVertexPartition(G,
                                  initial_membership=np.random.choice(n_comms, 100),
                                  resolution_parameter=0.5)
```

where `np` refers to `numpy`.
If you then turn off the option to also consider empty communities when moving nodes, the optimiser only considers existing communities, i.e. it will never go beyond the indicated number of communities `n_comms`.
```python
opt = la.Optimiser()
opt.consider_empty_community = False
opt.optimise_partition(partition)
```
Of course, it is possible that you will end up having fewer communities than `n_comms`, because such a partition will have a higher quality value. However, it will never have more communities than `n_comms`.
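Putting the pieces together, a minimal end-to-end sketch could look as follows (the random test graph and its parameters are just placeholders, not part of the suggestion itself):

```python
import igraph as ig
import leidenalg as la
import numpy as np

# a toy graph to run the example on; any igraph graph will do
G = ig.Graph.Erdos_Renyi(n=100, p=0.05)

n_comms = 5
# start from a random assignment restricted to n_comms communities
partition = la.CPMVertexPartition(
    G,
    initial_membership=np.random.choice(n_comms, G.vcount()).tolist(),
    resolution_parameter=0.5)

# with empty communities disabled, the optimiser never creates a
# community beyond the initial n_comms
opt = la.Optimiser()
opt.consider_empty_community = False
opt.optimise_partition(partition)

print(len(partition.sizes()))  # at most n_comms
```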
> While going through your implementation, I have thought of a possible way of doing this (more like a post-processing step, with the assumption that Leiden will always give more communities than I need).
>
> But I wanted to check with you first before going there.
What was the idea you had in mind? I'm curious to hear about it!
@vtraag - thanks for your immediate response. I will consider your snippet as one of the possible ways to attack this.
Regarding my idea, it goes like this:
- After performing the partition algorithm, let's say we get `K` clusters. I assume that `K` typically won't be above 1000.
- I then create the aggregated graph of the partition.
- If my target number of clusters is, say, `M`: I intend to iterate through each of the smallest `K - M` partitions, and for each one check the decrease in modularity when combining it with one of the `M` largest partitions. The partition for which the decrease is minimal is the one chosen for the merge.
- I think the time complexity should be `O(K^2 * time_taken_for_difference_in_modularity)`; a sketch of the procedure follows below.
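To make the procedure concrete, here is rough pseudocode for the loop I have in mind (illustrative only; `modularity_decrease` and `merge_communities` are hypothetical helpers, not leidenalg functions):

```python
# illustrative pseudocode for the merge idea; 'modularity_decrease' and
# 'merge_communities' are hypothetical helpers
def reduce_to_M(partition, M):
    while len(partition.sizes()) > M:
        sizes = partition.sizes()
        communities = sorted(range(len(sizes)), key=lambda c: sizes[c])
        smallest = communities[0]   # community to merge away
        largest = communities[-M:]  # the M largest communities
        # merge into whichever large community loses the least modularity
        best = min(largest,
                   key=lambda c: modularity_decrease(partition, smallest, c))
        partition = merge_communities(partition, smallest, best)
    return partition
```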
Do you see any problem with this approach?
This is also an interesting approach. It is doing something else though: it tries to deal with minimum community sizes somehow. This is what is discussed in https://github.com/vtraag/leidenalg/issues/53, with the approach that I outlined being very similar to what you propose here.
@vtraag - I am not sure I understand why this would ensure a minimum community size.
If K = 10 and M = 5 and the sizes of the clusters are [10,9,8,6,1,1,1,1,1,1], then using the above approach, is it not possible that the new list of cluster sizes becomes [15,9,8,6,1]? That is, five of the six single nodes get merged into the 10-cluster.
If this is not possible, can you shed some light on why that wouldn't happen?
> If K = 10 and M = 5 and the sizes of the clusters are [10,9,8,6,1,1,1,1,1,1], then using the above approach, is it not possible that the new list of cluster sizes becomes [15,9,8,6,1]? That is, five of the six single nodes get merged into the 10-cluster.
Yes, it is certainly possible that this would be an outcome.
> I check the decrease in modularity when combining the partition with one of the `M` largest partitions. The partition for which the decrease is minimal is the one chosen for the merge.
The way you formulated this, you were interested in simply keeping the largest `M` communities, not just retaining a specific number of communities.
All in all, it depends a bit on what you are trying to do. For example, if you are interested in just restricting the number of communities, it might be that a more even distribution of communities would be better than simply aggregating existing communities. For example, it could be that the communities with sizes [10,9,8,6,1,1,1,1,1,1] should become communities of sizes [8,8,8,8,7] if you want to restrict the number of communities. With strictly aggregating existing communities you don't necessarily arrive at that outcome. If you want to simply restrict the number of communities, I think the approach I outlined in my earlier reply would be preferable to aggregating communities.
@vtraag - I see the confusion now.
It is not a requirement for me that the clusters are evenly distributed. I just need to make sure that exactly `M` clusters are provided.
I coded up the routine I mentioned in my previous messages. This is the trend I observe: the Leiden algorithm outputs more than 20 clusters, which I iteratively decrease to 8.
```
[136, 41, 86, 121, 54, 50, 48, 42, 28, 28, 14, 14, 8, 7, 5, 5, 4, 2, 2, 2, 2, 2]
[136, 41, 86, 121, 54, 50, 48, 44, 28, 28, 14, 14, 8, 7, 5, 5, 4, 2, 2, 2, 2]
[136, 41, 86, 121, 54, 50, 48, 46, 28, 28, 14, 14, 8, 7, 5, 5, 4, 2, 2, 2]
[136, 41, 86, 121, 54, 50, 48, 48, 28, 28, 14, 14, 8, 7, 5, 5, 4, 2, 2]
[136, 41, 86, 121, 54, 50, 48, 50, 28, 28, 14, 14, 8, 7, 5, 5, 4, 2]
[136, 41, 86, 121, 54, 50, 48, 52, 28, 28, 14, 14, 8, 7, 5, 5, 4]
[136, 41, 86, 121, 54, 50, 48, 56, 28, 28, 14, 14, 8, 7, 5, 5]
[136, 41, 91, 121, 54, 50, 48, 56, 28, 28, 14, 14, 8, 7, 5]
[141, 41, 91, 121, 54, 50, 48, 56, 28, 28, 14, 14, 8, 7]
[141, 41, 91, 121, 61, 50, 48, 56, 28, 28, 14, 14, 8]
[141, 41, 91, 121, 61, 50, 48, 64, 28, 28, 14, 14]
[141, 55, 91, 121, 61, 50, 48, 64, 28, 28, 14]
[141, 55, 91, 121, 61, 50, 62, 64, 28, 28]
[141, 55, 91, 121, 89, 50, 62, 64, 28]
```
OK, good to see you got something working that does the trick for you! If you post your code here, it might also help some other people who would like to do the same.
One final suggestion: in your final result consisting of 8 communities, you can still let the Leiden algorithm run while setting `consider_empty_community = False`. That way, it will keep only the 8 communities that you have already detected, but it will still try to find a better assignment.
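In code, that final refinement step would look something like this (a short sketch, assuming `partition` holds the 8 communities found above):

```python
# refine the fixed set of 8 communities without opening new ones
opt = la.Optimiser()
opt.consider_empty_community = False
opt.optimise_partition(partition)
```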
@vtraag - thanks for your suggestion. I have incorporated it in my code. Please take a look at the code below and, if possible, provide feedback.
```python
import numpy as np
from leidenalg import Optimiser

### assuming that 'partition' is already an output that has come from running Leiden,
### and that the number of communities found by Leiden is too large. We want to reduce
### the number of communities to 'num_clusters'.
def reduce_num_clusters(partition, num_clusters):

    # instantiate optimiser
    opt = Optimiser()
    opt.consider_empty_community = False

    # get the sizes of the community clusters
    sizes = partition.sizes()

    if len(sizes) < num_clusters:
        raise ValueError("Number of communities is already smaller than 'num_clusters'")
    elif len(sizes) == num_clusters:
        return partition
    else:
        orig_num_clusters = len(sizes)
        for i in range(num_clusters, orig_num_clusters):
            # from the node-level partition, aggregate to form the super nodes
            aggregate_partition = partition.aggregate_partition(partition)

            # get the sizes of the community clusters
            sizes = np.array(partition.sizes())

            # take the smallest community with the largest index
            minimum = np.min(sizes)
            smallest = np.where(sizes == minimum)[0][-1]

            # the 'num_clusters' largest communities
            arg_sort = np.argsort(sizes)
            fixed = arg_sort[len(sizes) - num_clusters:]

            max_diff = -np.inf
            cluster_idx = -1
            for fix in fixed:
                diff = aggregate_partition.diff_move(smallest, fix)
                if diff > max_diff:
                    max_diff = diff
                    cluster_idx = fix

            # found the cluster that it should be a part of
            aggregate_partition.move_node(smallest, cluster_idx)

            # now one of the communities has zero size. That community can have the
            # 'last' index in the sizes() array or it can be somewhere in the 'middle'.
            # If it has the 'last' index then the 'from_coarse_partition' function will
            # automatically remove the last cluster, but if the empty community is in
            # the middle (and not at the last index) then 'from_coarse_partition' won't.
            # To solve the latter, we optimise the partition.
            if not aggregate_partition.sizes()[-1] == 0:
                # remove the empty community that exists in the 'middle'
                opt.optimise_partition(aggregate_partition)

            # update partition with the new community assignment
            partition.from_coarse_partition(aggregate_partition)

        # final refinement
        opt.optimise_partition(partition)

        # sanity check
        assert len(partition.sizes()) == num_clusters

        return partition
```
I also have a small question regarding your suggestion. If I run the Leiden algorithm on my final output, is it possible that I will get fewer than `M` clusters?
@vtraag - I have also found the function `renumber_communities`, which renumbers the sizes array. This is to avoid the complicated logic below.
```python
# now one of the communities has zero size. That community can have the
# 'last' index in the sizes() array or it can be somewhere in the 'middle'.
# If it has the 'last' index then the 'from_coarse_partition' function will
# automatically remove the last cluster, but if the empty community is in
# the middle (and not at the last index) then 'from_coarse_partition' won't.
# To solve the latter, we optimise the partition.
if not aggregate_partition.sizes()[-1] == 0:
    # remove the empty community that exists in the 'middle'
    opt.optimise_partition(aggregate_partition)
```
https://github.com/vtraag/leidenalg/blob/c4dcc64a4b6895dcc269f6e332431120408f0802/src/leidenalg/VertexPartition.py#L256
Can you please verify from your end that I can use this to keep the zero-sized cluster at the end?
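Concretely, I imagine replacing the workaround with something like this (a sketch, assuming `aggregate_partition` as in my script above):

```python
# renumber communities by decreasing size, so the zero-sized community can
# no longer sit in the 'middle' of the membership
aggregate_partition.renumber_communities()
partition.from_coarse_partition(aggregate_partition)
```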
Overall, the script looks good to me, thanks for sharing!
```python
if not aggregate_partition.sizes()[-1] == 0:
    # remove the empty community that exists in the 'middle'
    opt.optimise_partition(aggregate_partition)
```
In principle this not only removes the last empty community, it also optimises the remaining assignments (without considering an empty community). It is possible that this already gets you fewer than `M` clusters, but this is unlikely.
> I also have a small question regarding your suggestion. If I run the Leiden algorithm on my final output, is it possible that I will get fewer than `M` clusters?
Yes, similar to the `optimise_partition` call that you use earlier, in principle the last call could result in fewer than `M` clusters. However, given that the original quality is higher with more clusters (presumably), I think it is unlikely to result in fewer clusters.
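If that risk matters in practice, one could guard against it by saving the membership before the final call and rolling back (a hypothetical safeguard; `G`, `num_clusters`, and `ModularityVertexPartition` stand in for whatever graph, target count, and quality function are actually in use):

```python
# hypothetical safeguard: remember the membership before the final refinement
membership_before = list(partition.membership)
opt.optimise_partition(partition)
if len(partition.sizes()) < num_clusters:
    # the refinement merged communities below the target; rebuild the
    # partition from the saved membership instead
    partition = la.ModularityVertexPartition(G, initial_membership=membership_before)
```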
> @vtraag - I have also found the function `renumber_communities`, which renumbers the sizes array. Can you please verify from your end that I can use this to keep the zero-sized cluster at the end?
Yes, this will remove empty communities. In principle, it would probably be better to indeed remove empty communities using `renumber_communities` instead of `optimise_partition`, and only run `optimise_partition` once when you have obtained the desired number of clusters.
Finally, one other comment: I don't think it's necessary to get the aggregate partition and update the original partition for each cluster that you change. You can simply get the aggregate partition once, then update the original partition using `from_coarse_partition` at the very end. You only need to make sure that you carefully track the total size of each community in `sizes`. However, if this script already does the job for you and you don't need to make it more efficient, then you can just forget about this.
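Along those lines, here is a rough sketch of what the restructured routine might look like (untested; it assumes the smallest non-empty community always consists of a single aggregate node, which holds because only the `num_clusters` largest communities ever receive merges):

```python
import numpy as np
from leidenalg import Optimiser

def reduce_num_clusters_fast(partition, num_clusters):
    # aggregate only once: node i of the aggregate graph corresponds to
    # community i of the original partition
    aggregate_partition = partition.aggregate_partition(partition)
    # track community sizes ourselves instead of re-aggregating every iteration
    sizes = np.array(partition.sizes())
    for _ in range(len(sizes) - num_clusters):
        order = np.argsort(sizes)
        nonempty = order[sizes[order] > 0]
        # the smallest non-empty community still holds a single aggregate
        # node, so moving that one node empties the community
        smallest = int(nonempty[0])
        # merge targets: the current 'num_clusters' largest communities
        candidates = nonempty[-num_clusters:]
        diffs = [aggregate_partition.diff_move(smallest, int(c)) for c in candidates]
        best = int(candidates[int(np.argmax(diffs))])
        aggregate_partition.move_node(smallest, best)
        sizes[best] += sizes[smallest]
        sizes[smallest] = 0
    # compact away the empty communities, then map back to the original nodes
    aggregate_partition.renumber_communities()
    partition.from_coarse_partition(aggregate_partition)
    # a single final refinement with the community count fixed
    opt = Optimiser()
    opt.consider_empty_community = False
    opt.optimise_partition(partition)
    return partition
```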
@vtraag - thanks for your approval. As of now, performance does not seem to be a concern.
@vtraag - do you think this would be a useful addition to your code repository? If you think so, I do not mind creating a pull request for this functionality.
Sorry for the late response @Alex-Mathai-98! In principle, the functionality could be interesting for others. However, I would prefer to have an implementation in the C++ layer that could then be exposed in Python, instead of doing it all in Python. Moreover, performance will become an issue at some point, so I do prefer to make sure it is sufficiently performant before integrating it in the library. If you are open to doing this, I would be willing to work with you on a PR! If not, I'll try to see if I can find some time in the future to implement such a feature.
Sorry for bumping this issue, but I have a related question.
I would be interested to use the `consider_empty_community` feature from the `leiden` R package that interfaces with `leidenalg`. Unfortunately, this is not currently possible, as the main function of the `leiden` package relies on `leidenalg.find_partition`, which internally instantiates the optimiser object (see issue #27).
A possible solution would be to add a new argument `empty_community` to `find_partition` (same as for `max_comm_size`):

```diff
+ def find_partition(graph, partition_type, initial_membership=None, weights=None, n_iterations=2, max_comm_size=0, empty_community=True, seed=None, **kwargs):
      ....
      optimiser.max_comm_size = max_comm_size
+     optimiser.consider_empty_community = empty_community
```
Would such a change be possible or not?