Support for Fixing the number of clusters
Hi @vtraag - thank you for this amazing repo. It has been immensely useful.
I have a feature request: can you add an additional argument that fixes the number of communities? Say I want 5 communities (even if that results in a partition with a lower modularity score) - is there a way to fix that?
While going through your implementation, I have thought of a possible way of doing this (more like a post-processing step, with the assumption that Leiden will always give more communities than I need).
But I wanted to check with you first before going there.
> I have a feature request: can you add an additional argument that fixes the number of communities? Say I want 5 communities (even if that results in a partition with a lower modularity score) - is there a way to fix that?
In principle, you can do this by optimising an existing partition, where you assign the partition a random initial membership limited to the desired number of communities. For example,
```python
n_comms = 5
partition = la.CPMVertexPartition(G,
                                  initial_membership=np.random.choice(n_comms, 100),
                                  resolution_parameter=0.5)
```

where `np` refers to `numpy`.
If you then turn off the option to also consider empty communities when moving nodes, the optimiser only considers existing communities, i.e. it will never go beyond the indicated number of communities `n_comms`.
```python
opt = la.Optimiser()
opt.consider_empty_community = False
opt.optimise_partition(partition)
```
Of course, it is possible that you will end up having fewer communities than `n_comms`, because such a partition will have a higher quality value. However, it will never have more communities than `n_comms`.
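Putting the pieces together, a minimal end-to-end sketch could look as follows (the random test graph and its parameters are just placeholders, not part of the suggestion itself):

```python
import igraph as ig
import leidenalg as la
import numpy as np

# a toy graph to run the example on; any igraph graph will do
G = ig.Graph.Erdos_Renyi(n=100, p=0.05)

n_comms = 5
# start from a random assignment restricted to n_comms communities
partition = la.CPMVertexPartition(
    G,
    initial_membership=np.random.choice(n_comms, G.vcount()).tolist(),
    resolution_parameter=0.5)

# with empty communities disabled, the optimiser never creates a
# community beyond the initial n_comms
opt = la.Optimiser()
opt.consider_empty_community = False
opt.optimise_partition(partition)

print(len(partition.sizes()))  # at most n_comms
```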
> While going through your implementation, I have thought of a possible way of doing this (more like a post-processing step, with the assumption that Leiden will always give more communities than I need).
>
> But I wanted to check with you first before going there.
What was the idea you had in mind? I'm curious to hear about it!
@vtraag - thanks for your immediate response. I will consider your snippet as one of the possible ways to attack this.
Regarding my idea, it goes like this:
- After performing the partition algorithm, let's say we get `K` clusters. I assume that `K` typically won't be above 1000.
- I then create the aggregated graph of the partition.
- If my target number of clusters is, say, `M`: I intend to iterate through each of the smallest `K - M` partitions, and for each one check the decrease in modularity when combining it with one of the `M` largest partitions. The partition for which the decrease is minimal is the one chosen for the merge.
- I think the time complexity should be `O(K^2 * time_taken_for_difference_in_modularity)`; a sketch of the procedure follows below.
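To make the procedure concrete, here is rough pseudocode for the loop I have in mind (illustrative only; `modularity_decrease` and `merge_communities` are hypothetical helpers, not leidenalg functions):

```python
# illustrative pseudocode for the merge idea; 'modularity_decrease' and
# 'merge_communities' are hypothetical helpers
def reduce_to_M(partition, M):
    while len(partition.sizes()) > M:
        sizes = partition.sizes()
        communities = sorted(range(len(sizes)), key=lambda c: sizes[c])
        smallest = communities[0]   # community to merge away
        largest = communities[-M:]  # the M largest communities
        # merge into whichever large community loses the least modularity
        best = min(largest,
                   key=lambda c: modularity_decrease(partition, smallest, c))
        partition = merge_communities(partition, smallest, best)
    return partition
```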
Do you see any problem with this approach?
This is also an interesting approach. It is doing something else though: it tries to deal with minimum community sizes somehow. This is what is discussed in https://github.com/vtraag/leidenalg/issues/53, with the approach that I outlined being very similar to what you propose here.
@vtraag - I am not sure I understand why this would ensure a minimum community size.
If K = 10 and M = 5 and the sizes of the clusters are [10,9,8,6,1,1,1,1,1,1], then using the above approach, is it not possible that the new list of cluster sizes becomes [15,9,8,6,1]? That is, five of the six single nodes get merged into the 10-cluster.
If this is not possible, can you shed some light on why that wouldn't happen?
> If K = 10 and M = 5 and the sizes of the clusters are [10,9,8,6,1,1,1,1,1,1], then using the above approach, is it not possible that the new list of cluster sizes becomes [15,9,8,6,1]? That is, five of the six single nodes get merged into the 10-cluster.
Yes, it is certainly possible that this would be an outcome.
> I check the decrease in modularity when combining the partition with one of the `M` largest partitions. The partition for which the decrease is minimal is the one chosen for the merge.
The way you formulated this, you were interested in simply keeping the largest `M` communities, not just retaining a specific number of communities.
All in all, it depends a bit on what you are trying to do. For example, if you are interested in just restricting the number of communities, it might be that a more even distribution of communities would be better than simply aggregating existing communities. For example, it could be that the communities with sizes [10,9,8,6,1,1,1,1,1,1] should become communities of sizes [8,8,8,8,7] if you want to restrict the number of communities. With strictly aggregating existing communities you don't necessarily arrive at that outcome. If you want to simply restrict the number of communities, I think the approach I outlined in my earlier reply would be preferable to aggregating communities.
@vtraag - I see the confusion now.
It is not a requirement for me that the clusters are evenly distributed. I just need to make sure that exactly `M` clusters are provided.
I coded up the routine I mentioned in my previous messages. This is the trend I observe: the Leiden algorithm outputs more than 20 clusters, which I iteratively decrease to 8.
```
[136, 41, 86, 121, 54, 50, 48, 42, 28, 28, 14, 14, 8, 7, 5, 5, 4, 2, 2, 2, 2, 2]
[136, 41, 86, 121, 54, 50, 48, 44, 28, 28, 14, 14, 8, 7, 5, 5, 4, 2, 2, 2, 2]
[136, 41, 86, 121, 54, 50, 48, 46, 28, 28, 14, 14, 8, 7, 5, 5, 4, 2, 2, 2]
[136, 41, 86, 121, 54, 50, 48, 48, 28, 28, 14, 14, 8, 7, 5, 5, 4, 2, 2]
[136, 41, 86, 121, 54, 50, 48, 50, 28, 28, 14, 14, 8, 7, 5, 5, 4, 2]
[136, 41, 86, 121, 54, 50, 48, 52, 28, 28, 14, 14, 8, 7, 5, 5, 4]
[136, 41, 86, 121, 54, 50, 48, 56, 28, 28, 14, 14, 8, 7, 5, 5]
[136, 41, 91, 121, 54, 50, 48, 56, 28, 28, 14, 14, 8, 7, 5]
[141, 41, 91, 121, 54, 50, 48, 56, 28, 28, 14, 14, 8, 7]
[141, 41, 91, 121, 61, 50, 48, 56, 28, 28, 14, 14, 8]
[141, 41, 91, 121, 61, 50, 48, 64, 28, 28, 14, 14]
[141, 55, 91, 121, 61, 50, 48, 64, 28, 28, 14]
[141, 55, 91, 121, 61, 50, 62, 64, 28, 28]
[141, 55, 91, 121, 89, 50, 62, 64, 28]
```
OK, good to see you got something working that does the trick for you! If you post your code here, it might also help some other people who would like to do the same.
One final suggestion: in your final result consisting of 8 communities, you can still let the Leiden algorithm run while setting `consider_empty_community = False`. That way, it will keep only the 8 communities that you have already detected, but it will still try to find a better assignment.
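In code, that final refinement step would look something like this (a short sketch, assuming `partition` holds the 8 communities found above):

```python
# refine the fixed set of 8 communities without opening new ones
opt = la.Optimiser()
opt.consider_empty_community = False
opt.optimise_partition(partition)
```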
@vtraag - thanks for your suggestion. I have incorporated it in my code. Please take a look at the code below and, if possible, provide feedback.
```python
import numpy as np
from leidenalg import Optimiser

### assuming that 'partition' is already an output that has come from running Leiden,
### and that the number of communities found by Leiden is too large. We want to reduce
### the number of communities to 'num_clusters'.
def reduce_num_clusters(partition, num_clusters):

    # instantiate optimiser
    opt = Optimiser()
    opt.consider_empty_community = False

    # get the sizes of the community clusters
    sizes = partition.sizes()

    if len(sizes) < num_clusters:
        raise ValueError("Number of communities is already smaller than 'num_clusters'")
    elif len(sizes) == num_clusters:
        return partition
    else:
        orig_num_clusters = len(sizes)
        for i in range(num_clusters, orig_num_clusters):
            # from the node-level partition, aggregate to form the super nodes
            aggregate_partition = partition.aggregate_partition(partition)

            # get the sizes of the community clusters
            sizes = np.array(partition.sizes())

            # take the smallest community with the largest index
            minimum = np.min(sizes)
            smallest = np.where(sizes == minimum)[0][-1]

            # the 'num_clusters' largest communities
            arg_sort = np.argsort(sizes)
            fixed = arg_sort[len(sizes) - num_clusters:]

            max_diff = -np.inf
            cluster_idx = -1
            for fix in fixed:
                diff = aggregate_partition.diff_move(smallest, fix)
                if diff > max_diff:
                    max_diff = diff
                    cluster_idx = fix

            # found the cluster that it should be a part of
            aggregate_partition.move_node(smallest, cluster_idx)

            # now one of the communities has zero size. That community can have the
            # 'last' index in the sizes() array or it can be somewhere in the 'middle'.
            # If it has the 'last' index then the 'from_coarse_partition' function will
            # automatically remove the last cluster, but if the empty community is in
            # the middle (and not at the last index) then 'from_coarse_partition' won't.
            # To solve the latter, we optimise the partition.
            if not aggregate_partition.sizes()[-1] == 0:
                # remove the empty community that exists in the 'middle'
                opt.optimise_partition(aggregate_partition)

            # update partition with the new community assignment
            partition.from_coarse_partition(aggregate_partition)

        # final refinement
        opt.optimise_partition(partition)

        # sanity check
        assert len(partition.sizes()) == num_clusters

        return partition
```
I also have a small question regarding your suggestion. If I run the Leiden algorithm on my final output, is it possible that I will get fewer than `M` clusters?
@vtraag - I have also found the function `renumber_communities`, which renumbers the sizes array. This is to avoid the complicated logic below.
```python
# now one of the communities has zero size. That community can have the
# 'last' index in the sizes() array or it can be somewhere in the 'middle'.
# If it has the 'last' index then the 'from_coarse_partition' function will
# automatically remove the last cluster, but if the empty community is in
# the middle (and not at the last index) then 'from_coarse_partition' won't.
# To solve the latter, we optimise the partition.
if not aggregate_partition.sizes()[-1] == 0:
    # remove the empty community that exists in the 'middle'
    opt.optimise_partition(aggregate_partition)
```
https://github.com/vtraag/leidenalg/blob/c4dcc64a4b6895dcc269f6e332431120408f0802/src/leidenalg/VertexPartition.py#L256
Can you please verify from your end that I can use this to keep the zero-sized cluster at the end?
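Concretely, I imagine replacing the workaround with something like this (a sketch, assuming `aggregate_partition` as in my script above):

```python
# renumber communities by decreasing size, so the zero-sized community can
# no longer sit in the 'middle' of the membership
aggregate_partition.renumber_communities()
partition.from_coarse_partition(aggregate_partition)
```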
Overall, the script looks good to me, thanks for sharing!
```python
if not aggregate_partition.sizes()[-1] == 0:
    # remove the empty community that exists in the 'middle'
    opt.optimise_partition(aggregate_partition)
```
In principle this not only removes the last empty community, it also optimises the remaining assignments (without considering an empty community). It is possible that this already gets you fewer than `M` clusters, but this is unlikely.
> I also have a small question regarding your suggestion. If I run the Leiden algorithm on my final output, is it possible that I will get fewer than `M` clusters?
Yes, similar to the `optimise_partition` call that you use earlier, in principle the last call could result in fewer than `M` clusters. However, given that the original quality is higher with more clusters (presumably), I think it is unlikely to result in fewer clusters.
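If that risk matters in practice, one could guard against it by saving the membership before the final call and rolling back (a hypothetical safeguard; `G`, `num_clusters`, and `ModularityVertexPartition` stand in for whatever graph, target count, and quality function are actually in use):

```python
# hypothetical safeguard: remember the membership before the final refinement
membership_before = list(partition.membership)
opt.optimise_partition(partition)
if len(partition.sizes()) < num_clusters:
    # the refinement merged communities below the target; rebuild the
    # partition from the saved membership instead
    partition = la.ModularityVertexPartition(G, initial_membership=membership_before)
```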
> @vtraag - I have also found the function `renumber_communities`, which renumbers the sizes array. Can you please verify from your end that I can use this to keep the zero-sized cluster at the end?
Yes, this will remove empty communities. In principle, it would probably be better to indeed remove empty communities using `renumber_communities` instead of `optimise_partition`, and only run `optimise_partition` once when you have obtained the desired number of clusters.
Finally, one other comment: I don't think it's necessary to get the aggregate partition and update the original partition for each cluster that you change. You can simply get the aggregate partition once, then update the original partition using `from_coarse_partition` at the very end. You only need to make sure that you carefully track the total size of each community in `sizes`. However, if this script already does the job for you and you don't need to make it more efficient, then you can just forget about this.
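Along those lines, here is a rough sketch of what the restructured routine might look like (untested; it assumes the smallest non-empty community always consists of a single aggregate node, which holds because only the `num_clusters` largest communities ever receive merges):

```python
import numpy as np
from leidenalg import Optimiser

def reduce_num_clusters_fast(partition, num_clusters):
    # aggregate only once: node i of the aggregate graph corresponds to
    # community i of the original partition
    aggregate_partition = partition.aggregate_partition(partition)
    # track community sizes ourselves instead of re-aggregating every iteration
    sizes = np.array(partition.sizes())
    for _ in range(len(sizes) - num_clusters):
        order = np.argsort(sizes)
        nonempty = order[sizes[order] > 0]
        # the smallest non-empty community still holds a single aggregate
        # node, so moving that one node empties the community
        smallest = int(nonempty[0])
        # merge targets: the current 'num_clusters' largest communities
        candidates = nonempty[-num_clusters:]
        diffs = [aggregate_partition.diff_move(smallest, int(c)) for c in candidates]
        best = int(candidates[int(np.argmax(diffs))])
        aggregate_partition.move_node(smallest, best)
        sizes[best] += sizes[smallest]
        sizes[smallest] = 0
    # compact away the empty communities, then map back to the original nodes
    aggregate_partition.renumber_communities()
    partition.from_coarse_partition(aggregate_partition)
    # a single final refinement with the community count fixed
    opt = Optimiser()
    opt.consider_empty_community = False
    opt.optimise_partition(partition)
    return partition
```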
@vtraag - thanks for your approval. As of now, performance does not seem to be a concern.
@vtraag - do you think this would be a useful addition to your code repository? If you think so, I do not mind creating a pull request for this functionality.
Sorry for the late response @Alex-Mathai-98! In principle, the functionality could be interesting for others. However, I would prefer to have an implementation in the C++ layer that could then be exposed in Python, instead of doing it all in Python. Moreover, performance will become an issue at some point, so I do prefer to make sure it is sufficiently performant before integrating it in the library. If you are open to doing this, I would be willing to work with you on a PR! If not, I'll try to see if I can find some time in the future to implement such a feature.
Sorry for bumping this issue, but I have a related question.
I would be interested to use the `consider_empty_community` feature from the `leiden` R package that interfaces with `leidenalg`. Unfortunately, this is not currently possible, as the main function of the `leiden` package relies on `leidenalg.find_partition`, which internally instantiates the optimiser object (see issue #27).
A possible solution would be to add a new argument `empty_community` to `find_partition` (same as for `max_comm_size`):

```diff
+ def find_partition(graph, partition_type, initial_membership=None, weights=None, n_iterations=2, max_comm_size=0, empty_community=True, seed=None, **kwargs):
      ....
      optimiser.max_comm_size = max_comm_size
+     optimiser.consider_empty_community = empty_community
```
Would such a change be possible or not?