Better PAM initialization with BUILD
The PAM/k-medoids implementation appears to implement SWAP, but not the BUILD part for initializing PAM. Instead you have to provide good starting medoids.
I tried benchmarking it on a larger data set (the well-known 20news data set in the sklearn.datasets.fetch_20newsgroups_vectorized version, with cosine distance and k=20), and the runtime of pyclustering kmedoids was extremely high (2000 sec), likely because of the poor random initialization.
With BUILD from the kmedoids package, I can reduce the run time to 338 sec. In comparison, pam from the kmedoids package takes just 37 seconds, including 4.6 seconds for BUILD, and the fasterpam variant finishes in 336 msec with random initialization.
It would also be good to get access to the final loss after optimization as well as the number of iterations. Then I could check if it ran into the maximum iteration limit, and if at least the result quality is comparable.
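For reference, a minimal sketch of that setup, assuming pyclustering's distance-matrix input mode (data_type='distance_matrix') and plain random starting medoids; this is illustrative only, not my exact benchmark script:
import random
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.metrics import pairwise_distances
from pyclustering.cluster.kmedoids import kmedoids

X = fetch_20newsgroups_vectorized()['data']                # sparse TF-IDF matrix
D = pairwise_distances(X, metric='cosine')                 # n x n cosine distance matrix
initial_medoids = random.sample(range(D.shape[0]), 20)     # random initialization, k=20
kmedoids_instance = kmedoids(D, initial_medoids, data_type='distance_matrix')
kmedoids_instance.process()
print("Medoids:", kmedoids_instance.get_medoids())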
Hello @kno10 ,
There is a k-means++ based algorithm to initialize it, which you refer to in your article:
In the experiments, we will also study whether a clever sampling-based approach similar to k-means++ (Arthur and Vassilvitskii, 2007) that needs only O(nk) time but produces a worse starting point is an interesting alternative.
It could be used to choose the initial medoids, and I expected that only the quality would be affected. Nevertheless, your outcome looks like a good argument for implementing the PAM BUILD algorithm for initialization. I am not sure whether it is convenient to make it part of the interface of the PAM algorithm.
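For example, a minimal sketch of that route, assuming the kmeans_plusplus_initializer from pyclustering.cluster.center_initializer and its return_index option:
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import FCPS_SAMPLES

sample = read_sample(FCPS_SAMPLES.SAMPLE_TETRA)
# Pick k = 4 initial medoids as point indices using k-means++ style sampling.
initial_medoids = kmeans_plusplus_initializer(sample, 4).initialize(return_index=True)
kmedoids_instance = kmedoids(sample, initial_medoids)
kmedoids_instance.process()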
Also, I have a question for you: did you use an installed pyclustering version or just a cloned version? In case of an installed version, the C++ implementation of the algorithm should be used; in case of a cloned version, the library uses the Python implementation due to the lack of binaries in the git repository.
It would also be good to get access to the final loss after optimization as well as the number of iterations
The number of iterations is clear. But I have a question about the final loss after optimization: do you mean total deviation (TD)?
I tried using your k-means++ initialization, but I believe it only accepts a data matrix, not a distance matrix, as input.
For that experiment I used "pip install pyclustering" on Google Colab, whatever that does by default.
Yes, TD (total deviation) is the loss.
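To spell it out, TD is the sum of distances from every point to its nearest medoid; a minimal sketch over a precomputed distance matrix D and a list of medoid indices (illustration only):
import numpy as np

def total_deviation(D, medoids):
    # TD: for each point take the distance to its closest medoid, then sum.
    return float(np.min(D[:, medoids], axis=1).sum())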
Thank you for the clarification. One more question:
I tried using your k-means++ initialization, but I believe it only accepts a data matrix, not a distance matrix, as input.
Was a distance matrix used as an input for the algorithm in the experiment?
Hello @kno10 ,
I have introduced the PAM BUILD algorithm in line with your article, and I have optimized the C++ version (which should be used by default) of the K-Medoids (PAM) algorithm. I have attached a pyclustering package (pyclustering-0.11.0.tar.gz) with all binaries, built from the master branch (not a release version), where these changes are available. Could you please check it again?
Just in case: K-Means++ supports distance matrices now. Let me know if you need a code example as well.
Here is an example of how to use K-Medoids with PAM BUILD:
from pyclustering.cluster.kmedoids import kmedoids, build
from pyclustering.cluster import cluster_visualizer
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import FCPS_SAMPLES
# Load list of points `Tetra` for cluster analysis.
sample = read_sample(FCPS_SAMPLES.SAMPLE_TETRA)
# Select initial medoids using the PAM BUILD algorithm.
initial_medoids = build(sample, 4).initialize()
# Create instance of K-Medoids (PAM) algorithm.
kmedoids_instance = kmedoids(sample, initial_medoids)
# Run cluster analysis
kmedoids_instance.process()
# Display clustering results to console.
print("Clusters:", kmedoids_instance.get_clusters())
print("Labels:", kmedoids_instance.get_labels())
print("Medoids:", kmedoids_instance.get_medoids())
print("Total Deviation:", kmedoids_instance.get_total_deviation())
print("Iterations:", kmedoids_instance.get_iterations())
# Display clustering results.
visualizer = cluster_visualizer()
visualizer.append_clusters(kmedoids_instance.get_clusters(), sample)
visualizer.append_cluster(initial_medoids, sample, markersize=120, marker='*', color='dimgray')
visualizer.append_cluster(kmedoids_instance.get_medoids(), sample, markersize=120, marker='*', color='black')
visualizer.show()
Output example:
Clusters: [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99], [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199], [200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299], [300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399]]
Labels: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
Medoids: [0, 100, 201, 300]
Total Deviation: 239.25250279395595
Iterations: 3
@kno10 , I would really appreciate it if you could provide the test code that you were using for performance testing.
@kno10 , I decided to repeat your experiment using sklearn.datasets.fetch_20newsgroups_vectorized.
Results of the original PAM without a distance matrix (a list of points as input) still aren't good:
Medoids: [6953, 10558, 489, 1034, 6910, 3522, 10246, 2692, 1965, 9385, 6535, 7433, 6540, 2760, 459, 10860, 6233, 4006, 484, 11189]
Total Deviation: 29.767144356027746
Iterations: 102
PAM BUILD: 36.062500 sec.
PAM: 2187.406250 sec.
PAM BUILD implementation: https://github.com/annoviko/pyclustering/blob/master/ccore/src/cluster/pam_build.cpp
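For context, BUILD greedily picks the point with the smallest total distance as the first medoid and then repeatedly adds the point that reduces TD the most. A rough Python sketch of the idea over a distance matrix (an illustration only, not the linked C++ code):
import numpy as np

def pam_build(D, k):
    # Greedy BUILD over a precomputed n x n distance matrix D; returns k medoid indices.
    n = D.shape[0]
    # First medoid: the point with the smallest total distance to all others.
    medoids = [int(np.argmin(D.sum(axis=1)))]
    # Current distance of every point to its nearest chosen medoid.
    nearest = D[:, medoids[0]].copy()
    while len(medoids) < k:
        # gain[j] = how much TD would drop if point j were added as a medoid.
        gain = np.maximum(nearest[:, None] - D, 0.0).sum(axis=0)
        gain[medoids] = -1.0  # never re-pick an existing medoid
        best = int(np.argmax(gain))
        medoids.append(best)
        nearest = np.minimum(nearest, D[:, best])
    return medoids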
I was using the following code to get the data and run it on PAM. Is it similar to what you were doing?
import time
from pyclustering.cluster.kmedoids import kmedoids, build
from sklearn.decomposition import TruncatedSVD
from sklearn.datasets import fetch_20newsgroups_vectorized
# Fetch the 20 newsgroups data set (TF-IDF vectorized).
data = fetch_20newsgroups_vectorized()
X, y = data['data'], data['target']
# Reduce dimensionality with TruncatedSVD (default: 2 components).
X = TruncatedSVD().fit_transform(X)
t_start = time.process_time()
initial_medoids = build(X, 20).initialize()
pam_build_time = time.process_time() - t_start
t_start = time.process_time()
kmedoids_instance = kmedoids(X, initial_medoids)
kmedoids_instance.process()
pam_time = time.process_time() - t_start
print("Medoids:", kmedoids_instance.get_medoids())
print("Total Deviation:", kmedoids_instance.get_total_deviation())
print("Iterations:", kmedoids_instance.get_iterations())
print("PAM BUILD: %f sec." % pam_build_time)
print("PAM: %f sec." % pam_time)
Please see https://colab.research.google.com/drive/1DNzGbQns5-kiyTVkDvAorcxqXZb5ukEI?usp=sharing It still appears to take a very long time. It does not appear to use ccore?
@kno10 , thanks, your performance test is different and its results look much better. By default, pyclustering always uses the C++ code. I have run it on my local machine and got the following results. Could you please take a look at them?
(11314, 130107)
(11314, 11314) took 7.680890083312988 s
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_build__amount', '_build__calculate_first_medoid', '_build__calculate_next_medoid', '_build__ccore', '_build__create_distance_calculator', '_build__data', '_build__data_type', '_build__distance_calculator', '_build__initialize_by_ccore', '_build__initialize_by_python', '_build__metric', '_build__verify_arguments', 'initialize']
Build took: {} 37.16072940826416
<pyclustering.cluster.kmedoids.kmedoids object at 0x000001B85C41F780> ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_kmedoids__calculate_swap_cost', '_kmedoids__ccore', '_kmedoids__clusters', '_kmedoids__create_distance_calculator', '_kmedoids__data_type', '_kmedoids__distance_calculator', '_kmedoids__distance_first_medoid', '_kmedoids__distance_second_medoid', '_kmedoids__erase_empty_clusters', '_kmedoids__initializer', '_kmedoids__iterations', '_kmedoids__itermax', '_kmedoids__labels', '_kmedoids__medoid_indexes', '_kmedoids__metric', '_kmedoids__pointer_data', '_kmedoids__swap_medoids', '_kmedoids__tolerance', '_kmedoids__total_deviation', '_kmedoids__update_clusters', '_kmedoids__verify_arguments', 'get_cluster_encoding', 'get_clusters', 'get_iterations', 'get_labels', 'get_medoids', 'get_total_deviation', 'predict', 'process']
Runtime: min=90973.16 mean=90973.16 ±nan
D:\programs\miniconda3\envs\pyclustering-environment\lib\site-packages\numpy\core\_methods.py:140: RuntimeWarning: Degrees of freedom <= 0 for slice
keepdims=keepdims)
D:\programs\miniconda3\envs\pyclustering-environment\lib\site-packages\numpy\core\_methods.py:130: RuntimeWarning: invalid value encountered in true_divide
ret, rcount, out=ret, casting='unsafe', subok=False)
Loss: min= 4994.75 mean= 4994.75 ±nan
N Iter: min= 2.00 mean= 2.00 ±nan
There is a big performance impact from the data transfer procedure between Python and C++. I wanted the C++ library to be completely independent from "python.h", and I have to pay for this now. I have added an additional time counter to the implementation to measure the data packing time and the result unpacking time on the Python side:
Data packing time for C++ on python side: 21.093750 sec.
Results unpacking from C++ on python side: 0.000000 sec.
Build took: {} 31.153797149658203
Thus, in the case of PAM BUILD, 21 seconds is data packing and 10 seconds is processing on the C++ side. The same holds for the PAM algorithm itself, so 42 seconds are used for data transfer only. In general, I would therefore expect 90 - 42 = 48 seconds for processing.
I think I will try to implement a Python-dependent interface and use templates in C++ in order to optimize this flow while keeping the C++ static library independent from Python.
FastPAM impresses me with its performance; I will implement it when I finish the classical PAM optimizations.
The resulting loss looks good, but the other PAMs used 3 iterations, not 2. This could be an off-by-one in counting, or it could be due to tolerance > 0 (IMHO, tolerance=0 is a reasonable default for "hard" clustering approaches such as k-medoids and k-means, just not for "soft" Gaussian Mixture Modeling). When I try to run the notebook on Google Colab, it still takes forever, and when I interrupt it, it appears to run in Python, not in ccore, based on the trace (in __calculate_next_medoid(self)). But I am not an expert on Python, so I don't know how to resolve this; I can only mention that it does not appear to use ccore.
The runtime on Google Colab was:
Build took: 2596859.432220459
Runtime: min=8129146.14 mean=8129146.14 ±nan
Loss: min= 4994.75 mean= 4994.75 ±nan
N Iter: min= 2.00 mean= 2.00 ±nan
So 43 minutes for BUILD + 92 min for SWAP. You may want to add a warning when the Python version is used on large data.
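One way to check from Python whether the compiled core is actually picked up would be the ccore_library helper (a sketch, assuming ccore_library.workable() behaves as in current master):
from pyclustering.core.wrapper import ccore_library

# If this prints False, pyclustering falls back to the pure-Python implementation,
# which is what the 43 + 92 minute runtimes above would suggest.
print("ccore available:", ccore_library.workable())
If it reports True, passing tolerance=0 (and ccore=True) to the kmedoids constructor would also rule the tolerance out as the cause of the 2 vs. 3 iteration difference, if I read the constructor parameters correctly.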
@kno10 , thanks for these results. I think it is much better to optimize the Python code as well; the problem is that I wasn't using numpy for processing in PAM BUILD and PAM itself, and I think that is doable. I will let you know when I update both implementations of the algorithm.
About the iterations: I am counting them in the following way (pseudo-code):
if (itermax > 0) then update_clusters();
for (iterations = 0; iterations < itermax; ) then
    iterations++;
    cost = swap_medoids();
    if (cost != NOTHING_TO_SWAP) then
        update_clusters();
So, I would say that I am counting the number of swap_medoids() calls.
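Roughly, in Python terms (a sketch only; update_clusters and swap_medoids are placeholders for the internal steps, and the early exit when nothing can be swapped is implied rather than shown above):
def count_iterations(swap_medoids, update_clusters, itermax, NOTHING_TO_SWAP=float('inf')):
    # Sketch of the counting scheme: one swap_medoids() pass = one counted iteration.
    iterations = 0
    if itermax > 0:
        update_clusters()
    while iterations < itermax:
        iterations += 1
        cost = swap_medoids()
        if cost == NOTHING_TO_SWAP:   # nothing left to swap -> stop
            break
        update_clusters()
    return iterations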