Better PAM initialization with BUILD
The PAM/k-medoids implementation appears to implement SWAP, but not the BUILD part for initializing PAM. Instead you have to provide good starting medoids.
I tried benchmarking it on a larger data set (the well-known 20news data set in the sklearn.datasets.fetch_20newsgroups_vectorized version, with cosine distance and k=20), and the runtime of pyclustering kmedoids was extremely high (2000 sec), likely because of the poor random initialization.
With BUILD from the kmedoids package, I can reduce the run time to 338 sec. In comparison, pam from the kmedoids package takes just 37 seconds, including 4.6 seconds for BUILD, and the fasterpam variant finishes in 336 msec with random initialization.
It would also be good to get access to the final loss after optimization as well as the number of iterations. Then I could check if it ran into the maximum iteration limit, and if at least the result quality is comparable.
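For reference, a minimal sketch of that setup, assuming pyclustering's distance-matrix input mode (data_type='distance_matrix') and plain random starting medoids; this is illustrative only, not my exact benchmark script:
import random
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.metrics import pairwise_distances
from pyclustering.cluster.kmedoids import kmedoids

X = fetch_20newsgroups_vectorized()['data']                # sparse TF-IDF matrix
D = pairwise_distances(X, metric='cosine')                 # n x n cosine distance matrix
initial_medoids = random.sample(range(D.shape[0]), 20)     # random initialization, k=20
kmedoids_instance = kmedoids(D, initial_medoids, data_type='distance_matrix')
kmedoids_instance.process()
print("Medoids:", kmedoids_instance.get_medoids())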
Hello @kno10 ,
There is a k-means++ based algorithm to initialize it, which you refer to in your article:
In the experiments, we will also study whether a clever sampling-based approach similar to k-means++ (Arthur and Vassilvitskii, 2007) that needs only O(nk) time but produces a worse starting point is an interesting alternative.
It could be used to choose the initial medoids, and I expected that only the quality would be affected. Nevertheless, your outcome looks like a good argument for implementing the PAM BUILD algorithm for initialization. I am not sure whether it is convenient to make it part of the interface of the PAM algorithm.
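For example, a minimal sketch of that route, assuming the kmeans_plusplus_initializer from pyclustering.cluster.center_initializer and its return_index option:
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import FCPS_SAMPLES

sample = read_sample(FCPS_SAMPLES.SAMPLE_TETRA)
# Pick k = 4 initial medoids as point indices using k-means++ style sampling.
initial_medoids = kmeans_plusplus_initializer(sample, 4).initialize(return_index=True)
kmedoids_instance = kmedoids(sample, initial_medoids)
kmedoids_instance.process()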
Also, I have a question for you: did you use an installed pyclustering version or just a cloned version? In case of an installed version, the C++ implementation of the algorithm should be used; in case of a cloned version, the library uses the Python implementation due to the lack of binaries in the git repository.
It would also be good to get access to the final loss after optimization as well as the number of iterations
The number of iterations is clear. But I have a question about the final loss after optimization: do you mean total deviation (TD)?
I tried using your k-means++ initialization, but I believe it only accepts a data matrix, not a distance matrix, as input.
For that experiment I used "pip install pyclustering" on Google Colab, whatever that does by default.
Yes, TD (total deviation) is the loss.
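To spell it out, TD is the sum of distances from every point to its nearest medoid; a minimal sketch over a precomputed distance matrix D and a list of medoid indices (illustration only):
import numpy as np

def total_deviation(D, medoids):
    # TD: for each point take the distance to its closest medoid, then sum.
    return float(np.min(D[:, medoids], axis=1).sum())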
Thank you for the clarification. One more question:
I tried using your k-means++ initialization, but I believe it only accepts a data matrix, not a distance matrix, as input.
Was a distance matrix used as an input for the algorithm in the experiment?
Hello @kno10 ,
I have introduced the PAM BUILD algorithm in line with your article, and I have optimized the C++ version (which should be used by default) of the K-Medoids (PAM) algorithm. I have attached a pyclustering package (pyclustering-0.11.0.tar.gz) with all binaries, built from the master branch (not a release version), where these changes are available. Could you please check it again?
Just in case: K-Means++ supports distance matrices now. Let me know if you need a code example as well.
Here is an example of how to use K-Medoids with PAM BUILD:
from pyclustering.cluster.kmedoids import kmedoids, build
from pyclustering.cluster import cluster_visualizer
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import FCPS_SAMPLES
# Load list of points `Tetra` for cluster analysis.
sample = read_sample(FCPS_SAMPLES.SAMPLE_TETRA)
# Select initial medoids using the PAM BUILD algorithm.
initial_medoids = build(sample, 4).initialize()
# Create instance of K-Medoids (PAM) algorithm.
kmedoids_instance = kmedoids(sample, initial_medoids)
# Run cluster analysis
kmedoids_instance.process()
# Display clustering results to console.
print("Clusters:", kmedoids_instance.get_clusters())
print("Labels:", kmedoids_instance.get_labels())
print("Medoids:", kmedoids_instance.get_medoids())
print("Total Deviation:", kmedoids_instance.get_total_deviation())
print("Iterations:", kmedoids_instance.get_iterations())
# Display clustering results.
visualizer = cluster_visualizer()
visualizer.append_clusters(kmedoids_instance.get_clusters(), sample)
visualizer.append_cluster(initial_medoids, sample, markersize=120, marker='*', color='dimgray')
visualizer.append_cluster(kmedoids_instance.get_medoids(), sample, markersize=120, marker='*', color='black')
visualizer.show()
Output example:
Clusters: [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99], [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199], [200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299], [300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399]]
Labels: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
Medoids: [0, 100, 201, 300]
Total Deviation: 239.25250279395595
Iterations: 3
@kno10 , I would really appreciate it if you could provide the test code that you were using for performance testing.
@kno10 , I decided to repeat your experiment using sklearn.datasets.fetch_20newsgroups_vectorized.
Results of the original PAM without a distance matrix (a list of points as input) still aren't good:
Medoids: [6953, 10558, 489, 1034, 6910, 3522, 10246, 2692, 1965, 9385, 6535, 7433, 6540, 2760, 459, 10860, 6233, 4006, 484, 11189]
Total Deviation: 29.767144356027746
Iterations: 102
PAM BUILD: 36.062500 sec.
PAM: 2187.406250 sec.
PAM BUILD implementation: https://github.com/annoviko/pyclustering/blob/master/ccore/src/cluster/pam_build.cpp
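For context, BUILD greedily picks the point with the smallest total distance as the first medoid and then repeatedly adds the point that reduces TD the most. A rough Python sketch of the idea over a distance matrix (an illustration only, not the linked C++ code):
import numpy as np

def pam_build(D, k):
    # Greedy BUILD over a precomputed n x n distance matrix D; returns k medoid indices.
    n = D.shape[0]
    # First medoid: the point with the smallest total distance to all others.
    medoids = [int(np.argmin(D.sum(axis=1)))]
    # Current distance of every point to its nearest chosen medoid.
    nearest = D[:, medoids[0]].copy()
    while len(medoids) < k:
        # gain[j] = how much TD would drop if point j were added as a medoid.
        gain = np.maximum(nearest[:, None] - D, 0.0).sum(axis=0)
        gain[medoids] = -1.0  # never re-pick an existing medoid
        best = int(np.argmax(gain))
        medoids.append(best)
        nearest = np.minimum(nearest, D[:, best])
    return medoids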
I was using the following code to get the data and run it on PAM. Is it similar to what you were doing?
import time
from pyclustering.cluster.kmedoids import kmedoids, build
from sklearn.decomposition import TruncatedSVD
from sklearn.datasets import fetch_20newsgroups_vectorized
# Fetch the 20 newsgroups data set (TF-IDF vectorized).
data = fetch_20newsgroups_vectorized()
X, y = data['data'], data['target']
# Reduce dimensionality with TruncatedSVD (default: 2 components).
X = TruncatedSVD().fit_transform(X)
t_start = time.process_time()
initial_medoids = build(X, 20).initialize()
pam_build_time = time.process_time() - t_start
t_start = time.process_time()
kmedoids_instance = kmedoids(X, initial_medoids)
kmedoids_instance.process()
pam_time = time.process_time() - t_start
print("Medoids:", kmedoids_instance.get_medoids())
print("Total Deviation:", kmedoids_instance.get_total_deviation())
print("Iterations:", kmedoids_instance.get_iterations())
print("PAM BUILD: %f sec." % pam_build_time)
print("PAM: %f sec." % pam_time)
Please see https://colab.research.google.com/drive/1DNzGbQns5-kiyTVkDvAorcxqXZb5ukEI?usp=sharing It still appears to take a very long time. It does not appear to use ccore?
@kno10 , thanks, your performance test is different and its results look much better. By default, pyclustering always uses the C++ code. I have run it on my local machine and got the following results. Could you please take a look at them?
(11314, 130107)
(11314, 11314) took 7.680890083312988 s
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_build__amount', '_build__calculate_first_medoid', '_build__calculate_next_medoid', '_build__ccore', '_build__create_distance_calculator', '_build__data', '_build__data_type', '_build__distance_calculator', '_build__initialize_by_ccore', '_build__initialize_by_python', '_build__metric', '_build__verify_arguments', 'initialize']
Build took: {} 37.16072940826416
<pyclustering.cluster.kmedoids.kmedoids object at 0x000001B85C41F780> ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_kmedoids__calculate_swap_cost', '_kmedoids__ccore', '_kmedoids__clusters', '_kmedoids__create_distance_calculator', '_kmedoids__data_type', '_kmedoids__distance_calculator', '_kmedoids__distance_first_medoid', '_kmedoids__distance_second_medoid', '_kmedoids__erase_empty_clusters', '_kmedoids__initializer', '_kmedoids__iterations', '_kmedoids__itermax', '_kmedoids__labels', '_kmedoids__medoid_indexes', '_kmedoids__metric', '_kmedoids__pointer_data', '_kmedoids__swap_medoids', '_kmedoids__tolerance', '_kmedoids__total_deviation', '_kmedoids__update_clusters', '_kmedoids__verify_arguments', 'get_cluster_encoding', 'get_clusters', 'get_iterations', 'get_labels', 'get_medoids', 'get_total_deviation', 'predict', 'process']
Runtime: min=90973.16 mean=90973.16 ±nan
D:\programs\miniconda3\envs\pyclustering-environment\lib\site-packages\numpy\core\_methods.py:140: RuntimeWarning: Degrees of freedom <= 0 for slice
keepdims=keepdims)
D:\programs\miniconda3\envs\pyclustering-environment\lib\site-packages\numpy\core\_methods.py:130: RuntimeWarning: invalid value encountered in true_divide
ret, rcount, out=ret, casting='unsafe', subok=False)
Loss: min= 4994.75 mean= 4994.75 ±nan
N Iter: min= 2.00 mean= 2.00 ±nan
There is a big performance impact from the data transfer procedure between Python and C++. I wanted the C++ library to be completely independent from "python.h", and I have to pay for this now. I have added an additional time counter to the implementation to measure the data packing time and the result unpacking time on the Python side:
Data packing time for C++ on python side: 21.093750 sec.
Results unpacking from C++ on python side: 0.000000 sec.
Build took: {} 31.153797149658203
Thus, in the case of PAM BUILD, 21 seconds is data packing and 10 seconds is processing on the C++ side. The same holds for the PAM algorithm itself, so 42 seconds are used for data transfer only. In general, I would therefore expect 90 - 42 = 48 seconds for processing.
I think I will try to implement a Python-dependent interface and use templates in C++ in order to optimize this flow while keeping the C++ static library independent from Python.
FastPAM impresses me with its performance; I will implement it when I finish the classical PAM optimizations.
The resulting loss looks good, but the other PAMs used 3 iterations, not 2. This could be an off-by-one in counting, or it could be due to tolerance > 0 (IMHO, tolerance=0 is a reasonable default for "hard" clustering approaches such as k-medoids and k-means, just not for "soft" Gaussian Mixture Modeling). When I try to run the notebook on Google Colab, it still takes forever, and when I interrupt it, it appears to run in Python, not in ccore, based on the trace (in __calculate_next_medoid(self)). But I am not an expert on Python, so I don't know how to resolve this; I can only mention that it does not appear to use ccore.
The runtime on Google Colab was:
Build took: 2596859.432220459
Runtime: min=8129146.14 mean=8129146.14 ±nan
Loss: min= 4994.75 mean= 4994.75 ±nan
N Iter: min= 2.00 mean= 2.00 ±nan
So 43 minutes for BUILD + 92 min for SWAP. You may want to add a warning when the Python version is used on large data.
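One way to check from Python whether the compiled core is actually picked up would be the ccore_library helper (a sketch, assuming ccore_library.workable() behaves as in current master):
from pyclustering.core.wrapper import ccore_library

# If this prints False, pyclustering falls back to the pure-Python implementation,
# which is what the 43 + 92 minute runtimes above would suggest.
print("ccore available:", ccore_library.workable())
If it reports True, passing tolerance=0 (and ccore=True) to the kmedoids constructor would also rule the tolerance out as the cause of the 2 vs. 3 iteration difference, if I read the constructor parameters correctly.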
@kno10 , thanks for these results. I think it is much better to optimize the Python code as well; the problem is that I wasn't using numpy for processing in PAM BUILD and PAM itself, and I think that is doable. I will let you know when I update both implementations of the algorithm.
About the iterations: I am counting them in the following way (pseudo-code):
if (itermax > 0) then update_clusters();
for (iterations = 0; iterations < itermax; ) then
    iterations++;
    cost = swap_medoids();
    if (cost != NOTHING_TO_SWAP) then
        update_clusters();
So, I would say that I am counting the number of swap_medoids() calls.
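Roughly, in Python terms (a sketch only; update_clusters and swap_medoids are placeholders for the internal steps, and the early exit when nothing can be swapped is implied rather than shown above):
def count_iterations(swap_medoids, update_clusters, itermax, NOTHING_TO_SWAP=float('inf')):
    # Sketch of the counting scheme: one swap_medoids() pass = one counted iteration.
    iterations = 0
    if itermax > 0:
        update_clusters()
    while iterations < itermax:
        iterations += 1
        cost = swap_medoids()
        if cost == NOTHING_TO_SWAP:   # nothing left to swap -> stop
            break
        update_clusters()
    return iterations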