
PAM does not implement PAM

Open kno10 opened this issue 4 years ago • 18 comments

The PAM algorithm is quite different from the k-means-style approach you implemented.

Any idea what reference you used? I am trying to figure out why so many use the name "PAM" for a different algorithm than the one in the original work. Is it described incorrectly in some book?

kno10 avatar Dec 26 '19 23:12 kno10

Can you provide a link/reference to what you think is the correct way to do PAM?

EdwardRaff avatar Dec 27 '19 00:12 EdwardRaff

BUILD and SWAP, as in the PAM paper.

Kaufman, L. and Rousseeuw, P.J. (1987), Clustering by means of Medoids, in Statistical Data Analysis Based on the L1–Norm and Related Methods, edited by Y. Dodge, North-Holland, 405–416.

BUILD is a greedy initialization that always adds the medoid that reduces TD most. SWAP searches the pair of objects (medoid,non-medoid), such that replacing an existing medoid with a non-medoid (hence keeping the number of clusters constant) yields the maximum decrease in TD; so similar to a steepest-descent approach. When no decrease can be found, the algorithm stops. As you can see, this is fairly different from k-means.

This finds much better results than the naive k-means-style approach.
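To make the distinction concrete, here is a minimal, unoptimized Python sketch of BUILD and SWAP as described above. The function names and the precomputed distance-matrix representation are illustrative assumptions, not JSAT's API or the original Fortran code:

```python
def total_deviation(dist, medoids):
    # TD: sum over all points of the distance to the nearest medoid.
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def pam_build(dist, k):
    # BUILD: greedily add the candidate whose addition lowers TD the most.
    n = len(dist)
    medoids = []
    while len(medoids) < k:
        best = min((c for c in range(n) if c not in medoids),
                   key=lambda c: total_deviation(dist, medoids + [c]))
        medoids.append(best)
    return medoids

def pam_swap(dist, medoids):
    # SWAP: steepest descent over all (medoid, non-medoid) exchanges;
    # stop when no exchange decreases TD.
    n = len(dist)
    medoids = list(medoids)
    while True:
        best_td, best_pair = total_deviation(dist, medoids), None
        for i in range(len(medoids)):
            for h in range(n):
                if h in medoids:
                    continue
                cand_td = total_deviation(dist, medoids[:i] + [h] + medoids[i+1:])
                if cand_td < best_td:
                    best_td, best_pair = cand_td, (i, h)
        if best_pair is None:
            return medoids  # no improving swap: converged
        medoids[best_pair[0]] = best_pair[1]
```

Note that evaluating each swap candidate by recomputing TD over all points is what the paper's bookkeeping avoids; this sketch only shows the search structure.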

kno10 avatar Dec 27 '19 01:12 kno10

I think what is in the paper is equivalent to the code I have for PAM.

BUILD basically describes just the seed-selection step, where the first seed is the medoid, and subsequent seeds are selected in a kind of greedy fashion (it's not clear to me from the original paper exactly how that happens). While I don't have this implemented in the exact same style, I would expect k-means++ selection to produce an initialization of comparable quality.

The SWAP step selects points that should move clusters via the most "profitable" pair, and the choices "do not depend upon the order", which means it's not a greedy selection updated on the fly, but one made after checking all points. This is what my current code does with the "k-means" style, as you call it.

Ignoring the exact nature of the BUILD step, since it's not specified very precisely, can you help me understand a scenario where what is implemented does not produce the same result as the PAM described in the paper?

EdwardRaff avatar Dec 27 '19 02:12 EdwardRaff

No, it is not equivalent. The SWAP procedure can reassign points to other clusters when swapping. Hence it finds much better improvements, in my experiments the results were even significantly better. Because of the discrete nature of the median, the k-means-style approach does not work well; it gets stuck easily. BUILD is very clearly specified - greedily add the next medoid that minimizes TD - and it is also very important. BUILD itself usually finds a better solution than the entire k-means-style procedure, and the PAM authors suggested that often you will not need to run SWAP at all. If you are missing detail in the first paper, you can check out the corresponding chapter in the later book: http://dx.doi.org/10.1002/9780470316801.ch2

kno10 avatar Dec 27 '19 12:12 kno10

My institution does not have access to Wiley; do you have any other references on the details? I find the original PAM paper very hard to parse for the details.

I’m still not understanding the difference. In the current code we reassign points to their closest cluster, apply the ownership changes, and then recompute the medoids. The paper says order does not matter, so it sounds like it’s reassigning points to their nearest clusters ("most profitable", as they state), and then we need to recompute medoids for things to make sense, yes?


EdwardRaff avatar Dec 27 '19 14:12 EdwardRaff

The new medoid will always be chosen so as to cover all currently assigned points. This misses good medoid candidates when points could instead easily move to neighboring clusters. The SWAP procedure does not have this restriction, hence it finds better (and more) changes. Just benchmark: get one of the ORlib data sets such as kmed20, run your procedure, and compute the cost. Then try the original PAM from the R cluster package (translated from Fortran to C at some point, but derived from the original code), and it will find substantially better solutions. I am preparing a paper right now that will include a very simple example of this (which should also prove that, in the worst case, the k-means-style approach can be arbitrarily worse than SWAP).

kno10 avatar Dec 27 '19 17:12 kno10

Erich, I’m just not on the same page in understanding how the SWAP procedure works. If your paper has this in more explicit detail I’m happy to add it to my reading list.

At the moment, I don’t plan on doing the tests you suggest. It’s a good amount of work and I’ve got other items higher on my priority list for JSAT. Especially since TRIKMEDS is implemented with the same kind of approach as the current PAM, but with n·sqrt(n) complexity. Their paper also mentions that it is the same approach as PAM, so I'm getting conflicting information on this.

If you have tested this previously and have this information well documented, and are interested in doing a pull request, I'm happy to work with you on that!

EdwardRaff avatar Dec 28 '19 17:12 EdwardRaff

My suggestion is to rename the class for now, and adjust documentation, to make clear it is not PAM.

The algorithm you implemented is known in the facility-location literature by the name "alternate", because it has these two alternating steps (and e.g. Teitz and Bart 1968, but also several other authors in that domain, observed that it performs rather poorly). So AlternateKMedians may be an appropriate name; or just KMedians, because that is how the ESL book introduced k-medians as a variation of k-means (they don't talk about quality, though...).
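For contrast, the "alternate" scheme just described can be sketched in Python as follows (illustrative names and distance-matrix representation; this is my own sketch, not JSAT's actual code):

```python
def alternate_kmedoids(dist, medoids):
    # Lloyd-style alternation: assign each point to its nearest medoid,
    # then recompute each cluster's medoid independently; repeat until
    # the medoid set stops changing. Assumes every medoid keeps at
    # least one member (true when points are distinct).
    medoids = list(medoids)
    n = len(dist)
    while True:
        # Assignment step: nearest medoid wins.
        clusters = {m: [] for m in medoids}
        for p in range(n):
            nearest = min(medoids, key=lambda m: dist[p][m])
            clusters[nearest].append(p)
        # Update step: the new medoid must cover its *current* members
        # only -- exactly the restriction discussed above, which is why
        # this gets stuck where PAM's SWAP would not.
        new_medoids = [min(members,
                           key=lambda c: sum(dist[c][p] for p in members))
                       for members in clusters.values()]
        if set(new_medoids) == set(medoids):
            return sorted(new_medoids)
        medoids = new_medoids
```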

In my experiments on the ORlib problems, the "alternate" algorithm with k-means++ initialization comes in at 37%. BUILD alone is at 3.14%, and BUILD+PAM at 0.51%; where 0% means always finding the optimum solution (which is known for the ORlib problems), and where 100% is the average of 100 random medoid choices. Running the k-means++/alternate combination 10 times and keeping the best result gets you down to 9% on average, but there is a large standard deviation across the data sets, too.

kno10 avatar Dec 28 '19 19:12 kno10

Thanks for pointing out the TRIKMEDS paper. I wasn't aware of it, but I have seen other papers by these authors, and they were very good; based on that and the abstract I can imagine how they use metric bounds here. They also confirm an observation that I made (that Park's initialization performs poorly), so I'll definitely cite this. I cannot include it in my current experiments, because my data is given as a graph and presumably does not fulfill the triangle inequality. They only mention PAM once, and do not claim to be doing the same thing. On the contrary, they write "In this paper we do not compare cluster qualities of previous algorithms, but focus on accelerating the lloyd equivalent for K-medoids"; so they are aware that PAM works differently. Also beware that "exact" only means they find the exact 1-medoid of each partition, not the optimum k-medoids partitioning, which is NP-hard. In the (cited) NIPS 2017 paper "K-Medoids For K-Means Seeding" they observe that CLARANS yields significantly better results than Lloyd-style k-medoids.

kno10 avatar Dec 28 '19 21:12 kno10

I just noticed that they also give an example (technically not really for k-medoids, though) where CLARANS can reach a better optimum by swapping than the Lloyd-style approach can. They use it to motivate using CLARANS for initializing k-means. Figure 1 in https://papers.nips.cc/paper/7104-k-medoids-for-k-means-seeding.pdf

kno10 avatar Dec 28 '19 21:12 kno10

I am not going to introduce a breaking change like renaming a file without understanding the situation.

I really appreciate your interest, but I am simply not making any changes until I understand what is "wrong". Again, if you are aware of any resources that provide more detailed math/pseudocode on the nature of SWAP and BUILD, I'm happy to add them to my reading list. I'm afraid that from your descriptions in this thread and the original paper, I don't understand what the distinction is in SWAP or how I would implement it. Perhaps your background/training makes this all very obvious to you, and if you would like to write code to fix this and show the difference in results, I'm happy to work with you on merging it. But it is simply not obvious/clear to me from what I have read thus far.

EdwardRaff avatar Dec 29 '19 18:12 EdwardRaff

  • The Wikipedia article describes swap in very easy language: https://en.wikipedia.org/wiki/K-medoids It's literally trying every combination of medoid and non-medoid and checking which swap (making the medoid a regular data point, and using the regular data point as the new medoid instead) yields the best decrease in cost, reassigning all points during this process (not, as with k-means, only considering points in the current cluster). But it doesn't discuss how to implement it efficiently.
  • The classic textbook Han, Pei, Kamber "Data Mining: Concepts and Techniques" also explains PAM briefly, but not completely correctly (it's not picking objects randomly, and it is not started randomly). It discusses the different cases that you need to handle.
  • The CLARANS paper also includes a detailed (2 page?) explanation of PAM (because they do a randomized approximation to PAM), again ignoring that the starting conditions are not random: Ng, R. T., & Han, J. (2002). CLARANS: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge & Data Engineering, (5), 1003-1016.

Likely the best source remains the book by the authors... I'm sure you can find a way to access it.

Given that your "PAM" currently isn't "PAM", I'd argue that renaming wouldn't break more than what already is broken. ;-) In particular since the quality of the results is pretty poor with the k-means-style algorithm.

kno10 avatar Dec 29 '19 20:12 kno10

I read the wiki page, and did not find it to be very comprehensible. But what you've just written is much more cogent. However, it also does not match what is listed as the PAM algorithm in the book you just mentioned, which I copy below: [screenshot: the book's PAM pseudocode]

The above also doesn't match the wording of the original paper, which states that order does not matter, whereas the above is a randomized process.

EdwardRaff avatar Dec 29 '19 20:12 EdwardRaff

Yes, PAM tries all of them, not randomly; and that excerpt doesn't include BUILD. Randomly is what CLARANS does. I only checked the Google Books preview, and saw that it discusses some four cases that you need to handle when computing the cost in line (5).

kno10 avatar Dec 29 '19 20:12 kno10

But you see how many wrong versions of PAM are in the literature and in software; which is really annoying when you want to benchmark and compare them, and half of them turn out to be a different algorithm that is fast but yields much worse results.

kno10 avatar Dec 29 '19 20:12 kno10

If you are trying to benchmark and compare, I totally understand the frustration. But please also understand my side: I have no clarity on what the real PAM is at the moment. Should I trust you, this book, the poorly worded original paper, or some other paper? So I will not be changing anything for some time while I both 1) do my own research on this, and 2) address other items on my TODO list that are higher priorities.

EdwardRaff avatar Dec 29 '19 20:12 EdwardRaff

You can check out the R version, which is as far as I can tell derived from the original Fortran code, but now includes extensions (pamonce > 0) and is not very readable (likely because of the Fortran-to-C code style, as you can tell from the use of "goto L60"). https://github.com/cran/cluster/blob/45695c0f1232711cd34bcaeba975f8a552e69f52/src/pam.c#L428-L445 https://github.com/cran/cluster/blob/45695c0f1232711cd34bcaeba975f8a552e69f52/src/pam.c#L469-L490 Or you can go to the actual Fortran version: https://github.com/cran/cluster/blob/25f9de803433055c4632edf5f1997758b73970b8/src/pam.f There you can see exactly what BUILD and SWAP are supposed to do.

kno10 avatar Dec 29 '19 20:12 kno10

And sorry, I beg to disagree that the original paper is unclear or poorly worded. Page 7 describes the algorithm quite precisely; there just is no pseudocode. It clearly states that BUILD selects the object "which decreases the sum (over all objects) of the dissimilarities to the most similar selected object as much as possible", i.e. the one which minimizes the total deviation if you add it to the medoid set. Similarly for SWAP, it states "considering all pairs of object (i,h) for which object i has been selected and h has not. It is determined what effect is obtained when a swap is carried out; i.e. when i is no longer selected as representative object but object h is. The most profitable pair is then indeed swapped." What seems to be missing, from a quick glance, is a discussion of how to implement this more efficiently than computing the cost in O(k(n-k)) for each of the k(n-k) combinations; it can be done in 'just' O(k(n-k)²) by caching, for each point, the distance to its nearest and second-nearest medoid.
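That caching idea can be sketched in Python as follows; this is my own hypothetical helper, not the paper's or the Fortran code's actual routine. With each point's nearest medoid and the distances to its nearest and second-nearest medoid precomputed, the TD change of one candidate swap costs a single O(n) pass:

```python
def swap_cost_change(dist, nearest_medoid, d_nearest, d_second, m_i, h):
    # Change in total deviation if medoid m_i is replaced by non-medoid h.
    # nearest_medoid[p]: index of p's closest medoid
    # d_nearest[p], d_second[p]: distances to p's closest and second-
    # closest medoids. A negative return value means the swap improves TD.
    delta = 0.0
    for p in range(len(dist)):
        d_ph = dist[p][h]
        if nearest_medoid[p] == m_i:
            # p loses its medoid: it moves to h or to its second-nearest.
            delta += min(d_ph, d_second[p]) - d_nearest[p]
        elif d_ph < d_nearest[p]:
            # p defects from its current medoid to the new medoid h.
            delta += d_ph - d_nearest[p]
    return delta
```

The full SWAP iteration would evaluate this for all k(n-k) pairs, perform the most profitable swap if its delta is negative, refresh the cached distances, and repeat.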

kno10 avatar Dec 29 '19 20:12 kno10