New functionalities for High Dimensionality problem and improved performance:
The improvements to the cluster library relate to:
- High Dimensionality problem
- improved performance, making clustering linear
High-dimensionality (HD) problems are those whose items have a high number of dimensions. There are two types of HD problems:
a) sets of items with a large number of dimensions; b) sets of items that each use a limited number of dimensions drawn from a much larger number of available dimensions.
For example, given the dimensions X, Y, Z, K, L, M, consider the items: item1=(X=2, Z=5, L=7), item2=(X=6, Y=5, M=7).
HD problems involve a high computation cost because their distance functions take more operations than those of low-dimensionality problems.
For case "b" (also valid for "a"), a new distance for HD problems is available: HDdistItems(), together with HDequals(). This distance function compares the dimensions of two items. Each dimension of item1 is searched for in item2; if it is found, the difference between the two values contributes to the distance (Manhattan style); if the dimension does not exist in item2, a maximum value is added to the total distance between item1 and item2.
There is no difference from the current usage::
    cl = KMeansClustering(users, HDdistItems, HDequals)
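The missing-dimension logic described above can be sketched as follows. This is a hypothetical re-implementation for illustration only (the names `hd_dist_items`, `hd_equals`, and `MISSING_DIM_PENALTY` are assumptions, not the library's actual code), using plain dicts to represent sparse items:

```python
# Illustrative sketch of the HD distance described above.
# Items are dicts mapping dimension name -> value, e.g. {"X": 2, "Z": 5, "L": 7}.

MISSING_DIM_PENALTY = 10  # assumed "maximum value" charged for an absent dimension


def hd_dist_items(item1, item2):
    """Manhattan-style distance over the dimensions of item1: shared
    dimensions contribute their absolute difference, and each dimension
    of item1 missing from item2 adds a fixed penalty."""
    total = 0
    for dim, value in item1.items():
        if dim in item2:
            total += abs(value - item2[dim])
        else:
            total += MISSING_DIM_PENALTY
    return total


def hd_equals(item1, item2):
    """Two sparse items are equal when they have the same dimensions and values."""
    return item1 == item2


item1 = {"X": 2, "Z": 5, "L": 7}
item2 = {"X": 6, "Y": 5, "M": 7}
print(hd_dist_items(item1, item2))  # |2-6| + penalty(Z) + penalty(L) = 24
```

Note that only the dimensions of item1 are examined, so this sketch is asymmetric, matching the description above.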
Additionally, the number of iterations can now be limited in order to save time. Experimentally, we have found that 10 iterations is accurate enough for most cases. The new HDgetClusters() function is linear: it avoids recalculating centroids on every move, whereas the original getClusters() is O(N*N) because it recalculates the centroid each time an item moves from one cluster to another. This new function can be used for both low- and high-dimensionality problems, increasing performance in both cases::
    solution = cl.HDgetclusters(numclusters, max_iterations)
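As a rough illustration of the linear scheme described above, the following sketch (hypothetical `hd_get_clusters`, not the library's actual implementation) assigns every item to its nearest centroid and recomputes centroids only once per iteration, instead of after every single move:

```python
import random


def hd_get_clusters(items, dist, centroid_fn, num_clusters, max_iterations):
    """Sketch of the linear scheme: each iteration makes one pass that
    assigns every item to its nearest centroid, then recomputes all
    centroids once, instead of after each individual move."""
    centroids = random.sample(items, num_clusters)
    for _ in range(max_iterations):
        clusters = [[] for _ in range(num_clusters)]
        for item in items:
            best = min(range(num_clusters), key=lambda i: dist(item, centroids[i]))
            clusters[best].append(item)
        # One centroid update per iteration; empty clusters keep their centroid.
        new_centroids = [centroid_fn(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # converged before the iteration limit
        centroids = new_centroids
    return clusters


random.seed(42)
items = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
clusters = hd_get_clusters(items, lambda a, b: abs(a - b),
                           lambda c: sum(c) / len(c), 2, 10)
# converges to one low-valued cluster and one high-valued cluster
```

Because each pass touches every item once per centroid, an iteration costs O(N*k), which is what keeps the scheme linear in N, in contrast to the move-by-move centroid recalculation.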
Another new optimization, inside the HDcentroid() function, is the use of the mean instead of the median in the centroid calculation. The median is more accurate but involves more computation when N is huge. HDcentroid() is invoked internally by HDgetclusters().
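A minimal sketch of a mean-based centroid over sparse dict items (the name `hd_centroid` and the choice to average each dimension over only the items that define it are illustrative assumptions, not the library's actual code):

```python
def hd_centroid(items):
    """Mean-based centroid for sparse dict items: for every dimension
    seen in the cluster, the centroid value is the mean of that
    dimension over the items that define it."""
    sums, counts = {}, {}
    for item in items:
        for dim, value in item.items():
            sums[dim] = sums.get(dim, 0) + value
            counts[dim] = counts.get(dim, 0) + 1
    return {dim: sums[dim] / counts[dim] for dim in sums}


print(hd_centroid([{"X": 2, "Z": 5}, {"X": 6, "Z": 7}]))  # {'X': 4.0, 'Z': 6.0}
```

The mean needs a single pass with constant work per value, whereas the median requires collecting and sorting each dimension's values, which is why it becomes costlier when N is huge.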
I'm currently at EuroPython. I'll look into it as soon as possible. In case you're here as well, we could meet up. On 20 Jul 2015 15:24, "jjaranda13" [email protected] wrote:
You can view, comment on, or merge this pull request online at:
https://github.com/exhuma/python-cluster/pull/19
This pull request only adds a change provided by my colleague (Juan Ramos).
It is a pity, but I am not at EuroPython :-(
Sorry for the very late reply. I've had a very crazy year 2017... it's slowly calming down and I will review this in the coming days.
Michel
Thanks very much for getting to this.
I've upgraded my environment to use version 1.4.1 of cluster and I can confirm it behaves identically to my hacked version.
This was very timely for me, as I am just about to package my own work for publication on PyPi.
Thanks again for a great library
Tim
On Mon, May 14, 2018 at 2:46 AM, Michel Albert [email protected] wrote:
Sorry for the late reply. I did not get any update e-mails from GitHub :(
The change in util.py is done.
I had another look at the code, and unfortunately the logic is too tightly coupled to your application logic ("users" and "keywords"). This means that the changes would only apply and work for your application.
I would be willing to work on this together if you are still interested in getting the changes into the library.
The main change would be to extract the application logic from the new functions and expose them via the function arguments.
Hi Michel Albert,
Thanks for your feedback and interest. We (Juan Ramos and I) have analyzed your issue and we propose a possible solution. Let us know if you like it:
"users" will not appear anymore; in its place we will put "items". A "user" is a "profile" composed of a number of (keyword, weight) pairs, which we can replace with (dimension, value) pairs. In a nutshell: users -> items, keyword -> dimension, weight -> value.
These changes, combined with corresponding changes to the comments, could be considered a generic approach. If you like it, we will make the modifications quickly (in one day).
Best regards