
K-means should stop at convergence point

Open ReveStobinson opened this issue 2 years ago • 2 comments

The daal4py kmeans function accepts a maxIterations argument, which implies an upper bound on the number of update steps the algorithm will perform while searching for the best solution. The kmeans_result object returned by this function also has an nIterations attribute, which is supposed to report the actual number of iterations taken.

I've noticed that, while most k-means implementations include some sort of convergence check, the daal4py implementation either has no such check, or the check fails to actually stop the clustering iterations.

You can see this for yourself with the short script below, which I adapted from the Python batch-processing example: it generates some points and then clusters them. If you increase the maxIter variable, the script takes much longer to run despite producing exactly the same results (i.e., it has converged!), and the reported nIterations is always equal to the input maxIter.

import daal4py as d4p
import numpy as np

def randomvector():
    return np.random.rand(3)*1000

vectors = []
for __ in range(10):
    v = randomvector()
    vectors.append(v)

vectors = np.asarray(vectors)

nClusters = 3
maxIter = 1234

initrain_algo = d4p.kmeans_init(nClusters, method="randomDense")
# Load the data
data = vectors
# compute initial centroids
initrain_result = initrain_algo.compute(data)
# The result provides the initial centroids
assert initrain_result.centroids.shape[0] == nClusters

# configure kmeans main object: we also request the cluster assignments
algo = d4p.kmeans(nClusters, maxIter, assignFlag=True)
# compute the clusters/centroids
result = algo.compute(data, initrain_result.centroids)

assert result.centroids.shape[0] == nClusters
assert result.assignments.shape == (data.shape[0], 1)
assert result.nIterations <= maxIter

print("\nFirst 10 cluster assignments:\n", result.assignments[0:10])
print("\nFirst 10 dimensions of centroids:\n", result.centroids[:, 0:10])
print("\nObjective function value:\n", result.objectiveFunction)
print("\nActual number of iterations taken:", result.nIterations)
print('All looks good!')

I initially suspected that nIterations might simply be echoing the input maxIter while the actual number of iterations differed, but I don't think that's the case: increasing maxIter always makes the run take longer (I could profile it, but there was no need; the difference was timeable with a stopwatch).

What I believe is actually happening is that the algorithm keeps updating even after it has already converged. Precision errors could also be involved, but if so, it is still a bug.

If a k-means solution is ever identical to the solution from the previous iteration, the algorithm can make no further progress, by the nature of the update rule, so the process should stop. If for some reason that is not the desired behavior for this kmeans implementation, then the maxIterations argument should simply be renamed to iterations, to avoid implying that the algorithm could converge before that count and stop iterating.
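To illustrate the kind of stopping rule I mean, here is a minimal pure-NumPy sketch of Lloyd's algorithm with an explicit convergence check (not daal4py's internals, just the generic technique): once the centroids stop moving, further iterations cannot change anything, so the loop returns early.

```python
import numpy as np

def kmeans_with_stop(data, centroids, max_iter):
    """Lloyd's algorithm that stops as soon as the centroids stop moving."""
    labels = None
    for it in range(1, max_iter + 1):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids (keep the old centroid if a cluster went empty)
        new_centroids = np.array([
            data[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            return new_centroids, labels, it  # converged before max_iter
        centroids = new_centroids
    return centroids, labels, max_iter

rng = np.random.default_rng(777)
data = rng.random((10, 3)) * 1000
init = data[rng.choice(10, size=3, replace=False)]
_, _, n_iter = kmeans_with_stop(data, init, max_iter=1234)
print(n_iter)  # typically a handful of iterations, nowhere near 1234
```

With this check in place, raising max_iter has no effect on runtime once the solution has stabilized, which is the behavior I would expect from maxIterations.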

ReveStobinson avatar Feb 17 '22 22:02 ReveStobinson

Hi @ReveStobinson, thanks for creating an issue, the problem will be analyzed.

agorshk avatar Mar 04 '22 06:03 agorshk

Hi @ReveStobinson, in daal4py the kmeans parameter accuracyThreshold (the threshold for terminating the algorithm) defaults to get_nan64(), so the algorithm runs through all iterations. If accuracyThreshold is set to 1e-4, as in scikit-learn, the algorithm stops early.

import daal4py as d4p
import numpy as np

np.random.seed(777)

def randomvector():
    return np.random.rand(3)*1000

vectors = []
for __ in range(10):
    v = randomvector()
    vectors.append(v)

vectors = np.asarray(vectors)

nClusters = 3
maxIter = 1234

initrain_algo = d4p.kmeans_init(nClusters, method="randomDense")
# Load the data
data = vectors
# compute initial centroids
initrain_result = initrain_algo.compute(data)
# The result provides the initial centroids
assert initrain_result.centroids.shape[0] == nClusters

# configure kmeans main object: we also request the cluster assignments
algo = d4p.kmeans(nClusters, maxIter, assignFlag=True, accuracyThreshold=1e-4)
# compute the clusters/centroids
result = algo.compute(data, initrain_result.centroids)

assert result.centroids.shape[0] == nClusters
assert result.assignments.shape == (data.shape[0], 1)
assert result.nIterations <= maxIter

print("\nFirst 10 cluster assignments:\n", result.assignments[0:10])
print("\nFirst 10 dimensions of centroids:\n", result.centroids[:, 0:10])
print("\nObjective function value:\n", result.objectiveFunction)
print("\nActual number of iterations taken:", result.nIterations)
print('All looks good!')

The output:

First 10 cluster assignments:
 [[2]
 [1]
 [1]
 [2]
 [0]
 [1]
 [2]
 [2]
 [1]
 [0]]

First 10 dimensions of centroids:
 [[421.36634883 261.63542248 902.6403091 ]
 [606.05836588 792.08433335 592.69943807]
 [312.9989213  289.90933661 139.45320124]]

Objective function value:
 [[591561.43638524]]

Actual number of iterations taken: [[2]]
All looks good!
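For comparison, here is a hedged sketch of the scikit-learn behavior the reply refers to: sklearn.cluster.KMeans uses tol=1e-4 by default and reports the actual iteration count in its n_iter_ attribute, so on small data like this it stops well before max_iter (the random data here mirrors the toy script above; it is not the same points).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(777)
X = rng.random((10, 3)) * 1000  # 10 random 3-D points, as in the script above

# tol=1e-4 is scikit-learn's default; max_iter mirrors the maxIter above
km = KMeans(n_clusters=3, max_iter=1234, n_init=1, tol=1e-4,
            random_state=0).fit(X)
print(km.n_iter_)  # actual iterations taken; far fewer than 1234
```

Passing accuracyThreshold=1e-4 to daal4py's kmeans, as in the fixed script above, reproduces this early-stopping behavior.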

lordoz234 avatar Mar 05 '22 11:03 lordoz234