scikit-learn-intelex icon indicating copy to clipboard operation
scikit-learn-intelex copied to clipboard

daal4py patches make sklearn.preprocessing.tests.test_discretization::test_nonuniform_strategies fail

Open oleksandr-pavlyk opened this issue 5 years ago • 1 comments

Using scikit-learn 0.20.3, run

python -m daal4py -m pytest --disable-warnings --pyargs sklearn.preprocessing.tests.test_discretization::test_nonuniform_strategies

The test fails for strategy='kmeans' for 5 bins.

The following script replicates the issue:

# km.py
import numpy as np
import daal4py
import sklearn.cluster

X = np.array([0, 0.5, 2, 3, 9, 10]).reshape(-1, 1)
init = np.array([1, 3, 5, 7, 9], dtype=np.double).reshape(-1,1)

init_labels_ = np.array([np.argmin(np.square(X[k] - init).sum(axis=-1)) for k in range(X.shape[0])])
init_inertia_ = np.square(X - init[init_labels_]).sum()

cl0 = sklearn.cluster.KMeans(n_clusters = 5, n_init=1, init=init, algorithm='full', tol=1e-8, max_iter=1)
cl0.cluster_centers_ = init
assert np.all(np.equal(cl0.predict(X), init_labels_))

print("Initial configuration stats")
print("cluster_centers = ", init)
print("labels ", init_labels_)
print("inertia " , init_inertia_)

for k in range(1, 4):
    print("==" * 80)
    cl = sklearn.cluster.KMeans(n_clusters = 5, n_init=1, init=init, algorithm='full', tol=1e-8, max_iter=k).fit(X)

    print("cluster_centers = ",cl.cluster_centers_)
    print("labels ", cl.labels_)
    print("inertia ", cl.inertia_, np.square(X - cl.cluster_centers_[cl.labels_]).sum())
    print(cl.n_iter_)

Executing without patches python km.py and with patches as python -m daal4py km.py reveals the following

{{'iter': 0, 
    'orig': {'cc': np.array([[1],[3],[5],[7], [9]]), 'inertia': 3.25 }, 
    'daal4py' : {'cc': np.array([[1],[3],[5],[7], [9]]), 'inertia': 3.25}},
 {'iter': 1,
     'orig': {'cc': np.array([[5/6],[3],[10],[2], [9.5]]), 'inertia': 1 + 5/90. },
     'daal4py': {'cc': np.array([[5/6],[3],[0],[2], [9.5]]) , 'inertia': 6/10 + 1/90 }},  # daal4py choice seems better
{'iter': 2,
    'orig': {'cc': np.array([[ 1/4 ],[3],[10],[2], [9]]), 'inertia': 0.125},    # end configuration ends up better
    'daal4py': {'cc': np.array([[ 0.5 ],[3],[0],[2], [9.5]]), 'inertia': 0.5 }},
}

The discrepancy between two runs in how algorithms handle vacuous clusters. Label assignment for the initial configuration is [0, 0, 0, 1, 4, 4], meaning that centers 2 (initial position [5]) and 3 (initial position [7]) are vacuous.

Scikit-learn assigns observation [10] to center 2, and observation [2] for center 3. Daal4py assigns observations [0] to center 2, and observation [2] for center 3.

@jeremiedbb Could you comment, please. My stance would be that scikit-learn should not use denser dataset, which would avoid the issue of vacuous clusters for KMeans in the test for discretizer.

oleksandr-pavlyk avatar Mar 08 '19 15:03 oleksandr-pavlyk

I encountered the same issue. The 5 bins test was added in 0.20.3.

I agree that the test might be a bit ambiguous because it involves empty clusters and ties in distances.

However, it's not wrong, assuming the strategy for relocating empty clusters that sklearn uses.

The strategy is as follows:

  • once the labels are computed, count how many empty clusters, say N

  • compute the distances between each point and its cluster center (before the center is updated)

  • For the N points most far away from their cluster center, in decending order, relocate a center, in ascending order of their indices, to this point, and remove this point from it's previous cluster.

I'm not found of this test either, however the discretizer will most likely found empty clusters in every situation due to the initial centers it produces. So I think we need to have a deterministic way of handling empty clusters if we want to keep consistency between sklearn vanilla and daal4py.

On 08/03/2019 16:27, Oleksandr Pavlyk wrote:

Using scikit-learn 0.20.3, run

|python -m daal4py -m pytest --disable-warnings --pyargs sklearn.preprocessing.tests.test_discretization::test_nonuniform_strategies |

The test fails for strategy='kmeans' for 5 bins.

The following script replicates the issue:

km.py

import numpyas np import daal4py import sklearn.cluster

X= np.array([0,0.5,2,3,9,10]).reshape(-1,1) init= np.array([1,3,5,7,9],dtype=np.double).reshape(-1,1)

init_labels_= np.array([np.argmin(np.square(X[k]- init).sum(axis=-1))for kin range(X.shape[0])]) init_inertia_= np.square(X- init[init_labels_]).sum()

cl0= sklearn.cluster.KMeans(n_clusters = 5,n_init=1,init=init,algorithm='full',tol=1e-8,max_iter=1) cl0.cluster_centers_= init assert np.all(np.equal(cl0.predict(X), init_labels_))

print("Initial configuration stats") print("cluster_centers = ", init) print("labels ", init_labels_) print("inertia " , init_inertia_)

for kin range(1,4): print("==" * 80) cl= sklearn.cluster.KMeans(n_clusters = 5,n_init=1,init=init,algorithm='full',tol=1e-8,max_iter=k).fit(X)

 print("cluster_centers = ",cl.cluster_centers_)
 print("labels ", cl.labels_)
 print("inertia ", cl.inertia_, np.square(X-  cl.cluster_centers_[cl.labels_]).sum())
 print(cl.n_iter_)

Executing without patches |python km.py| and with patches as |python -m daal4py km.py| reveals the following

|{{'iter': 0, 'orig': {'cc': np.array([[1],[3],[5],[7], [9]]), 'inertia': 3.25 }, 'daal4py' : {'cc': np.array([[1],[3],[5],[7], [9]]), 'inertia': 3.25}}, {'iter': 1, 'orig': {'cc': np.array([[5/6],[3],[10],[2], [9.5]]), 'inertia': 1 + 5/90. }, 'daal4py': {'cc': np.array([[5/6],[3],[0],[2], [9.5]]) , 'inertia': 6/10 + 1/90 }}, # daal4py choice seems better {'iter': 2, 'orig': {'cc': np.array([[ 1/4 ],[3],[10],[2], [9]]), 'inertia': 0.125}, # end configuration ends up better 'daal4py': {'cc': np.array([[ 0.5 ],[3],[0],[2], [9.5]]), 'inertia': 0.5 }}, } |

The discrepancy between two runs in how algorithms handle vacuous clusters. Label assignment for the initial configuration is |[0, 0, 0, 1, 4, 4]|, meaning that centers 2 (initial position [5]) and 3 (initial position [7]) are vacuous.

Scikit-learn assigns observation |[10]| to center 2, and observation |[2]| for center 3. Daal4py assigns observations |[0]| to center 2, and observation |[2]| for center 3.

@jeremiedbb https://github.com/jeremiedbb Could you comment, please. My stance would be that scikit-learn should not use denser dataset, which would avoid the issue of vacuous clusters for KMeans in the test for discretizer.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/IntelPython/daal4py/issues/69, or mute the thread https://github.com/notifications/unsubscribe-auth/AhDVvd0j79GIPLFdtcwDRS6yC3FgjIEiks5vUoFwgaJpZM4blkVP.

jeremiedbb avatar Mar 08 '19 15:03 jeremiedbb