
Feat: k_means Algorithm


K-Means Clustering Algorithm – Summary Review

Overview

K-Means is a centroid-based clustering algorithm that partitions n data points into k clusters, each represented by the mean (centroid) of its members. It minimizes within-cluster variance and is known for its simplicity and speed.
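Formally, given clusters $C_1, \dots, C_k$ with centroids $\mu_j$, it minimizes the within-cluster sum of squares (the "inertia" reported below):

$$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$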


Algorithm Summary

  1. Initialize: Choose k centroids (random or k-means++)
  2. Assign: Each point → nearest centroid
  3. Update: Recalculate centroids as mean of assigned points
  4. Repeat: Until centroids stabilize or max iterations reached
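A minimal NumPy sketch of these four steps (random initialization for brevity; empty clusters and other edge cases are not handled):

```python
import numpy as np

def k_means(X, k, max_iter=300, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat: stop once total centroid movement drops below tol
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
```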

Key Traits:

  • Uses centroids (means)
  • Minimizes intra-cluster distance
  • Iterative refinement until convergence

Complexity

  • Time: O(n × k × d × i), where n = points, k = clusters, d = dimensions, i = iterations
  • Space: O(n × d + k × d)
  • Scalability: Linear in n, so it scales well to large datasets

Initialization Methods

  • K-Means++ (Recommended): Spreads initial centroids apart for faster, more stable convergence; seeding costs O(n × k × d)
  • Random: Simpler but less stable
  • Custom: User-defined centroids
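The k-means++ seeding step can be sketched as follows (function name is illustrative): each new centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen, which pushes the seeds apart.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first seed: uniform random point
    for _ in range(k - 1):
        # squared distance from every point to its nearest chosen centroid
        d2 = np.min(((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2),
                    axis=1)
        # sample the next centroid with probability proportional to d2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```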

Quality Metrics

| Metric | Formula / Concept | Ideal Value |
| --- | --- | --- |
| Inertia | Σᵢ ‖xᵢ − μ_c(i)‖² (within-cluster sum of squares) | Lower = better |
| Silhouette Score | (b(i) − a(i)) / max(a(i), b(i)) | Close to 1 |
| Davies-Bouldin Index | Avg. cluster similarity ratio | Lower = better |
| Calinski-Harabasz Index | Between / within variance ratio | Higher = better |
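All four metrics are available off the shelf in scikit-learn (inertia as a fitted attribute, the rest in `sklearn.metrics`); a quick check on synthetic data might look like:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

print("Inertia:          ", km.inertia_)                             # lower = better
print("Silhouette:       ", silhouette_score(X, km.labels_))         # close to 1
print("Davies-Bouldin:   ", davies_bouldin_score(X, km.labels_))     # lower = better
print("Calinski-Harabasz:", calinski_harabasz_score(X, km.labels_))  # higher = better
```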

Key Methods

| Method | Description |
| --- | --- |
| fit(X) | Train model |
| predict(X) | Assign clusters to new points |
| fit_predict(X) | Train and predict in one call |
| transform(X) | Distances from each point to each centroid |
| silhouette_score() | Cluster quality metric |
| get_centroids() | Retrieve final centroids |
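Assuming the proposed class exposes exactly these methods (names taken from the table above; signatures are a guess, and `X_train` / `X_new` are placeholder arrays), a typical session would be:

```python
model = KMeans(n_clusters=3)      # the proposed class, per the table above
model.fit(X_train)                # learn centroids from training data
labels = model.predict(X_new)     # nearest-centroid assignment for new points
dists = model.transform(X_new)    # per-point distances to every centroid
centers = model.get_centroids()   # final centroid coordinates
```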

Advantages ✅

  1. Fast & Scalable — Efficient for large datasets
  2. Simple & Intuitive — Easy to interpret geometrically
  3. Guaranteed Convergence — Always reaches a local minimum
  4. Flexible — Works in any number of dimensions
  5. Memory Efficient — Linear space complexity

Disadvantages ❌

  1. Requires predefined k
  2. Sensitive to initialization (mitigated by k-means++)
  3. Assumes spherical clusters
  4. Sensitive to outliers
  5. Can converge to local minima
  6. Scale-dependent — Requires feature standardization

Best Practices

  • Standardize data: e.g. X_scaled = StandardScaler().fit_transform(X), so no single feature dominates the distance metric
  • Find optimal k: Use the elbow or silhouette method (see the sketch below)
  • Use k-means++: Better initialization
  • Set tolerance: tol = 1e-4, and limit iterations with max_iter = 300
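A sketch of silhouette-based k selection with scikit-learn (the candidate range is illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(X, k_range=range(2, 11)):
    # score each candidate k by mean silhouette; the highest score wins
    scores = {k: silhouette_score(X, KMeans(n_clusters=k, init="k-means++",
                                            n_init=10, random_state=0).fit_predict(X))
              for k in k_range}
    return max(scores, key=scores.get), scores
```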

Use Cases

✅ Customer segmentation
✅ Image compression (color quantization; see the sketch after this list)
✅ Document clustering
✅ Anomaly detection
✅ Exploratory data analysis

❌ Avoid for: non-spherical clusters, mixed data, or many outliers
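To make the image-compression case concrete, a minimal color-quantization sketch (the file names and the 16-color palette are placeholders; an RGB input is assumed):

```python
import numpy as np
from PIL import Image
from sklearn.cluster import MiniBatchKMeans

img = np.asarray(Image.open("photo.jpg"))        # placeholder input image
pixels = img.reshape(-1, 3).astype(np.float64)   # one row per pixel (RGB)

km = MiniBatchKMeans(n_clusters=16, random_state=0).fit(pixels)

# replace every pixel with the color of its nearest centroid
quantized = km.cluster_centers_[km.labels_].astype(np.uint8).reshape(img.shape)
Image.fromarray(quantized).save("photo_16colors.png")
```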


Performance Tips

  • Large datasets: Use Mini-Batch K-Means for faster results
  • High dimensions: Apply PCA or feature selection
  • Early stopping: Use relaxed tolerance (e.g., tol = 1e-3)
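For instance, switching to scikit-learn's MiniBatchKMeans with a relaxed tolerance (parameter values below are illustrative, and `X_large` is a placeholder dataset):

```python
from sklearn.cluster import MiniBatchKMeans

# updates centroids from small random batches instead of full passes,
# trading a little accuracy for a large speedup on big datasets
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, tol=1e-3, random_state=0)
labels = mbk.fit_predict(X_large)
```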

Comparison with Other Algorithms

| Algorithm | Time | Cluster Shape | Outlier Robustness | k Required |
| --- | --- | --- | --- | --- |
| K-Means | O(nkdi) | Spherical | Low | Yes |
| K-Medoids | O(k(n−k)²i) | Any | High | Yes |
| DBSCAN | O(n log n) | Arbitrary | High | No |
| GMM | O(nkdi) | Elliptical | Medium | Yes |
| Hierarchical | O(n³) | Any | Medium | No |

When to Use

✅ Large datasets
✅ Well-separated spherical clusters
✅ Continuous numeric data
✅ Need quick, interpretable results

When to Avoid

❌ Unknown k
❌ Non-spherical or overlapping clusters
❌ Heavy noise or outliers
❌ Categorical or mixed data


Verdict ⭐⭐⭐⭐☆ (4/5)

K-Means is a fast, scalable, and easy-to-use clustering algorithm ideal for large datasets and exploratory tasks.
However, it’s sensitive to initialization, scale, and outliers.

Strengths: Speed, simplicity, scalability
Weaknesses: Sensitive to initialization/outliers
Best For: Customer segmentation, image compression, initial data exploration
