Feat: k_means Algorithm
K-Means Clustering Algorithm – Summary Review
Overview
K-Means is a centroid-based clustering algorithm that partitions n data points into k clusters, each represented by the mean (centroid) of its members. It iteratively minimizes within-cluster variance (inertia) and is known for its simplicity and speed.
Algorithm Summary
- Initialize: Choose k centroids (random or k-means++)
- Assign: Each point → nearest centroid
- Update: Recalculate centroids as mean of assigned points
- Repeat: Until centroids stabilize or the maximum number of iterations is reached (see the sketch below)
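These four steps map almost line-for-line onto a minimal NumPy implementation. The sketch below is illustrative only (random initialization, Euclidean distance; the function name and defaults are assumptions, not this repo's API):

```python
import numpy as np

def k_means(X, k, max_iter=300, tol=1e-4, seed=0):
    """Minimal Lloyd's iteration: assign points, update means, repeat."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Repeat: stop once the centroids have stabilized
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels
```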
Key Traits:
- Uses centroids (means)
- Minimizes intra-cluster distance
- Iterative refinement until convergence
Complexity
- Time: O(n × k × d × i), where n = points, k = clusters, d = dimensions, i = iterations
- Space: O(n × d + k × d)
- Scalability: Excellent for large datasets
Initialization Methods
- K-Means++ (Recommended): Spreads initial centroids apart for faster, more stable convergence; O(n × k × d) initialization cost (sketched after this list)
- Random: Simpler but less stable
- Custom: User-defined centroids
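For intuition, here is a minimal sketch of k-means++ seeding: each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far (the function name is illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: spread initial centroids apart probabilistically."""
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```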
Quality Metrics
| Metric | Formula / Concept | Ideal Value |
|---|---|---|
| Inertia | Σ_k Σ_{x ∈ C_k} ||x − μ_k||² (within-cluster sum of squares) | Lower = better |
| Silhouette Score | (b(i) - a(i)) / max(a(i), b(i)) | Close to 1 |
| Davies-Bouldin Index | Avg cluster similarity ratio | Lower = better |
| Calinski-Harabasz Index | Between / Within variance ratio | Higher = better |
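All four metrics are available through scikit-learn (inertia as the fitted model's `inertia_` attribute, the rest in `sklearn.metrics`); a quick check on toy data might look like this:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

print("Inertia:", km.inertia_)                                        # lower = better
print("Silhouette:", silhouette_score(X, km.labels_))                 # close to 1
print("Davies-Bouldin:", davies_bouldin_score(X, km.labels_))         # lower = better
print("Calinski-Harabasz:", calinski_harabasz_score(X, km.labels_))   # higher = better
```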
Key Methods
| Method | Description |
|---|---|
| fit(X) | Train model |
| predict(X) | Assign clusters |
| fit_predict(X) | Train + predict |
| transform(X) | Distances to centroids |
| silhouette_score() | Cluster quality |
| get_centroids() | Retrieve final centroids |
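This table mirrors scikit-learn's `KMeans` interface, except `get_centroids()`, which appears to be this implementation's own accessor (scikit-learn exposes the same information as the `cluster_centers_` attribute). A usage sketch against scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(42).normal(size=(300, 2))  # toy data

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)        # fit(X) + predict(X) in one step
dists = km.transform(X)           # distance from each point to each centroid
centroids = km.cluster_centers_   # scikit-learn's counterpart to get_centroids()
print(silhouette_score(X, labels))
```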
Advantages ✅
- Fast & Scalable — Efficient for large datasets
- Simple & Intuitive — Easy to interpret geometrically
- Guaranteed Convergence — Always reaches a local minimum
- Flexible — Works in any number of dimensions
- Memory Efficient — Linear space complexity
Disadvantages ❌
- Requires predefined k
- Sensitive to initialization (mitigated by k-means++)
- Assumes spherical clusters
- Sensitive to outliers
- Can converge to local minima
- Scale-dependent — Requires feature standardization
Best Practices
- Standardize data: `X_scaled = scale(X)` so no feature dominates the distance computation
- Find optimal k: Use the elbow or silhouette method (sketched below)
- Use k-means++: Better initialization
- Set tolerance: `tol = 1e-4`; limit iterations with `max_iter = 300`
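Putting the first two practices together, a brief elbow-method sketch (assuming scikit-learn; the synthetic data is purely for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import scale

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
X_scaled = scale(X)  # standardize first so no feature dominates the distances

ks = range(1, 11)
inertias = [
    KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    .fit(X_scaled).inertia_
    for k in ks
]
plt.plot(ks, inertias, marker="o")  # pick k at the "elbow" where gains flatten
plt.xlabel("k")
plt.ylabel("Inertia")
plt.show()
```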
Use Cases
✅ Customer segmentation
✅ Image compression (color quantization; see the sketch after this list)
✅ Document clustering
✅ Anomaly detection
✅ Exploratory data analysis
❌ Avoid for: non-spherical clusters, mixed data, or many outliers
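As a concrete example of the image-compression use case, color quantization clusters the pixel colors and repaints each pixel with its centroid. A sketch assuming scikit-learn and an (H, W, 3) float RGB array (the function name is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize(image, n_colors=16, seed=0):
    """Compress an (H, W, 3) RGB image down to n_colors distinct colors."""
    h, w, c = image.shape
    pixels = image.reshape(-1, c)  # one row per pixel
    km = KMeans(n_clusters=n_colors, n_init=4, random_state=seed).fit(pixels)
    # Repaint each pixel with the centroid color of its cluster
    return km.cluster_centers_[km.labels_].reshape(h, w, c)
```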
Performance Tips
- Large datasets: Use Mini-Batch K-Means for faster results (example after this list)
- High dimensions: Apply PCA or feature selection
- Early stopping: Use relaxed tolerance (e.g., tol = 1e-3)
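Combining the first two tips, a sketch that reduces dimensionality with PCA before running scikit-learn's `MiniBatchKMeans` (the data here is synthetic):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=100_000, n_features=50, centers=8, random_state=0)
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X)  # fewer dims
km = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=3, random_state=0)
labels = km.fit_predict(X_reduced)
```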
Comparison with Other Algorithms
| Algorithm | Time | Cluster Shape | Outlier Robustness | k Required |
|---|---|---|---|---|
| K-Means | O(nkdi) | Spherical | Low | Yes |
| K-Medoids | O(k(n-k)²i) | Any | High | Yes |
| DBSCAN | O(n log n) | Arbitrary | High | No |
| GMM | O(nkdi) | Elliptical | Medium | Yes |
| Hierarchical | O(n³) | Any | Medium | No |
When to Use
✅ Large datasets
✅ Well-separated spherical clusters
✅ Continuous numeric data
✅ Need quick, interpretable results
When to Avoid
❌ Unknown k
❌ Non-spherical or overlapping clusters
❌ Heavy noise or outliers
❌ Categorical or mixed data
Verdict ⭐⭐⭐⭐☆ (4/5)
K-Means is a fast, scalable, and easy-to-use clustering algorithm ideal for large datasets and exploratory tasks.
However, it’s sensitive to initialization, scale, and outliers.
Strengths: Speed, simplicity, scalability
Weaknesses: Sensitive to initialization/outliers
Best For: Customer segmentation, image compression, initial data exploration