golearn
K Means clustering
I would be happy to work on this.
Hi @hpxro7, it would be great if you could fork this repo and open a new branch for this feature. After finishing your work, you can then just send us a pull request! :beer:
Quick question: what is our clustering interface going to look like? I was thinking of introducing a SetAttribute, but this requires support within Instances for types longer than 64 bits, and I don't have time to refactor the code right now.
Hi @lazywei, absolutely. I'll be getting on that now :+1:!
@Sentimentron Could you briefly expand on the purpose of SetAttribute? I'm assuming the clustering algorithms will adhere to the Estimator and Predictor interfaces.
I had some questions myself :). I may be misunderstanding the type, but is Instances meant exclusively for data with class labels, or should ClassIndex simply be omitted for unsupervised learning?
So I thought there might be a few possibilities for what gets returned from a clustering algorithm.
- A slice of row numbers + clusters, packed in some struct
  * This requires introduction of an IntAttribute
- A map of row numbers to clusters
- A map of clusters to row numbers
- A set of instances

| cluster (IntAttribute) | members (SetAttribute) |
| --- | --- |
| 1 | 1, 2, 3, 4 |
| 2 | 4, 5, 6, 7 |
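To make the first three options concrete, here is a rough sketch of how they might look as plain Go types and how one converts into another. Everything in it (`RowCluster`, `rowToCluster`, `clusterToRows`, the sample rows) is a hypothetical name or value made up for illustration, not part of golearn.

```go
package main

import (
	"fmt"
	"sort"
)

// RowCluster is a hypothetical struct pairing a row number with its cluster
// (option 1: a slice of these).
type RowCluster struct {
	Row     int
	Cluster int
}

// rowToCluster turns a slice of pairs into a map of row number -> cluster
// (option 2).
func rowToCluster(pairs []RowCluster) map[int]int {
	out := make(map[int]int, len(pairs))
	for _, p := range pairs {
		out[p.Row] = p.Cluster
	}
	return out
}

// clusterToRows inverts a row -> cluster map into cluster -> row numbers
// (option 3). Rows are sorted so the result does not depend on map order.
func clusterToRows(assign map[int]int) map[int][]int {
	out := make(map[int][]int)
	for row, c := range assign {
		out[c] = append(out[c], row)
	}
	for c := range out {
		sort.Ints(out[c])
	}
	return out
}

func main() {
	// Illustrative assignments only.
	pairs := []RowCluster{{1, 1}, {2, 1}, {3, 1}, {5, 2}, {6, 2}}
	byRow := rowToCluster(pairs)
	fmt.Println(byRow)
	fmt.Println(clusterToRows(byRow))
}
```

One thing the sketch makes visible: a row-to-cluster map can only hold one cluster per row, whereas the cluster-to-rows form could in principle list a row under several clusters.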
WRT ClassIndex, the next batch of work I'm planning will allow more than one class attribute, or none at all, as per @lazywei's suggestion. For now, I'd probably check whether ClassIndex is set to -1 and, if it isn't, ignore that attribute.
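Roughly, the check could look like the sketch below. It deliberately does not touch the real Instances type: `featureColumns` and its two integer parameters are hypothetical stand-ins for "number of attributes" and "ClassIndex", assumed here only to show the skip logic.

```go
package main

import "fmt"

// featureColumns returns the attribute indices a clusterer would use.
// classIndex == -1 is taken to mean "no class attribute"; any other value
// is skipped. Both parameters are assumptions made for this sketch.
func featureColumns(numAttrs, classIndex int) []int {
	cols := make([]int, 0, numAttrs)
	for i := 0; i < numAttrs; i++ {
		if classIndex != -1 && i == classIndex {
			continue // ignore the class attribute when one is set
		}
		cols = append(cols, i)
	}
	return cols
}

func main() {
	fmt.Println(featureColumns(4, -1)) // no class attribute: all columns kept
	fmt.Println(featureColumns(4, 2))  // column 2 is the class: it is skipped
}
```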
Edit: Also, I will need to implement an IntAttribute, so I'll see if I can get that done today.
OK, so IntAttribute implemented in #39
Great, thanks a lot for the clarifications.
Until we've got SetAttribute I'll implement Predict of K-Means to return a map from row numbers to clusters.
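For reference, a minimal sketch of what a Predict returning a row-to-cluster map might look like. This is not the golearn implementation: the `KMeans` struct, its fields, and the `[][]float64` input are assumptions for illustration, centroids are seeded with the first K rows to keep the example deterministic (real code would more likely pick them at random), and all rows are assumed to have the same length.

```go
package main

import (
	"fmt"
	"math"
)

// KMeans is a hypothetical, minimal estimator; it is not golearn's API.
type KMeans struct {
	K         int
	MaxIter   int
	Centroids [][]float64
}

// sqDist is the squared Euclidean distance between two equal-length rows.
func sqDist(a, b []float64) float64 {
	var s float64
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return s
}

// nearest returns the index of the centroid closest to row.
func (km *KMeans) nearest(row []float64) int {
	best, bestDist := 0, math.Inf(1)
	for c, centroid := range km.Centroids {
		if d := sqDist(row, centroid); d < bestDist {
			best, bestDist = c, d
		}
	}
	return best
}

// Fit runs Lloyd's algorithm, seeding centroids with the first K rows.
func (km *KMeans) Fit(data [][]float64) error {
	if km.K <= 0 || len(data) < km.K {
		return fmt.Errorf("invalid K=%d for %d rows", km.K, len(data))
	}
	km.Centroids = make([][]float64, km.K)
	for c := range km.Centroids {
		km.Centroids[c] = append([]float64(nil), data[c]...)
	}
	for iter := 0; iter < km.MaxIter; iter++ {
		sums := make([][]float64, km.K)
		counts := make([]int, km.K)
		for c := range sums {
			sums[c] = make([]float64, len(data[0]))
		}
		// Assignment step: accumulate each row into its nearest cluster.
		for _, row := range data {
			c := km.nearest(row)
			counts[c]++
			for j, v := range row {
				sums[c][j] += v
			}
		}
		// Update step: move each centroid to the mean of its rows.
		changed := false
		for c := range sums {
			if counts[c] == 0 {
				continue // keep an empty cluster's centroid where it is
			}
			for j := range sums[c] {
				mean := sums[c][j] / float64(counts[c])
				if mean != km.Centroids[c][j] {
					km.Centroids[c][j] = mean
					changed = true
				}
			}
		}
		if !changed {
			break // centroids are stable, so assignments will not change
		}
	}
	return nil
}

// Predict maps each row number to the index of its nearest centroid.
func (km *KMeans) Predict(data [][]float64) map[int]int {
	out := make(map[int]int, len(data))
	for i, row := range data {
		out[i] = km.nearest(row)
	}
	return out
}

func main() {
	data := [][]float64{{0, 0}, {10, 10}, {0, 1}, {10, 11}}
	km := &KMeans{K: 2, MaxIter: 10}
	if err := km.Fit(data); err != nil {
		panic(err)
	}
	fmt.Println(km.Predict(data))
}
```

The map keeps the return type simple until something like SetAttribute exists; grouping rows by cluster can be derived from it afterwards.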
OK, that sounds good.