K Means clustering

Open sjwhitworth opened this issue 11 years ago • 8 comments

sjwhitworth avatar May 06 '14 23:05 sjwhitworth

I would be happy to work on this.

hpxro7 avatar May 31 '14 01:05 hpxro7

Hi @hpxro7 It would be great if you could fork this repo and open a new branch for this feature. After finishing your work, you can then just send us a pull request! :beer:

lazywei avatar May 31 '14 08:05 lazywei

Quick question: what is our clustering interface going to look like? I was thinking of introducing a SetAttribute, but this requires support within Instances for types longer than 64 bits, and I don't have time to refactor the code right now.

Sentimentron avatar May 31 '14 19:05 Sentimentron

Hi @lazywei, absolutely. I'll be getting on that now :+1:!

@Sentimentron Could you briefly expand on the purpose of SetAttribute? I'm assuming the clustering algorithms will adhere to the Estimator and Predictor interfaces.

I had some questions myself :). I may be misunderstanding the type, but is Instances meant exclusively for data with class labels, or should ClassIndex simply be omitted for unsupervised learning?

hpxro7 avatar Jun 01 '14 00:06 hpxro7

So I thought there might be a few possibilities for what gets returned from a clustering algorithm.

  • A slice of row numbers + clusters, packed in some struct
      • This requires introduction of an IntAttribute
  • A map of row numbers to clusters
  • A map of clusters to row numbers
  • A set of instances:

     cluster (IntAttribute)    members (SetAttribute)
     1                         1, 2, 3, 4
     2                         4, 5, 6, 7
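For illustration only, the first three shapes above could be sketched in plain Go as follows. Nothing here is golearn API: `Assignment`, `RowToCluster`, and `ClusterToRows` are hypothetical names, and the sketch uses built-in slices and maps rather than Instances or any attribute type.

```go
package main

import (
	"fmt"
	"sort"
)

// Assignment is a hypothetical struct pairing one row number with its
// cluster (the "slice of row numbers + clusters" option).
type Assignment struct {
	Row     int
	Cluster int
}

// RowToCluster converts a slice of assignments into a row -> cluster map
// (the "map of row numbers to clusters" option).
func RowToCluster(as []Assignment) map[int]int {
	m := make(map[int]int, len(as))
	for _, a := range as {
		m[a.Row] = a.Cluster
	}
	return m
}

// ClusterToRows inverts a row -> cluster map into cluster -> rows
// (the "map of clusters to row numbers" option). Rows are sorted so the
// result does not depend on Go's randomized map iteration order.
func ClusterToRows(m map[int]int) map[int][]int {
	out := make(map[int][]int)
	for row, c := range m {
		out[c] = append(out[c], row)
	}
	for c := range out {
		sort.Ints(out[c])
	}
	return out
}

func main() {
	as := []Assignment{{1, 1}, {2, 1}, {3, 1}, {5, 2}, {6, 2}, {7, 2}}
	byRow := RowToCluster(as)
	fmt.Println(byRow)
	fmt.Println(ClusterToRows(byRow))
}
```

One design difference worth noting: a row -> cluster map stores exactly one cluster per row, so it cannot express a row that belongs to two clusters, whereas the cluster -> members shapes can.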

WRT ClassIndex: the next batch of work I'm planning will allow more than one class attribute, or none at all, as per @lazywei's suggestion. For now, I'd probably check whether ClassIndex is set to -1 and, if it isn't, ignore that attribute.
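A minimal sketch of that check, assuming -1 means "no class attribute" and that columns are addressed by integer index. `featureColumns` is a hypothetical helper, not part of golearn:

```go
package main

import "fmt"

// featureColumns returns the column indices a clustering algorithm would
// use. Assumption: classIndex == -1 means no class attribute is set, so
// every column is kept; otherwise the class column is skipped.
func featureColumns(numCols, classIndex int) []int {
	cols := make([]int, 0, numCols)
	for c := 0; c < numCols; c++ {
		if classIndex != -1 && c == classIndex {
			continue // ignore the class attribute
		}
		cols = append(cols, c)
	}
	return cols
}

func main() {
	fmt.Println(featureColumns(4, -1)) // no class attribute set
	fmt.Println(featureColumns(4, 3))  // last column is the class
}
```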

Edit: Also, I will need to implement an IntAttribute, so I'll see if I can get that done today.

Sentimentron avatar Jun 01 '14 08:06 Sentimentron

OK, so IntAttribute implemented in #39

Sentimentron avatar Jun 01 '14 12:06 Sentimentron

Great, thanks a lot for the clarifications.

Until we've got SetAttribute I'll implement Predict of K-Means to return a map from row numbers to clusters.
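As a rough sketch of what that could look like, here is a self-contained K-Means whose Predict returns a map from row numbers to clusters. This is not the golearn implementation: the `KMeans` type and its `Fit` and `Predict` signatures are hypothetical, the data is a plain `[][]float64` instead of Instances, and centroids are seeded from the first K rows purely to keep the example deterministic.

```go
package main

import (
	"fmt"
	"math"
)

// KMeans is a hypothetical stand-in type. It assumes every row has the
// same number of columns.
type KMeans struct {
	K         int
	MaxIter   int
	Centroids [][]float64
}

// sqDist returns the squared Euclidean distance between two rows.
func sqDist(a, b []float64) float64 {
	s := 0.0
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return s
}

// nearest returns the index of the centroid closest to row.
func (km *KMeans) nearest(row []float64) int {
	best, bestDist := 0, math.Inf(1)
	for c, cen := range km.Centroids {
		if d := sqDist(row, cen); d < bestDist {
			best, bestDist = c, d
		}
	}
	return best
}

// Fit runs Lloyd's algorithm, seeding the centroids with the first K rows.
func (km *KMeans) Fit(data [][]float64) error {
	if km.K <= 0 || len(data) < km.K {
		return fmt.Errorf("need K > 0 and at least K rows (K=%d, rows=%d)", km.K, len(data))
	}
	km.Centroids = make([][]float64, km.K)
	for c := range km.Centroids {
		km.Centroids[c] = append([]float64(nil), data[c]...)
	}
	dim := len(data[0])
	for iter := 0; iter < km.MaxIter; iter++ {
		sums := make([][]float64, km.K)
		for c := range sums {
			sums[c] = make([]float64, dim)
		}
		counts := make([]int, km.K)
		// Assignment step: add each row to its nearest centroid's sum.
		for _, row := range data {
			c := km.nearest(row)
			counts[c]++
			for j, v := range row {
				sums[c][j] += v
			}
		}
		// Update step: move each non-empty cluster's centroid to its mean.
		changed := false
		for c := range km.Centroids {
			if counts[c] == 0 {
				continue // an empty cluster keeps its previous centroid
			}
			for j := range sums[c] {
				mean := sums[c][j] / float64(counts[c])
				if mean != km.Centroids[c][j] {
					changed = true
				}
				km.Centroids[c][j] = mean
			}
		}
		if !changed {
			break // converged
		}
	}
	return nil
}

// Predict returns a map from row number to cluster index.
func (km *KMeans) Predict(data [][]float64) map[int]int {
	out := make(map[int]int, len(data))
	for i, row := range data {
		out[i] = km.nearest(row)
	}
	return out
}

func main() {
	data := [][]float64{{0, 0}, {10, 10}, {0, 1}, {10, 11}}
	km := &KMeans{K: 2, MaxIter: 10}
	if err := km.Fit(data); err != nil {
		panic(err)
	}
	fmt.Println(km.Predict(data))
}
```

A real implementation would likely want random or k-means++ style seeding rather than the first K rows.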

hpxro7 avatar Jun 01 '14 22:06 hpxro7

OK, that sounds good.

Sentimentron avatar Jun 02 '14 08:06 Sentimentron