php-kmeans

Multidimensional arrays and diversity clustering

Open LarryBarker opened this issue 2 years ago • 3 comments

Hello, thank you for sharing this package. I'm hoping to use it to help group users into diverse groups based on socioeconomic factors like race, gender, age, etc. Our dataset contains 20 factors that need to be taken into consideration. Have you used this to solve such a problem?

I've started some preliminary testing, and seem to be getting results but I can't tell what is happening behind the scenes. Furthermore, I would like to be able to weight each factor. For example, race may be the most important factor in some cases, while gender may be in others.

Here is what the data looks like:

 user_id => [
    race,
    gender,
    age
 ]

We store a numerical representation of each possible value:

array:10 [
  1 => array:3 [
    0 => -10
    1 => 6
    2 => 1
  ]
  2 => array:3 [
    0 => 3
    1 => 2
    2 => 1
  ]
  3 => array:3 [
    0 => 2
    1 => 1
    2 => 5
  ]
  4 => array:3 [
    0 => 9
    1 => 3
    2 => 4
  ]
  5 => array:3 [
    0 => -12
    1 => 6
    2 => 0
  ]
  6 => array:3 [
    0 => -6
    1 => 7
    2 => 3
  ]
  7 => array:3 [
    0 => 7
    1 => 7
    2 => 5
  ]
  8 => array:3 [
    0 => 4
    1 => 4
    2 => 0
  ]
  9 => array:3 [
    0 => 5
    1 => 7
    2 => 1
  ]
  10 => array:3 [
    0 => -11
    1 => 3
    2 => 2
  ]
]

I'm also curious: after the clustering is performed, is there any way to retrieve the original key for the data? I need to know which users are in each cluster.

If this is not the appropriate channel for this type of question, or beyond the scope of the repo, please let me know. I certainly appreciate any feedback you may have. Thank you :)

LarryBarker, Oct 07 '21 19:10

Hey @LarryBarker, thanks for submitting an issue.

> I can't tell what is happening behind the scenes.

You may use a callback function to tap into the algorithm execution :+1:

> Furthermore, I would like to be able to weight each factor. For example, race may be the most important factor in some cases, while gender may be in others.

I haven't implemented weighted k-means yet. Development effort is now focused on a new v3 version, designed to be easier to extend or override with your own custom algorithms. Have a look here: V3

> I'm curious as well, after the clustering is performed, is there any way to retrieve the original key for the data? This is needed because I need to know which users are in each cluster.

You can achieve this by assigning arbitrary data to points :+1:

Thank you for using PHP Kmeans and don't hesitate to let us know if you have any feature requests or if you encounter any bugs.

Have a nice day

bdelespierre, Oct 08 '21 08:10
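
For readers following along, here is a minimal sketch of the idea of keeping the original key attached to each point. It is written in plain Python rather than against the php-kmeans API, so the function name `group_keys_by_nearest` and the hand-picked centroids are assumptions made for illustration only; the rows are taken from the sample data in the issue.

```python
from math import dist

# A few rows from the issue: user_id -> [race, gender, age] codes.
users = {
    1: [-10, 6, 1],
    2: [3, 2, 1],
    3: [2, 1, 5],
    5: [-12, 6, 0],
}


def group_keys_by_nearest(users, centroids):
    """Return {centroid_index: [user_id, ...]} using Euclidean distance.

    The key travels with its vector, so cluster membership can be
    reported in terms of user ids rather than anonymous coordinates.
    """
    groups = {i: [] for i in range(len(centroids))}
    for user_id, vector in users.items():
        nearest = min(range(len(centroids)),
                      key=lambda i: dist(vector, centroids[i]))
        groups[nearest].append(user_id)
    return groups


# Hypothetical centroids, chosen by hand (not produced by a k-means run).
centroids = [[-11, 6, 0], [3, 2, 3]]
groups = group_keys_by_nearest(users, centroids)
print(groups)
```

The same pattern applies whatever the clustering library: as long as each point carries its key as attached data, the clusters can be mapped back to users afterwards.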

@bdelespierre Thanks for the quick reply! I realized after I posted I could attach data to points, so thank you for confirming that.

It's good to know that weighted k-means is something you have thought about. I assume it is doable? Do you have any resources that could help me implement my own?

LarryBarker, Oct 13 '21 20:10

There is unfortunately very little literature on the topic, so I just assumed this is not what users wanted. That being said, finding the centroid of a group of weighted points is a piece of cake. But I'm not quite sure how to interpret the results...

bdelespierre, Oct 16 '21 17:10
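
Two different kinds of weighting come up in this thread, and a small sketch may help tell them apart. This is plain Python, not part of php-kmeans; the helper names and the weight values are assumptions for illustration. `weighted_centroid` gives each point its own weight (the "centroid of weighted points" mentioned above), while `scale_features` gives each factor its own weight by multiplying every coordinate by the square root of that weight, so an unmodified Euclidean k-means on the scaled data behaves like a feature-weighted one.

```python
from math import sqrt


def weighted_centroid(points, weights):
    """Centroid of points where each point carries its own weight."""
    total = sum(weights)
    dims = len(points[0])
    return [sum(w * p[d] for p, w in zip(points, weights)) / total
            for d in range(dims)]


def scale_features(point, feature_weights):
    """Scale each coordinate by sqrt(weight).

    Plain squared Euclidean distance between scaled points equals the
    feature-weighted squared distance between the original points.
    """
    return [sqrt(w) * x for x, w in zip(point, feature_weights)]


def weighted_sq_distance(a, b, feature_weights):
    """Squared distance where each feature contributes w * (a - b)^2."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, feature_weights))


# Two rows from the issue; all weights below are made up for illustration.
a, b = [3, 2, 1], [2, 1, 5]
feature_weights = [4.0, 1.0, 0.25]  # hypothetical: race counts most, age least

centroid = weighted_centroid([a, b], [3.0, 1.0])  # per-point weights
sa = scale_features(a, feature_weights)
sb = scale_features(b, feature_weights)
plain_sq = sum((x - y) ** 2 for x, y in zip(sa, sb))
print(centroid, plain_sq, weighted_sq_distance(a, b, feature_weights))
```

Feature scaling only changes how much each factor influences the distance. It does not by itself address the interpretation question raised above, and with categorical codes such as race or gender the numeric distances may not be meaningful in the first place.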