k-means-constrained

[Enhancement] Add support for sample_weight in the fit function

Open RowdyHowell opened this issue 3 years ago • 6 comments

The scikit-learn KMeans algorithm supports supplying a weight for each sample in the fit function. See the docs here.

Would it be possible to add this to the algorithm? That is, can the minimum and maximum bounds account for the sum of all weights instead of the count of all samples? I haven't read up on the MinCostFlow algorithm, so I don't know how feasible this would be.

RowdyHowell avatar Apr 28 '22 23:04 RowdyHowell

Hi, thanks for your interest in the library 🙂

It would be possible to support that. I think it would be equivalent to weighting the distances. The distances are generated here. You could change the euclidean_distances function to accept a weight parameter. I haven’t looked at how scikit-learn does it but it might be possible to backport the code from there.
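
As a rough sketch of that idea (the function name and weight handling here are hypothetical, not the library's actual code), each sample's distances could be scaled by its weight before being fed into the assignment step:

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def weighted_euclidean_distances(X, centers, sample_weight=None):
    """Hypothetical sketch: scale each sample's row of the distance
    matrix by that sample's weight. One possible reading of
    'weighting the distances'; not the library's implementation."""
    D = euclidean_distances(X, centers)
    if sample_weight is not None:
        # Broadcast the per-sample weights across all cluster columns.
        D = D * np.asarray(sample_weight, dtype=float)[:, None]
    return D

X = np.array([[0.0, 0.0], [1.0, 1.0]])
centers = np.array([[0.0, 0.0]])
print(weighted_euclidean_distances(X, centers, sample_weight=[1.0, 2.0]))
```

With a weight of 2, the second sample's distance to the center is doubled, making it twice as costly to place far from a center.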

I accept contributions to the library. If you would be willing to write a PR, I could review it.

joshlk avatar May 02 '22 11:05 joshlk

I've had a rethink...

I've had a look at how scikit-learn defines sample_weight:

The algorithm supports sample weights, which can be given by a parameter sample_weight. This allows to assign more weight to some samples when computing cluster centers and values of inertia. For example, assigning a weight of 2 to a sample is equivalent to adding a duplicate of that sample to the dataset.
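
That duplication equivalence is easy to check with plain (unconstrained) scikit-learn KMeans. This toy comparison assumes scikit-learn is installed; the data is made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0], [1.0], [10.0], [11.0]])
w = np.array([2.0, 1.0, 1.0, 1.0])  # first sample carries weight 2

# Fit once with sample_weight...
km_w = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X, sample_weight=w)

# ...and once with the weighted sample literally duplicated.
X_dup = np.vstack([X, [[0.0]]])
km_d = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_dup)

print(sorted(km_w.cluster_centers_.ravel()))
print(sorted(km_d.cluster_centers_.ravel()))
```

Both fits produce the same cluster centers, since the weight only enters through the centroid and inertia computations.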

Which I think is different to what I said:

I think it would be equivalent to weighting the distances.

and how you described it:

can we have the minimum and maximum bounds account for the sum of all weights instead of the count of all samples?

All of the above is possible - it's just about figuring out what to weight. Feel free to have a shot at it. I will also have a longer think about what is needed.

joshlk avatar May 02 '22 17:05 joshlk

@joshlk I think I have a similar need. In the problem I'm trying to solve, size_max is the sum of the weights of a cluster instead of the size of a cluster. A point of X is the centroid of a polygon, and the weight of that polygon (point of X) is the sum of its vertices. Do you think the algorithm can be easily modified to handle this?

hectoradrian961030 avatar Jul 03 '22 15:07 hectoradrian961030

Just stumbled across this project, nice work @joshlk ! I also have the same use case (each point has a weight, constraints are on the sum of the weights, distances can be weighted or unweighted). Thankfully all of my weights are integers (and none are very big), so I think as a workaround for now I will literally just duplicate the sample points when feeding them into the model. I may also have a look at your codebase and see if I think of anything.

ischmidt20 avatar Jul 23 '22 23:07 ischmidt20

So after thinking about this a little more, I don't think having the weights factor into the bounds is feasible. This would be equivalent to adjusting the supply at each node in the min-cost network-flow problem. However, even assuming the weights are integers, there is no requirement that all of the flow from one node follows the same arc. This means that when solving the linear program, fractions of points could be assigned to different clusters (e.g. a point with a weight of 10 could send 6 units of flow to one cluster and 4 to another). Ensuring that all of the flow goes in one direction would elevate the problem to an integer program, and while it would probably still be solvable for most use cases, it would increase running time dramatically. Of course, the algorithm could simply return the fractional assignments, but I can't imagine why this would be useful, and it would require adjusting the API (what would .labels_ be?).
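
A toy min-cost-flow instance makes the split concrete (networkx is used here purely for illustration, not by the library): a sample supplying 10 units of weight, facing clusters with remaining capacities 6 and 4, gets fractionally assigned.

```python
import networkx as nx

# One sample with weight 10; two clusters whose remaining capacities
# (via their arcs to the sink) are 6 and 4. Nothing forces all the
# flow onto a single arc, so the "point" is split between clusters.
G = nx.DiGraph()
G.add_node("sample", demand=-10)   # supplies 10 units of weight
G.add_node("sink", demand=10)
G.add_edge("sample", "clusterA", weight=1, capacity=10)  # cheaper cluster
G.add_edge("sample", "clusterB", weight=2, capacity=10)
G.add_edge("clusterA", "sink", weight=0, capacity=6)
G.add_edge("clusterB", "sink", weight=0, capacity=4)

flow = nx.min_cost_flow(G)
print(flow["sample"])  # {'clusterA': 6, 'clusterB': 4}
```

The solver sends as much flow as possible along the cheaper arc (6 units, the capacity of clusterA) and routes the remaining 4 units to clusterB, exactly the fractional assignment described above.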

I think it would be equivalent to weighting the distances.

This should still be feasible, considering that all this would do is weight the costs of each arc in the network flow problem. Sample weights could be used in this way without the algorithm needing to change at all to achieve the same result.

For example, assigning a weight of 2 to a sample is equivalent to adding a duplicate of that sample to the dataset.

The way this reads to me is that this is just the combination of the above. In unconstrained K-means, "adding a duplicate point" is only equivalent to weighting the distances, because two points at the same location will always end up in the same cluster. However, in the constrained problem, when you solve the min-cost network flow, the two copies could end up in different clusters, depending on the constraints. So this is also not really feasible.
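
The same effect shows up even in a toy assignment problem (scipy's linear_sum_assignment stands in for the real flow solver here, and the data is invented): two identical copies of a "weight-2" point, facing clusters that can each take only one point, are necessarily pulled apart.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Two identical copies of one point, two clusters with capacity one each.
copies = np.array([[0.0, 0.0], [0.0, 0.0]])
centers = np.array([[1.0, 0.0], [2.0, 0.0]])

# Pairwise distances from each copy to each cluster center.
cost = np.linalg.norm(copies[:, None, :] - centers[None, :, :], axis=2)

# One-to-one assignment: each copy must go to a different cluster.
rows, cols = linear_sum_assignment(cost)
print(cols)
```

Since the assignment is one-to-one, the two identical copies end up with different cluster labels, which is exactly why duplication is not equivalent to sample weighting under constraints.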

So it seems as if any sample_weight parameter can really only weight the distances, as described. However, as the conversation here so far indicates, there are many different interpretations of what "sample weight" could mean, so it is important that if such a feature is added, the documentation is clear on the matter.

ischmidt20 avatar Jul 24 '22 19:07 ischmidt20

I’m happy to take contributions to the project if you would like to have a go adding this as a feature

joshlk avatar Jul 26 '22 05:07 joshlk

I'm closing this ticket for now. Feel free to re-open the ticket and comment if you're still looking at this and want more assistance.

joshlk avatar Sep 09 '22 13:09 joshlk

I am also looking for this feature for a project I am working on. I would love to contribute, but I am not sure how: there are valid design concerns stated by @ischmidt20 that I'm not sure how best to handle. I do believe I know how weights should be applied: by a combination of point 1 and point 3 from your rethink comment (it should be equivalent to duplicating a point, with the bounds depending on the sum of all weights, since this behaves the same way when all weights == 1). I'm not sure whether simply weighting the edges would solve this problem, or whether it could be resolved by adjusting the means to account for the weights after they have been assigned, but I will definitely take a look at it!
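
For the "adjust the means" part, a weighted centroid update is straightforward. The function below is an illustrative sketch with made-up names and data, not the library's code:

```python
import numpy as np

def weighted_centers(X, labels, sample_weight, n_clusters):
    """Recompute each cluster center as the weighted mean of its
    assigned samples (hypothetical helper, for illustration only)."""
    centers = np.empty((n_clusters, X.shape[1]))
    for k in range(n_clusters):
        mask = labels == k
        centers[k] = np.average(X[mask], axis=0, weights=sample_weight[mask])
    return centers

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
w = np.array([3.0, 1.0, 1.0])
centers = weighted_centers(X, labels, w, n_clusters=2)
print(centers)
```

Here the first center is pulled toward the weight-3 point, landing at x = (0*3 + 2*1) / 4 = 0.5 rather than the unweighted midpoint of 1.0.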

Frank-Triolo avatar Oct 14 '22 17:10 Frank-Triolo