
Added weighting of Silverman and Scott

tommyod opened this issue 3 years ago · 2 comments

tommyod · Nov 23 '20 21:11

Thanks for the comments @lukedyer-peak .

This was not as straightforward as I first thought. If you have any more thoughts, let me know.

  • The standard deviation is computed using ddof = 1, i.e. the sample standard deviation with n - 1 in the denominator. With weights, my immediate generalization was sum(weights) - 1, but the weights often sum to unity. I'm considering scaling the weights so that the smallest weight equals one; that way the sample standard deviation subtracts the smallest weight. But I don't think that's a common way of doing it.
  • Weighted percentiles were also non-trivial. I found some code snippets online, but none that were very good. Many failed the property that repeated observations should be equivalent to integer weights, i.e. that data = [0, 1, 1] should give the same result as data = [0, 1] with weights = [1, 2].
  • I believe the intuitive property that data = [0, 1, 1] should give the same result as data = [0, 1] with weights = [1, 2] should apply to the entire KDEpy library. I don't see any other possible interpretation that makes sense (see the sketch after this list for a check of this property).
  • Weights should probably not be allowed to be zero (which is equivalent to the data point not being there in the first place). This choice should be consistent, but it's most important in the first check of the weights. (Many sub-routines also check the weights, just for sanity.)
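A minimal sketch of the frequency-weight reading of these points, using sum(weights) - 1 as one possible generalization of ddof = 1 (the function name is illustrative, not part of KDEpy):

```python
import numpy as np


def weighted_std(data, weights):
    """Weighted sample std treating weights as repetition (frequency) counts.

    The sum(weights) - 1 denominator is one possible generalization of ddof=1;
    it reduces to the ordinary sample standard deviation for unit weights.
    """
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = np.average(data, weights=weights)
    variance = np.sum(weights * (data - mean) ** 2) / (np.sum(weights) - 1)
    return np.sqrt(variance)


# Repeated observations are equivalent to integer weights ...
assert np.isclose(weighted_std([0, 1, 1], [1, 1, 1]), weighted_std([0, 1], [1, 2]))
# ... and unit weights recover the unweighted estimate with ddof=1.
assert np.isclose(weighted_std([0, 1, 1], [1, 1, 1]), np.std([0, 1, 1], ddof=1))
```

With weights normalised to sum to one the denominator is zero, which is exactly the problem with this generalization described in the first point.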

tommyod · Nov 26 '20 10:11

  • The standard deviation is computed using ddof = 1, i.e. the sample standard deviation with n - 1 in the denominator. With weights, my immediate generalization was sum(weights) - 1, but the weights often sum to unity. I'm considering scaling the weights so that the smallest weight equals one; that way the sample standard deviation subtracts the smallest weight. But I don't think that's a common way of doing it.

I think it would be helpful to define what is meant by the weights. I'm not a statistical expert, but there are two different meanings the weights can have here. Restricting to one case or the other might help, and documenting what these weights mean would be useful too. Wikipedia describes two different ways of calculating a weighted standard deviation, with either frequency or reliability weights (note that in some formulas on that page the weights are assumed to be normalised so that they sum to 1). I personally think it might be best to go with reliability weights, which the GNU Scientific Library also uses. In some places reliability weights are just referred to as weights and frequency weights as frequencies; see this explanation in a SAS blog.
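To make the distinction concrete, here is a sketch of the reliability-weight estimator from that Wikipedia page (the function name is only illustrative):

```python
import numpy as np


def std_reliability_weights(data, weights):
    """Weighted std with reliability weights: denominator V1 - V2 / V1,
    where V1 = sum(w) and V2 = sum(w**2). Invariant to rescaling the weights,
    so weights normalised to sum to one are fine.
    """
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = np.average(data, weights=weights)
    v1, v2 = np.sum(weights), np.sum(weights ** 2)
    return np.sqrt(np.sum(weights * (data - mean) ** 2) / (v1 - v2 / v1))


# Rescaling the weights does not change the result ...
assert np.isclose(std_reliability_weights([0, 1, 2], [2, 3, 5]),
                  std_reliability_weights([0, 1, 2], [0.2, 0.3, 0.5]))
# ... and unit weights reduce to the ordinary sample std (ddof=1).
assert np.isclose(std_reliability_weights([0, 1, 2], [1, 1, 1]),
                  np.std([0, 1, 2], ddof=1))
```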

  • Weighted percentiles were also non-trivial. I found some code snippets online, but none that were very good. Many failed the property that repeated observations should be equivalent to integer weights, i.e. that data = [0, 1, 1] should give the same result as data = [0, 1] with weights = [1, 2].

I think this logic (of using reliability weights) should follow through naturally to calculating quantiles. One could think of sampling with these weights and taking quantiles from the sampled distribution. Following that logic through would lead to something like this code snippet from SO.
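For what it's worth, something along these lines keeps the repeated-observation property and reduces to np.quantile (default linear interpolation) for unit weights; the name and the breakpoint construction below are my own illustrative choices rather than the snippet linked above:

```python
import numpy as np


def weighted_quantile(data, q, weights):
    """Weighted quantile consistent with repeating each observation weights[i] times."""
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(data)
    data, weights = data[order], weights[order]

    # Virtual index range [first_i, last_i] each observation would occupy
    # if integer weights were expanded into repeated data points.
    last = np.cumsum(weights) - 1
    first = last - (weights - 1)

    # Interleave (first, value) and (last, value) as interpolation breakpoints;
    # duplicated breakpoints carry equal values, so interpolation stays well defined.
    xp = np.ravel(np.column_stack([first, last]))
    fp = np.repeat(data, 2)

    return np.interp(q * (np.sum(weights) - 1), xp, fp)


# data = [0, 1] with weights = [1, 2] behaves like data = [0, 1, 1] ...
assert np.isclose(weighted_quantile([0, 1], 0.5, [1, 2]), np.quantile([0, 1, 1], 0.5))
# ... and unit weights reproduce np.quantile.
assert np.isclose(weighted_quantile([0, 1, 2], 0.25, [1, 1, 1]), np.quantile([0, 1, 2], 0.25))
```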

  • Weights should probably not be allowed to be zero (which is equivalent to the data point not being there in the first place). This choice should be consistent, but it's most important in the first check of the weights. (Many sub-routines also check the weights, just for sanity.)

I have some personal motivation to allow 0 weights, which would correspond to ignoring that observation; that's how I'm planning to use this package. (I can implement this logic on my side, though.) There is also evidence for this approach being "standard" or "expected", as numpy allows weights to be 0 (and probabilities to be 0 in the random module).
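Both NumPy behaviours mentioned here do accept zeros, for example:

```python
import numpy as np

# A zero weight in np.average simply drops that observation from the mean.
assert np.average([0.0, 1.0, 5.0], weights=[1, 1, 0]) == 0.5

# A zero probability in the random module means the value is never drawn.
rng = np.random.default_rng(42)
samples = rng.choice([0, 1, 5], size=1000, p=[0.5, 0.5, 0.0])
assert 5 not in samples
```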

lukedyer-peak · Dec 02 '20 16:12