FSharp.Stats
FSharp.Stats copied to clipboard
[Feature Request] Improve robustness of PearsonChiSquared
Is your feature request related to a problem? Please describe.
I have datasets where the is a disparity in the scale of the observed vs theoretical distributions. Using the FSharp.Stats implementation of ChiSquared, gives crazy high chi statistics and a pvalue of zero every time. Other implementations give reasonable values.
Describe the solution you'd like I believe more robust methods fcan handle this type of data (I'm not an expert). Minimally, suggest looking at the Meta.Numerics implementation. I've excerpted the key fragments below. I don't think there's any need to change the ChiX distribution itself, just how the Chi statistic is aggregated. I can probably mix and match the two implementations to work out which piece needs fixing.
Describe alternatives you've considered
The Meta.Numerics package gives robust answers. it is limited to integer counts in its contingency tables, so I was hoping use the FSharp.Stats version (which is also simpler to use). I have for now hacked the Meta.Numerics solution to use floats but ideally this version would have the same robustness.
Additional context
This is the meta.numerics implementation. Not it takes row and column counts into account and scales with a factor n.
// compute chi squared
double chi2 = 0.0;
for (int r = 0; r < rCounts.Length; r++) {
for (int c = 0; c < cCounts.Length; c++) {
double n = ((double) rCounts[r]) * cCounts[c] / tCounts;
double z = counts[r, c] - n;
chi2 += z * z / n;
}
}
The fsharp.stats is just taking the observed and expected and squaring the difference.
let chi2 =
Seq.zip observed expected
|> Seq.fold (fun acc (obs,exp) ->
let d = obs - exp
acc + (d * d) / exp) 0.0
As a concrete example
Using this https://www.socscistatistics.com/tests/chisquare2/default2.aspx calculator we have two vectors with 3 categories. The first (observed) vector is effectively [ 3 ; 1 ; 1 ]. The expected is [3134 ; 285 ; 868]. The calculator gives p=value 0.483447 and chi-square statistic 1.4536.
The current package gives a very large statistic and PValue of 0.0;
let obs = [ 3 ; 1 ; 1 ]
let expected = [3134 ; 285 ; 868]
let obsFloat = obs |> List.map float
let expectedFloat = expected |> List.map float
open FSharp.Stats.Testing.ChiSquareTest
let dof = obs.Length - 1
let result = compute dof obsFloat expectedFloat
> let result = compute dof obsFloat expectedFloat;;
val result: Testing.TestStatistics.ChiSquareStatistics =
{ Statistic = 4100065.333
DegreesOfFreedom = 2.0
PValueLeft = 1.0
PValueRight = 0.0
PValue = 0.0 }
Here's the meta.numerics equivalent
#r "nuget:Meta.Numerics"
open Meta.Numerics.Statistics
let rowLabels = [| 0 ; 1 ; 2 |]
let colLabels = [| 0 ; 1 |]
let table = Meta.Numerics.Statistics.ContingencyTable<int,int>(rowLabels,colLabels)
table.[0,0] <- 3134
table.[1,0] <- 285
table.[2,0] <- 868
table.[0,1] <- 3
table.[1,1] <- 1
table.[2,1] <- 1
let chisq = table.PearsonChiSquaredTest()
chisq.Statistic
chisq.Probability
It reports same values to the online calculator, so the FSharp.Stats vers
> chisq.Statistic;;
val it: ContinuousTestStatistic =
χ² = 1.4536291870114808
{Distribution = Meta.Numerics.Statistics.Distributions.ChiSquaredDistribution;
Name = "χ²";
Value = 1.453629187;}
> chisq.Probability;;
val it: float = 0.4834465136
If I take their calculation for the Chi statistic, and feed it (1.453629187) into your implementation of the Chi distribution
(ugly hack but can make a nicer version of this)
let colCounts = List.zip obsFloat expectedFloat |> List.map (fun (o, e) -> o + e) |> Array.ofList
let rowCounts = [| obsFloat |> List.sum ; expectedFloat |> List.sum |]
let tCount = obsFloat@expectedFloat |> List.sum
// modified chi-value code
let chiValue =
[| for i in 0..obs.Length - 1 do
for j in 0..1 do
let n = rowCounts.[j] * colCounts.[i] / tCount
let cell = if j = 0 then obsFloat.[i] else expectedFloat.[i]
let z = cell - n
yield z*z / n
|] |> Array.sum
FSharp.Stats.Testing.TestStatistics.createChiSquare chiValue dof
It gets a PValueRight that matches the value from Meta.Numerics. I would expect the PValue to match, I think so might still be an error here somewhere but it is at least in the right ball park. For now, I'm going to take this approach so I can start using your implementation but would appreciate thoughts from someone who has spent more time than I did, on whether the modified chi statistic is the way to go in general,
Thanks in advance..
val it: Testing.TestStatistics.ChiSquareStatistics =
{ Statistic = 1.453629187
DegreesOfFreedom = 2.0
PValueLeft = 0.516553486
PValueRight = 0.483446514
PValue = 0.966893028 }
>```