FSharp.Stats icon indicating copy to clipboard operation
FSharp.Stats copied to clipboard

[Feature Request] Improve robustness of PearsonChiSquared

Open daz10000 opened this issue 6 months ago • 3 comments

Is your feature request related to a problem? Please describe.

I have datasets where the is a disparity in the scale of the observed vs theoretical distributions. Using the FSharp.Stats implementation of ChiSquared, gives crazy high chi statistics and a pvalue of zero every time. Other implementations give reasonable values.

Describe the solution you'd like I believe more robust methods fcan handle this type of data (I'm not an expert). Minimally, suggest looking at the Meta.Numerics implementation. I've excerpted the key fragments below. I don't think there's any need to change the ChiX distribution itself, just how the Chi statistic is aggregated. I can probably mix and match the two implementations to work out which piece needs fixing.

Describe alternatives you've considered The Meta.Numerics package gives robust answers. it is limited to integer counts in its contingency tables, so I was hoping use the FSharp.Stats version (which is also simpler to use). I have for now hacked the Meta.Numerics solution to use floats but ideally this version would have the same robustness.

Additional context

This is the meta.numerics implementation. Not it takes row and column counts into account and scales with a factor n.

// compute chi squared
            double chi2 = 0.0;
            for (int r = 0; r < rCounts.Length; r++) {
                for (int c = 0; c < cCounts.Length; c++) {
                    double n = ((double) rCounts[r]) * cCounts[c] / tCounts;
                    double z = counts[r, c] - n;
                    chi2 += z * z / n;
                }
            }

The fsharp.stats is just taking the observed and expected and squaring the difference.

let chi2 =
            Seq.zip observed expected
            |> Seq.fold (fun acc (obs,exp) -> 
                let d = obs - exp
                acc + (d * d) / exp) 0.0

As a concrete example

Using this https://www.socscistatistics.com/tests/chisquare2/default2.aspx calculator we have two vectors with 3 categories. The first (observed) vector is effectively [ 3 ; 1 ; 1 ]. The expected is [3134 ; 285 ; 868]. The calculator gives p=value 0.483447 and chi-square statistic 1.4536.

Image

The current package gives a very large statistic and PValue of 0.0;

let obs = [ 3 ; 1 ; 1 ] 
let expected = [3134 ; 285 ; 868]

let obsFloat = obs |> List.map float
let expectedFloat = expected |> List.map float

open FSharp.Stats.Testing.ChiSquareTest

let dof = obs.Length - 1
let result = compute dof obsFloat expectedFloat
> let result = compute dof obsFloat expectedFloat;;
val result: Testing.TestStatistics.ChiSquareStatistics =
  { Statistic = 4100065.333
    DegreesOfFreedom = 2.0
    PValueLeft = 1.0
    PValueRight = 0.0
    PValue = 0.0 }

Here's the meta.numerics equivalent

#r "nuget:Meta.Numerics"

open Meta.Numerics.Statistics

let rowLabels = [| 0 ; 1 ; 2 |]
let colLabels = [| 0 ; 1 |]
let table = Meta.Numerics.Statistics.ContingencyTable<int,int>(rowLabels,colLabels)
table.[0,0] <- 3134
table.[1,0] <- 285
table.[2,0] <- 868
table.[0,1] <- 3
table.[1,1] <- 1    
table.[2,1] <- 1

let chisq = table.PearsonChiSquaredTest()
chisq.Statistic
chisq.Probability

It reports same values to the online calculator, so the FSharp.Stats vers

> chisq.Statistic;;
val it: ContinuousTestStatistic =
  χ² = 1.4536291870114808
    {Distribution = Meta.Numerics.Statistics.Distributions.ChiSquaredDistribution;
     Name = "χ²";
     Value = 1.453629187;}

> chisq.Probability;;
val it: float = 0.4834465136

If I take their calculation for the Chi statistic, and feed it (1.453629187) into your implementation of the Chi distribution

(ugly hack but can make a nicer version of this)

let colCounts = List.zip obsFloat expectedFloat |> List.map (fun (o, e) -> o + e) |> Array.ofList
let rowCounts = [| obsFloat |> List.sum  ; expectedFloat |> List.sum |]

let tCount = obsFloat@expectedFloat |> List.sum
// modified chi-value code
let chiValue = 
    [|  for i in 0..obs.Length - 1 do
            for j in 0..1 do
                let n = rowCounts.[j] * colCounts.[i] / tCount
                let cell = if j = 0 then obsFloat.[i] else expectedFloat.[i]
                let z = cell - n
                yield z*z / n
    |] |> Array.sum

FSharp.Stats.Testing.TestStatistics.createChiSquare chiValue dof 

It gets a PValueRight that matches the value from Meta.Numerics. I would expect the PValue to match, I think so might still be an error here somewhere but it is at least in the right ball park. For now, I'm going to take this approach so I can start using your implementation but would appreciate thoughts from someone who has spent more time than I did, on whether the modified chi statistic is the way to go in general,

Thanks in advance..

val it: Testing.TestStatistics.ChiSquareStatistics =
  { Statistic = 1.453629187
    DegreesOfFreedom = 2.0
    PValueLeft = 0.516553486
    PValueRight = 0.483446514
    PValue = 0.966893028 }

>```

daz10000 avatar Apr 19 '25 00:04 daz10000