Normality test

Open 8ctopus opened this issue 1 year ago • 2 comments

First of all, thank you for this amazing library! I also apologize if I have overlooked something, as I'm not a math genius.

I'm wondering if there is any implementation of normality tests yet?

The idea is to take a set of data, for example the heights of students in a college, and check whether it follows a normal distribution (Gaussian curve).

8ctopus avatar Nov 11 '24 07:11 8ctopus

Hi @8ctopus,

Thank you for your interest in MathPHP.

We have the χ² (chi-squared) test in Statistics\Significance, which can be used in further calculations to get at what you are asking, but I don't think we currently have any normality tests that return a true/false answer or a probability.
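For reference, here is a minimal sketch of how a chi-squared goodness-of-fit calculation could be turned into a rough normality check. It is self-contained PHP and does not call MathPHP: the function name `chiSquaredNormalityCheck`, the choice of eight equal-probability bins, and the returned array keys are all assumptions made for illustration. Because the mean and standard deviation are estimated from the same data, the 5% threshold is only approximate, and the sample should have at least about 40 values so that each bin expects 5 or more.

```php
<?php
declare(strict_types=1);

/**
 * Rough chi-squared goodness-of-fit check against a normal distribution.
 * Hypothetical helper for illustration, not part of MathPHP.
 *
 * @param float[] $data
 *
 * @return array chi2 statistic, degrees of freedom, bin counts, 5% decision
 */
function chiSquaredNormalityCheck(array $data): array
{
    $n = count($data);
    if ($n < 2) {
        throw new InvalidArgumentException('Need at least two values');
    }

    // Sample mean and sample standard deviation (n - 1 denominator)
    $mean  = array_sum($data) / $n;
    $sumSq = 0.0;
    foreach ($data as $x) {
        $sumSq += ($x - $mean) ** 2;
    }
    $sd = sqrt($sumSq / ($n - 1));
    if ($sd == 0.0) {
        throw new InvalidArgumentException('Data has zero variance');
    }

    // Octiles of the standard normal: each of the 8 bins has probability 1/8
    $edges    = [-1.1503, -0.6745, -0.3186, 0.0, 0.3186, 0.6745, 1.1503];
    $observed = array_fill(0, 8, 0);
    foreach ($data as $x) {
        $z   = ($x - $mean) / $sd;
        $bin = 0;
        while ($bin < 7 && $z > $edges[$bin]) {
            $bin++;
        }
        $observed[$bin]++;
    }

    // Chi-squared statistic: sum of (observed - expected)^2 / expected
    $expected = $n / 8;
    $chi2     = 0.0;
    foreach ($observed as $count) {
        $chi2 += ($count - $expected) ** 2 / $expected;
    }

    // 8 bins - 1 - 2 estimated parameters = 5 degrees of freedom;
    // 11.070 is the 95th percentile of the chi-squared distribution with 5 df
    return [
        'chi2'             => $chi2,
        'df'               => 5,
        'observed'         => $observed,
        'rejectAt5Percent' => $chi2 > 11.070,
    ];
}
```

Calling `chiSquaredNormalityCheck($heights)` on the student heights and reading `rejectAt5Percent` gives the kind of true/false answer discussed above. A `false` result only means this particular check found no evidence against normality.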

I think it is a good feature to add. The Wikipedia article lists many tests. If you had to pick only one to implement, which one would be the most useful to have implemented?

Thanks again for your suggestions and feedback. Mark

markrogoyski avatar Nov 12 '24 00:11 markrogoyski

@markrogoyski Hello Mark,

I have used the chi-squared test before in medical statistics, and it works great provided the data is normally distributed (if not, you can't use it, as you already know).

So far, I have roughly tested normality two ways:

  • creating a histogram, then drawing it (if the curve looks normal then it most likely is)
  • using a rough approximation that compares the mean and the median:
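/**
 * Approximate normality test
 *
 * Returns the relative difference between mean and median.
 * A value near 0 indicates symmetry, which is necessary but not sufficient for normality.
 *
 * @param array $data
 *
 * @return float - relative difference as a fraction (0 means mean and median coincide)
 *
 * @note found here https://www.paulstephenborile.com/2018/03/code-benchmarks-can-measure-fast-software-make-faster/
 */
public static function testNormality(array $data) : float
{
    $mean = self::mean($data);
    $median = self::median($data);

    // Scale by the larger magnitude so negative data cannot flip the sign
    $scale = max(abs($mean), abs($median));

    // Mean and median are both zero: nothing to compare, and avoids division by zero
    if ($scale == 0) {
        return 0.0;
    }

    return abs($mean - $median) / $scale;
}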
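The histogram approach from the first bullet can be sketched roughly as follows. This is only an illustration: `histogramCounts` and `renderHistogram` are hypothetical helper names, the bins are equal-width between the minimum and maximum, and the text rendering stands in for an actual plot.

```php
<?php
declare(strict_types=1);

/**
 * Count how many values fall into each of $bins equal-width bins.
 * Hypothetical helper for illustration.
 *
 * @param array $data numeric values
 * @param int   $bins number of bins
 *
 * @return int[] counts per bin, lowest bin first
 */
function histogramCounts(array $data, int $bins): array
{
    if ($bins < 1 || count($data) === 0) {
        throw new InvalidArgumentException('Need data and at least one bin');
    }

    $min    = min($data);
    $max    = max($data);
    $width  = ($max - $min) / $bins;
    $counts = array_fill(0, $bins, 0);

    foreach ($data as $x) {
        // The maximum value would index one past the end, so clamp it into the last bin.
        // If all values are equal ($width is 0), everything goes into bin 0.
        $index = $width > 0 ? min($bins - 1, (int) floor(($x - $min) / $width)) : 0;
        $counts[$index]++;
    }

    return $counts;
}

/**
 * Render bin counts as rows of '#' characters, one row per bin.
 */
function renderHistogram(array $counts): string
{
    $lines = [];
    foreach ($counts as $i => $count) {
        $lines[] = sprintf('%2d | %s', $i, str_repeat('#', $count));
    }

    return implode(PHP_EOL, $lines);
}
```

For example, `echo renderHistogram(histogramCounts([1, 2, 2, 3, 3, 3, 4, 4, 5], 5));` prints five rows with 1, 2, 3, 2 and 1 `#` characters, the kind of symmetric bell shape one would look for by eye.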

Both approaches are empirical and therefore I don't think they fit into your library.

Going back to the Wikipedia article, it says:

A 2011 study concludes that Shapiro–Wilk has the best power for a given significance, followed closely by Anderson–Darling when comparing the Shapiro–Wilk, Kolmogorov–Smirnov, Lilliefors, and Anderson–Darling tests.[1]

So my best guess, as I have no experience, would be the Shapiro-Wilk test. I actually found an article that explains really well how it works:

https://medium.com/@austinej86/understanding-the-shapiro-wilk-test-a-key-tool-for-testing-normality-14ae5107b6b5
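For context, the Shapiro–Wilk statistic is usually written as shown below. The notation is the standard textbook one rather than anything taken from MathPHP: $x_{(i)}$ is the $i$-th smallest value in the sample, $\bar{x}$ is the sample mean, and the coefficients $a_i$ are derived from the expected values and covariances of the order statistics of a standard normal sample.

```latex
$$
W = \frac{\left( \sum_{i=1}^{n} a_i \, x_{(i)} \right)^{2}}
         {\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^{2}}
$$
```

Values of $W$ close to 1 are consistent with normality, while small values count as evidence against it. Most of the implementation effort would go into computing the $a_i$ coefficients and the p-value, which this formula does not cover.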

8ctopus avatar Nov 12 '24 08:11 8ctopus