BUG: Kruskal-Wallis test silently produces NaNs as test statistics and p-value if there are NaNs present in the input
Description
Python version: 3.11.5 SciPy version: 1.11.3
When there are NaNs in the input, the test silently produces NaN as its output (i.e., the resulting statistic and p-value). It should at least emit a warning suggesting that NaNs in the input may be responsible. Otherwise the user may think that the p-value is 0 or 1, and in general the issue can be cumbersome to debug without any feedback.
See other people also struggling with this issue: https://stackoverflow.com/questions/77087907/kruskal-wallis-test-always-gives-nan-values
Reproducing Code Example
import numpy as np
from scipy.stats import kruskal
# Sample data: scores from two different teaching methods
group1 = np.array([20, 21, 19, 22, 24, 23, 21, 20, 21, 19, 20])
group2 = np.array([22, 24, 25, 23, 23, 25, 26, 24, 21, 22, 30])
# Perform the Kruskal-Wallis H-test
stat, p = kruskal(group1, group2)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# Same data, but with one NaN in group1
group1 = np.array([20, 21, 19, 22, 24, 23, 21, 20, 21, 19, np.nan])
group2 = np.array([22, 24, 25, 23, 23, 25, 26, 24, 21, 22, 30])
# Perform the Kruskal-Wallis H-test
stat, p = kruskal(group1, group2)
print('Statistics=%.3f, p=%.3f' % (stat, p))
Error message
Statistics=9.678, p=0.002
Statistics=nan, p=nan
# Well, it is not an error per se, but a bug from the usability point of view.
SciPy/NumPy/Python version and system information
Python version: 3.11.5
SciPy version: 1.11.3
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
SciPy: 1.11.3, NumPy: 1.24.3, sys.version_info(major=3, minor=11, micro=5, releaselevel='final', serial=0)
Build Dependencies:
blas:
detection method: pkgconfig
found: true
include directory:
lib directory:
name: mkl-sdl
openblas configuration: unknown
pc file directory:
version: '2023.1'
lapack:
detection method: pkgconfig
found: true
include directory:
lib directory:
name: mkl-sdl
openblas configuration: unknown
pc file directory:
version: '2023.1'
pybind11:
detection method: pkgconfig
include directory:
name: pybind11
version: 2.10.4
Compilers:
c:
commands: /croot/scipy_1696543286448/_build_env/bin/x86_64-conda-linux-gnu-cc
linker: ld.bfd
name: gcc
version: 11.2.0
c++:
commands: /croot/scipy_1696543286448/_build_env/bin/x86_64-conda-linux-gnu-c++
linker: ld.bfd
name: gcc
version: 11.2.0
cython:
commands: cython
linker: cython
name: cython
version: 0.29.36
fortran:
commands: /croot/scipy_1696543286448/_build_env/bin/x86_64-conda-linux-gnu-gfortran
linker: ld.bfd
name: gcc
version: 11.2.0
pythran:
include directory: ../../_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.11/site-packages/pythran
version: 0.13.1
Machine Information:
build:
cpu: x86_64
endian: little
family: x86_64
system: linux
cross-compiled: false
host:
cpu: x86_64
endian: little
family: x86_64
system: linux
Python Information:
path:
version: '3.11'
Also, the same behaviour (NaNs in the output) is observed if one of the input vectors is empty. That is not the same as some values missing/being NaN, yet the output is identical in both cases: stat=nan and p=nan.
The same behavior is observed in most stats reducing functions, all NumPy reducing functions like np.sum, all (to my knowledge) elementwise functions in scipy.special and NumPy, and much more, so this is expected behavior, not a defect. If you want to make a case that NaNs in input should produce warnings for all similar functions, I'd suggest contacting the mailing list.
Thank you for the suggestion, I will do so.
Note that there is a nan_policy='raise' option if you want NaNs to raise errors, and nan_policy='omit' can be thought of as removing NaNs from the input before performing the calculation. The default is nan_policy='propagate', which returns the result that would be produced if NaNs were to propagate through the calculation (usually NaN, as you observed).
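For illustration, the three policies on a small version of the example from this report (behavior as observed with SciPy 1.11):

```python
import numpy as np
from scipy.stats import kruskal

group1 = np.array([20, 21, 19, 22, np.nan])
group2 = np.array([22, 24, 25, 23, 30])

# Default 'propagate': a NaN in the input yields a NaN statistic and p-value
stat, p = kruskal(group1, group2)
print(stat, p)  # nan nan

# 'omit': NaNs are dropped before the test is performed
stat, p = kruskal(group1, group2, nan_policy='omit')
print(stat, p)  # finite values

# 'raise': NaNs in the input raise a ValueError
try:
    kruskal(group1, group2, nan_policy='raise')
except ValueError as exc:
    print(exc)
```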
Interesting, why is nan_policy not raise by default? Seems like that's much safer behavior. And could it be adjusted through a deprecation cycle? (I'd argue that, in the context of SciPy, raising on NaNs makes a lot more sense than in NumPy, since here you know what the user is trying to compute.)
Interesting, why is nan_policy not raise by default?
Many stats functions have behaved that way since the beginning because propagating is what NaNs do naturally, and since then, the default behavior has not been changed.
And could be adjusted through a deprecation cycle?
Technically, yes. The default nan_policy could immediately change from 'propagate' to unspecified (like np._NoValue or None). Unspecified nan_policy with NaNs present would cause an appropriate deprecation warning to be emitted. The warning would state that during the deprecation period, nan_policy must be specified explicitly to silence the warning, and that after the deprecation period the default behavior would change. From there, the code would work just like 'propagate' for two releases. In the third release, the warning would go away and the default would change.
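A minimal sketch of that mechanism, layered over scipy.stats.kruskal rather than modifying SciPy itself (the sentinel object and the warning text are illustrative, not actual SciPy code):

```python
import warnings

import numpy as np
from scipy import stats

_NoValue = object()  # stand-in for a sentinel like np._NoValue (illustrative)

def kruskal(*samples, nan_policy=_NoValue):
    # Hypothetical sketch of the deprecation mechanism described above;
    # not actual SciPy code.
    has_nan = any(np.isnan(np.asarray(s, dtype=float)).any() for s in samples)
    if nan_policy is _NoValue:
        if has_nan:
            warnings.warn(
                "The input contains NaN and nan_policy was not specified; "
                "pass nan_policy explicitly to silence this warning. The "
                "default behavior will change in a future release.",
                DeprecationWarning, stacklevel=2)
        nan_policy = 'propagate'  # deprecation-period behavior matches today's
    return stats.kruskal(*samples, nan_policy=nan_policy)
```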
Technically, it would not be that hard on our side, because essentially all functions that have any NaN handling go through the same _contains_nan function. Changing all the tests would be a little more work. I suspect that such a PR would be similar to gh-20694 in scope.
Other options on SciPy's side include:
- waiting for SciPy 2.0
- making a new nan_policy='warn' ('propagate' plus warning) the default immediately
- adding an environment variable that sets the default nan_policy on one's machine
- documenting explicitly that there are two common causes of NaNs in the output of stats functions: NaNs in the input or samples that are too small. NaNs do not mean that the p-value is 0 or 1.
The much harder part would be getting enough support for one of these changes, except maybe the last option. There hasn't been much participation here or on the mailing list (now discussion forum) over the past 20 years, it seems. So maybe this is something that can be adjusted locally, with a function that accepts the stats namespace and returns an object that mimics the namespace except that it changes the default nan_policy of its functions. I had ChatGPT whip up such a function, and it worked in limited testing. (I'd rather not have it posted on SciPy's repo, though, since I don't think we have even discussed a policy w.r.t. code from generative AI.)
import numpy as np
from scipy import stats as _stats  # rename to avoid a suspected late-binding gotcha

# wrap_module_with_nan_policy: the helper described above (its definition
# was not posted in the thread)
stats = wrap_module_with_nan_policy(_stats)

stats.kruskal([1, 2, 3], [4, 5, 6], [np.nan, 8, 9])
# ValueError: The input contains nan values

stats.kruskal([1, 2, 3], [4, 5, 6], [np.nan, 8, 9], nan_policy='propagate')
# KruskalResult(statistic=nan, pvalue=nan)
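The wrapper itself was not posted in the thread. For reference, a minimal hypothetical sketch of what such a helper could look like (the name wrap_module_with_nan_policy and the signature-inspection approach are assumptions, not the actual code from the thread):

```python
import functools
import inspect
import types

import numpy as np
from scipy import stats as _stats

def wrap_module_with_nan_policy(module, default='raise'):
    # Hypothetical helper: mirror `module` in a namespace where every
    # function that accepts a nan_policy keyword defaults it to `default`.
    wrapped = types.SimpleNamespace()
    for name in dir(module):
        obj = getattr(module, name)
        try:
            params = inspect.signature(obj).parameters if callable(obj) else {}
        except (TypeError, ValueError):
            params = {}
        if 'nan_policy' in params:
            @functools.wraps(obj)
            def func(*args, _func=obj, **kwargs):  # bind obj now, not late
                kwargs.setdefault('nan_policy', default)
                return _func(*args, **kwargs)
            setattr(wrapped, name, func)
        else:
            setattr(wrapped, name, obj)  # pass everything else through as-is
    return wrapped

stats = wrap_module_with_nan_policy(_stats)
```

Explicitly passing nan_policy still overrides the injected default, since kwargs.setdefault only fills it in when absent.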
Could you try that out and see if it meets your needs?