BUG: Kruskal-Wallis test silently produces NaNs as test statistics and p-value if there are NaNs present in the input
Description
Python version: 3.11.5 SciPy version: 1.11.3
When there are NaNs in the input, the test silently produces NaN as its output (i.e., the resulting statistic and p-value). It should at least emit a warning suggesting that NaNs in the input may be responsible. Otherwise the user may think that the p-value is 0 or 1, and in general the issue can be cumbersome to debug without any feedback.
See other people also struggling with this issue: https://stackoverflow.com/questions/77087907/kruskal-wallis-test-always-gives-nan-values
Reproducing Code Example
import numpy as np
from scipy.stats import kruskal
# Sample data: scores from two different teaching methods
group1 = np.array([20, 21, 19, 22, 24, 23, 21, 20, 21, 19, 20])
group2 = np.array([22, 24, 25, 23, 23, 25, 26, 24, 21, 22, 30])
# Perform the Kruskal-Wallis H-test
stat, p = kruskal(group1, group2)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# Same data, but with one NaN in group1
group1 = np.array([20, 21, 19, 22, 24, 23, 21, 20, 21, 19, np.nan])
group2 = np.array([22, 24, 25, 23, 23, 25, 26, 24, 21, 22, 30])
# Perform the Kruskal-Wallis H-test
stat, p = kruskal(group1, group2)
print('Statistics=%.3f, p=%.3f' % (stat, p))
Error message
Statistics=9.678, p=0.002
Statistics=nan, p=nan
# Well, it is not an error per se, but a bug from the usability point of view.
SciPy/NumPy/Python version and system information
Python version: 3.11.5
SciPy version: 1.11.3
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
SciPy: 1.11.3, NumPy: 1.24.3, sys.version_info(major=3, minor=11, micro=5, releaselevel='final', serial=0)
Build Dependencies:
blas:
detection method: pkgconfig
found: true
include directory:
lib directory:
name: mkl-sdl
openblas configuration: unknown
pc file directory:
version: '2023.1'
lapack:
detection method: pkgconfig
found: true
include directory:
lib directory:
name: mkl-sdl
openblas configuration: unknown
pc file directory:
version: '2023.1'
pybind11:
detection method: pkgconfig
include directory:
name: pybind11
version: 2.10.4
Compilers:
c:
commands: /croot/scipy_1696543286448/_build_env/bin/x86_64-conda-linux-gnu-cc
linker: ld.bfd
name: gcc
version: 11.2.0
c++:
commands: /croot/scipy_1696543286448/_build_env/bin/x86_64-conda-linux-gnu-c++
linker: ld.bfd
name: gcc
version: 11.2.0
cython:
commands: cython
linker: cython
name: cython
version: 0.29.36
fortran:
commands: /croot/scipy_1696543286448/_build_env/bin/x86_64-conda-linux-gnu-gfortran
linker: ld.bfd
name: gcc
version: 11.2.0
pythran:
include directory: ../../_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.11/site-packages/pythran
version: 0.13.1
Machine Information:
build:
cpu: x86_64
endian: little
family: x86_64
system: linux
cross-compiled: false
host:
cpu: x86_64
endian: little
family: x86_64
system: linux
Python Information:
path:
version: '3.11'
Also, the same behaviour (NaNs in the output) is observed if one of the input vectors is empty. That is not the same as some values missing/being NaN, yet the output is identical in both cases: stat=nan and p=nan.
The same behavior is observed in most stats reducing functions, all NumPy reducing functions like np.sum, all (to my knowledge) elementwise functions in scipy.special and NumPy, and much more, so this is expected behavior, not a defect. If you want to make a case that NaNs in input should produce warnings for all similar functions, I'd suggest contacting the mailing list.
Thank you for the suggestion, I will do so.
Note that there is a nan_policy='raise' option if you want NaNs to raise errors, and nan_policy='omit' can be thought of as removing NaNs from the input before performing the calculation. The default is nan_policy='propagate', which returns the result that would be produced if NaNs were to propagate through the calculation (usually NaN, as you observed).
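For illustration, the three policies on a small version of the example from this report (behavior as observed with SciPy 1.11):

```python
import numpy as np
from scipy.stats import kruskal

group1 = np.array([20, 21, 19, 22, np.nan])
group2 = np.array([22, 24, 25, 23, 30])

# Default 'propagate': a NaN in the input yields a NaN statistic and p-value
stat, p = kruskal(group1, group2)
print(stat, p)  # nan nan

# 'omit': NaNs are dropped before the test is performed
stat, p = kruskal(group1, group2, nan_policy='omit')
print(stat, p)  # finite values

# 'raise': NaNs in the input raise a ValueError
try:
    kruskal(group1, group2, nan_policy='raise')
except ValueError as exc:
    print(exc)
```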
Interesting, why is nan_policy not raise by default? Seems like that's much safer behavior. And could it be adjusted through a deprecation cycle? (I'd argue that, in the context of SciPy, raising on NaNs makes a lot more sense than in NumPy, since here you know what the user is trying to compute.)
Interesting, why is nan_policy not raise by default?
Many stats functions have behaved that way since the beginning because propagating is what NaNs do naturally, and since then, the default behavior has not been changed.
And could be adjusted through a deprecation cycle?
Technically, yes. The default nan_policy could immediately change from 'propagate' to unspecified (like np._NoValue or None). Unspecified nan_policy with NaNs present would cause an appropriate deprecation warning to be emitted. The warning would state that during the deprecation period, nan_policy must be specified explicitly to silence the warning, and that after the deprecation period the default behavior would change. From there, the code would work just like 'propagate' for two releases. In the third release, the warning would go away and the default would change.
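A minimal sketch of that mechanism, layered over scipy.stats.kruskal rather than modifying SciPy itself (the sentinel object and the warning text are illustrative, not actual SciPy code):

```python
import warnings

import numpy as np
from scipy import stats

_NoValue = object()  # stand-in for a sentinel like np._NoValue (illustrative)

def kruskal(*samples, nan_policy=_NoValue):
    # Hypothetical sketch of the deprecation mechanism described above;
    # not actual SciPy code.
    has_nan = any(np.isnan(np.asarray(s, dtype=float)).any() for s in samples)
    if nan_policy is _NoValue:
        if has_nan:
            warnings.warn(
                "The input contains NaN and nan_policy was not specified; "
                "pass nan_policy explicitly to silence this warning. The "
                "default behavior will change in a future release.",
                DeprecationWarning, stacklevel=2)
        nan_policy = 'propagate'  # deprecation-period behavior matches today's
    return stats.kruskal(*samples, nan_policy=nan_policy)
```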
Technically, it would not be that hard on our side, because essentially all functions that have any NaN handling go through the same _contains_nan function. Changing all the tests would be a little more work. I suspect that such a PR would be similar to gh-20694 in scope.
Other options on SciPy's side include:
- waiting for SciPy 2.0
- making a new nan_policy='warn' ('propagate' plus warning) the default immediately
- adding an environment variable that sets the default nan_policy on one's machine
- documenting explicitly that there are two common causes of NaNs in the output of stats functions: NaNs in the input or samples that are too small. NaNs do not mean that the p-value is 0 or 1.
The much harder part would be getting enough support for one of these changes, except maybe the last option. There hasn't been much participation here or on the mailing list (now discussion forum) over the past 20 years, it seems. So maybe this is something that can be adjusted locally, with a function that accepts the stats namespace and returns an object that mimics the namespace except that it changes the default nan_policy of its functions. I had ChatGPT whip up such a function, and it worked in limited testing. (I'd rather not have it posted on SciPy's repo, though, since I don't think we have even discussed a policy w.r.t. code from generative AI.)
import numpy as np
from scipy import stats as _stats  # rename to avoid a suspected late-binding gotcha

# wrap_module_with_nan_policy: the helper described above (its definition
# was not posted in the thread)
stats = wrap_module_with_nan_policy(_stats)

stats.kruskal([1, 2, 3], [4, 5, 6], [np.nan, 8, 9])
# ValueError: The input contains nan values

stats.kruskal([1, 2, 3], [4, 5, 6], [np.nan, 8, 9], nan_policy='propagate')
# KruskalResult(statistic=nan, pvalue=nan)
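The wrapper itself was not posted in the thread. For reference, a minimal hypothetical sketch of what such a helper could look like (the name wrap_module_with_nan_policy and the signature-inspection approach are assumptions, not the actual code from the thread):

```python
import functools
import inspect
import types

import numpy as np
from scipy import stats as _stats

def wrap_module_with_nan_policy(module, default='raise'):
    # Hypothetical helper: mirror `module` in a namespace where every
    # function that accepts a nan_policy keyword defaults it to `default`.
    wrapped = types.SimpleNamespace()
    for name in dir(module):
        obj = getattr(module, name)
        try:
            params = inspect.signature(obj).parameters if callable(obj) else {}
        except (TypeError, ValueError):
            params = {}
        if 'nan_policy' in params:
            @functools.wraps(obj)
            def func(*args, _func=obj, **kwargs):  # bind obj now, not late
                kwargs.setdefault('nan_policy', default)
                return _func(*args, **kwargs)
            setattr(wrapped, name, func)
        else:
            setattr(wrapped, name, obj)  # pass everything else through as-is
    return wrapped

stats = wrap_module_with_nan_policy(_stats)
```

Explicitly passing nan_policy still overrides the injected default, since kwargs.setdefault only fills it in when absent.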
Could you try that out and see if it meets your needs?