posthoc_tukey yields p-values of 0.900 and 0.001, which also differ from scipy.stats.tukey_hsd
First, thank you very much for this very useful package.
Describe the bug
scikit_posthocs.posthoc_tukey gives unexpected p-values of '0.001' and '0.900' with one dataset (see below).
scipy.stats.tukey_hsd gives similar but not identical numbers.
Note that the groups do not all have the same n.
Dataset: see the code below.
To Reproduce
# bug report
import scikit_posthocs as sp
import scipy.stats
import pandas as pd

data_a = [0.08331362, 0.22462052, 0.44619224, 0.34004518, 0.03146107,
          0.15828442, 0.27876282, 0.14699693, 0.3870986, 0.33669976,
          0.38822324, 0.28127964, 0.04101782, 0.31787209, 0.20165472,
          0.40043812, 0.50580976, 0.20009951]
data_b = [0.14693014, 0.0055596, 0.19977264, -0.30859794, 0.017286,
          0.05342739, -0.09502465, 0.01998256, 0.06162499, 0.18634389,
          0.34667326, 0.06702727, 0.14268381, 0.13141426, 0.06344518,
          0.04185783, 0.18701589, -0.06134188, 0.02844774]
data_c = [0.14727163, 0.10290732, -0.09934048, 0.06231107, -0.06754609,
          0.04739071, 0.19232889, 0.03198218, 0.11590822, 0.08816257,
          0.05692482, 0.04922897, -0.06524353, 0.08966288, 0.12975986,
          -0.08346692, 0.02827149, 0.15724036, 0.05327535]

all_data = [data_a, data_b, data_c]
labels = ['a', 'b', 'c']

# build a long-format DataFrame with one 'Group'/'Data' row per observation
df = pd.DataFrame()
for i in range(3):
    cur_dict = {'Group': [labels[i]] * len(all_data[i]),
                'Data': all_data[i]}
    cur_df = pd.DataFrame(cur_dict)
    df = pd.concat([cur_df, df], ignore_index=True)
df.reset_index()  # note: result not assigned, so this line has no effect

print(sp.posthoc_tukey(df, val_col='Data', group_col='Group'))
This yields:
       c      b      a
c  1.000  0.900  0.001
b  0.900  1.000  0.001
a  0.001  0.001  1.000
These seem like very unlikely values, right?
In contrast, the scipy function yields:
print(scipy.stats.tukey_hsd(data_a, data_b, data_c))
Tukey's HSD Pairwise Group Comparisons (95.0% Confidence Interval)
Comparison Statistic p-value Lower CI Upper CI
(0 - 1) 0.200 0.000 0.103 0.297
(0 - 2) 0.210 0.000 0.114 0.307
(1 - 0) -0.200 0.000 -0.297 -0.103
(1 - 2) 0.010 0.963 -0.085 0.106
(2 - 0) -0.210 0.000 -0.307 -0.114
(2 - 1) -0.010 0.963 -0.106 0.085
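As an aside, the 0.000 entries in this printout appear to be rounded for display; if it helps, the unrounded values can be read off the result object returned by scipy.stats.tukey_hsd. A minimal sketch, continuing from the reproduction code above and using the documented pvalue and statistic attributes:

# inspect the full-precision results instead of the rounded string output
res = scipy.stats.tukey_hsd(data_a, data_b, data_c)
print(res.pvalue)     # 3x3 array of pairwise p-values, not rounded
print(res.statistic)  # 3x3 array of pairwise mean differences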
Expected behavior
I am not sure what the correct result is, but it seems unlikely that the p-value is exactly '0.001' for two of the comparisons.
Also, it is unclear why the Tukey test in scikit-posthocs gives a different result than the scipy version.
System and package information:
- OS: Windows 10 Pro
- Package versions:
  - scikit-posthocs 0.7.0 pyhd8ed1ab_0 conda-forge
  - scipy 1.10.1 py310h309d312_1
Additional context
Other datasets give different results with more plausible p-values, such as:
          a         b         c
a  1.000000  0.409612  0.001678
b  0.409612  1.000000  0.053077
c  0.001678  0.053077  1.000000
Hi! Thank you for reporting this. You should be using the posthoc_tukey_hsd function instead of posthoc_tukey. That said, I do not like how either of them is implemented.
SciPy's function was added much later than ours; now we can reimplement the pairwise Tukey test based on it.
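For reference, a minimal sketch of the suggested call on the long-format DataFrame built in the reproduction code; this assumes posthoc_tukey_hsd takes a value array and a group-label array, and the exact format of the returned matrix may differ between scikit-posthocs versions:

# sketch of the suggested call; `df` is the DataFrame from the reproduction code above
result = sp.posthoc_tukey_hsd(df['Data'], df['Group'])
print(result)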