rm_corr error LinAlgError: SVD did not converge

Open shoffm opened this issue 1 year ago • 7 comments

Hello,

I am trying to run rm_corr on multiple columns of a dataframe (gene expression data). The function works well on many column pairs, and the results match the expected output from the R rmcorr function, but one pair of columns throws the error LinAlgError: SVD did not converge. The same pair runs without trouble in the R rmcorr implementation. I am therefore curious what the difference between the two implementations is, and whether it is possible to get this to converge in Python. I would prefer to keep using your implementation, as it seems to be much faster. (I am in the midst of benchmarking how both implementations scale, so as a side note, if you have any data on that I would very much appreciate it!)

I am using Pingouin v0.5.5. I have attached a minimal dataframe to recreate my error, along with the code below:

import pandas as pd
import pingouin as pg

# Load the attached CSV (df_pingouin_fail.csv) as a dataframe
dataframe = pd.read_csv("df_pingouin_fail.csv")
pg.rm_corr(data=dataframe, x="Gene1", y="Gene2", subject="Subject")

Dataframe: df_pingouin_fail.csv

Thanks so much for your help! Best, Sophie

shoffm avatar Sep 10 '24 15:09 shoffm

Hi Sophie,

Thanks for opening the issue. The Pingouin implementation is based on an ANCOVA model that is implemented with statsmodels.
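For readers comparing implementations: the repeated-measures correlation obtained from such an ANCOVA (a common slope plus one intercept per subject) is algebraically the Pearson correlation of x and y after subtracting each subject's mean from that subject's observations. The sketch below computes it that way with NumPy only, so it can serve as a cross-check that does not go through a pseudo-inverse. It is not Pingouin's code path; the function name rm_corr_centered is hypothetical, and the sketch assumes there are no missing values.

```python
import numpy as np


def rm_corr_centered(x, y, subject):
    """Repeated-measures correlation via within-subject centering.

    Hypothetical helper, not part of Pingouin. Assumes no NaN values.
    Returns (r, dof) where dof = n_observations - n_subjects - 1.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Map each subject label to an integer code 0..k-1
    _, codes = np.unique(np.asarray(subject), return_inverse=True)
    codes = codes.ravel()
    counts = np.bincount(codes)
    # Subtract each subject's mean from that subject's observations
    xc = x - (np.bincount(codes, weights=x) / counts)[codes]
    yc = y - (np.bincount(codes, weights=y) / counts)[codes]
    # Pearson correlation of the centered values
    r = (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))
    dof = x.size - counts.size - 1
    return r, dof
```

If Pingouin and this sketch are run on the same complete-case data, the two r values should agree up to floating-point error; a large disagreement would point at the linear-algebra step rather than at the data.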

I am not able to reproduce the error on my machine:

image

I am using Python 3.9 with statsmodels 0.14. What versions of Python, pingouin, and statsmodels are you using?

Second, I noticed that your dataset includes many subjects with only 1 or 2 observations. Do you still get the error if you remove these participants from the dataset?

image
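One way to drop subjects with only 1 or 2 observations before calling rm_corr is sketched below. The helper name drop_sparse_subjects and the min_obs parameter are hypothetical; the column name "Subject" is taken from the code in the original post.

```python
import pandas as pd


def drop_sparse_subjects(df, subject="Subject", min_obs=3):
    """Keep only rows whose subject has at least `min_obs` observations."""
    # Number of rows per subject, aligned back onto every row of df
    n_obs = df[subject].map(df[subject].value_counts())
    return df[n_obs >= min_obs]
```

The filtered frame can then be passed as the data argument in place of the full dataframe.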

Thanks, Raphael

raphaelvallat avatar Sep 14 '24 12:09 raphaelvallat

Hi Raphael,

Thanks so much for your reply. I am using the following versions (on Linux):

statsmodels 0.14.0, Python 3.9.0, pingouin 0.5.5

Which version of pingouin are you using when you don't reproduce the error?

Thanks so much! Sophie

shoffm avatar Sep 16 '24 10:09 shoffm

Hi,

I am using pingouin 0.5.5 (on Mac), pandas 2.2.2, numpy 1.26.4, statsmodels 0.14.0

raphaelvallat avatar Sep 17 '24 09:09 raphaelvallat

Hi Raphael, I was wondering if the LAPACK dependency of numpy could cause the issue. Would you mind running numpy.show_config() and sharing the results? Many thanks, Eric
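For anyone else comparing environments in this thread, a small sketch that collects the relevant versions and then prints NumPy's build configuration. The helper name report_versions is hypothetical; it relies only on the standard library plus numpy.show_config().

```python
import importlib.metadata as md
import platform


def report_versions(packages=("pingouin", "statsmodels", "pandas", "numpy")):
    """Return {name: version} for Python and each package (None if not installed)."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = md.version(name)
        except md.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version}")
    try:
        import numpy as np

        # Prints the BLAS/LAPACK libraries NumPy was built against
        np.show_config()
    except ImportError:
        print("numpy is not installed")
```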

Eric-Kobayashi avatar Sep 17 '24 16:09 Eric-Kobayashi

Sure thing @Eric-Kobayashi:

Build Dependencies:
  blas:
    detection method: pkgconfig
    found: true
    include directory: /usr/local/include
    lib directory: /usr/local/lib
    name: openblas64
    openblas configuration: USE_64BITINT=1 DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS=
      NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= SANDYBRIDGE MAX_THREADS=3
    pc file directory: /usr/local/lib/pkgconfig
    version: 0.3.23.dev
  lapack:
    detection method: internal
    found: true
    include directory: unknown
    lib directory: unknown
    name: dep4548835888
    openblas configuration: unknown
    pc file directory: unknown
    version: 1.26.4
Compilers:
  c:
    args: -fno-strict-aliasing
    commands: clang
    linker: ld64
    linker args: -fno-strict-aliasing
    name: clang
    version: 14.0.0
  c++:
    commands: clang++
    linker: ld64
    name: clang
    version: 14.0.0
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.8
Machine Information:
  build:
    cpu: x86_64
    endian: little
    family: x86_64
    system: darwin
  host:
    cpu: x86_64
    endian: little
    family: x86_64
    system: darwin
Python Information:
  path: /private/var/folders/kx/gw6dssyn19d9qjs9mvh4hkz80000gn/T/cibw-run-5dj358b1/cp39-macosx_x86_64/build/venv/bin/python
  version: '3.9'
SIMD Extensions:
  baseline:
  - SSE
  - SSE2
  - SSE3
  found:
  - SSSE3
  - SSE41
  - POPCNT
  - SSE42
  - AVX
  - F16C
  - FMA3
  - AVX2
  not found:
  - AVX512F
  - AVX512CD
  - AVX512_KNL
  - AVX512_SKX
  - AVX512_CLX
  - AVX512_CNL
  - AVX512_ICL

raphaelvallat avatar Sep 17 '24 18:09 raphaelvallat

Hi Raphael,

Thanks for providing this information. It turns out not to be a version issue; it might instead be a deeper problem with the numpy.linalg.pinv function.

I was able to run the rm_corr function successfully after shuffling the dataframe. I simulated reshuffling many times and found that around 8.12% of runs fail to converge. Would you mind running the same test to see whether you can replicate the issue?

pg.rm_corr(data = dataframe.sample(len(dataframe)), x = "Gene1", y = "Gene2", subject = "Subject")
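The repeated-reshuffling experiment described above can be sketched roughly as follows. This is not the exact script used for the reported figure; the helper name shuffle_failure_rate and its parameters are hypothetical, and the function to test is passed in as a callable so the loop itself does not depend on Pingouin.

```python
import numpy as np
import pandas as pd


def shuffle_failure_rate(df, fit, n_runs=100, seed=0):
    """Fraction of row-shuffled copies of `df` on which `fit` raises LinAlgError."""
    failures = 0
    for i in range(n_runs):
        # frac=1 returns every row once, in a new random order
        shuffled = df.sample(frac=1, random_state=seed + i)
        try:
            fit(shuffled)
        except np.linalg.LinAlgError:
            failures += 1
    return failures / n_runs
```

With Pingouin installed, fit could be lambda d: pg.rm_corr(data=d, x="Gene1", y="Gene2", subject="Subject"). Seeding each shuffle makes a failing row order reproducible, which helps when reporting the problem upstream.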

Eric-Kobayashi avatar Sep 26 '24 13:09 Eric-Kobayashi

Hmm, very strange behavior indeed. I can replicate the error: 5 out of 100 runs of the function on resampled data failed (5% failure rate).

raphaelvallat avatar Oct 04 '24 07:10 raphaelvallat