scikit-dimension
scikit-dimension copied to clipboard
Incorrect implementation of Participation Ratio
I just noticed what I think might be a problem with your implementation of Participation Ratio!
Suppose one is given a matrix X
with shape (num samples, num features)
. The PR is given by square(sum(eigenvalues)) / sum(square(eigenvalues))
of X^T X
, but the implementation uses PCA
.
https://github.com/scikit-learn-contrib/scikit-dimension/blob/1ddee0f7d4bfdb9ee44ada956a5c7c55950248b0/skdim/id/_PCA.py#L149
This is incorrect because PCA
first demeans the data. I don't think that's correct.
I couldn't find the original reference of participation ratio, maybe you know it ? It seems usual practice to refer to PR of eigenvalues of the covariance matrix. But I can add an option to return eigenvalues without centering if you need it for your application, thanks for the suggestion!
I also double checked, and I'm not sure either. Regardless, an optional flag would be awesome, thank you!