CovarianceEstimation.jl
Dealing with missing values
Probably something to address at a later point:
julia> using Statistics
julia> X = Matrix{Union{Float64, Missing}}(randn(5, 7))
julia> X[1, 2] = missing
julia> X[3, 5] = missing
julia> cov(X)
7×7 Array{Union{Missing, Float64},2}:
0.323781 missing -0.235777 0.0266937 missing 0.460899 0.345166
missing missing missing missing missing missing missing
-0.235777 missing 1.44032 -1.2644 missing 0.39682 -0.442537
0.0266937 missing -1.2644 1.69334 missing -0.367602 -0.374397
missing missing missing missing missing missing missing
0.460899 missing 0.39682 -0.367602 missing 1.74075 0.614322
0.345166 missing -0.442537 -0.374397 missing 0.614322 2.00857
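I don't think that's ideal (the behaviour is the same with both Statistics and StatsBase): a single missing entry turns the entire row and column for that variable into missing. See also the covrob R package, where a function to filter missing values can be provided.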
It would seem pretty easy to implement at least two options:
- fail if there are missing values
- omit if there are missing values (remove the corresponding observations)
We could then also suggest imputing the missing values, for instance via Impute.jl.
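The "fail" and "omit" options above could look roughly like the following. This is only a minimal sketch: `cov_missing` and its `handling` keyword are hypothetical names, not part of CovarianceEstimation.jl, and it assumes that rows of `X` are observations and columns are variables.

```julia
using Statistics

# Hypothetical helper, not part of CovarianceEstimation.jl.
# Assumes rows of X are observations and columns are variables.
function cov_missing(X::AbstractMatrix; handling::Symbol = :fail)
    # Flag every observation (row) that contains at least one missing entry.
    incomplete = [any(ismissing, row) for row in eachrow(X)]
    if handling == :fail
        any(incomplete) && throw(ArgumentError("X contains missing values"))
        Xc = X
    elseif handling == :omit
        # Drop the incomplete observations.
        Xc = X[.!incomplete, :]
    else
        throw(ArgumentError("handling must be :fail or :omit"))
    end
    # Narrow the element type so that the result is a plain Float64 matrix.
    return cov(Matrix{Float64}(Xc))
end
```

With the `X` from the example at the top, `cov_missing(X; handling = :omit)` would use only observations 2, 4 and 5, while `cov_missing(X)` would throw. Omitting rows can discard a lot of data when missing entries are spread over many observations, which is where imputation becomes relevant.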
References:
- https://arxiv.org/pdf/1201.2577.pdf
- https://icml.cc/2012/papers/313.pdf
There are also algorithms designed specifically to deal with missing data, for example: https://arxiv.org/pdf/1201.2577.pdf .
OK, so that's a Lasso-type problem on a slightly modified observed covariance (eq. (1.5)). I guess that can be added once we've added a (Graphical) Lasso estimator for the covariance.
Consider exporting a shrinkage method that relies only on the covariance matrix S and not on the underlying matrix of samples X (I note that analytical_nonlinear_shrinkage appears to use only S, and not X). The motivation is that stock data typically has missing samples, so a complete matrix X cannot be constructed. Instead, pairwise covariances can be calculated to form the elements of a matrix T (though T is not guaranteed to be positive semidefinite, because its elements are computed from different subsets of the observations).
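To make the construction of T concrete, here is a rough sketch of a pairwise-complete covariance. `pairwise_cov` is a hypothetical name, not an existing function of the package, and the sketch assumes rows of `X` are observations and that every pair of columns shares at least two observed rows.

```julia
using Statistics

# Hypothetical helper: covariance from pairwise-complete observations.
# Assumes rows of X are observations and columns are variables.
function pairwise_cov(X::AbstractMatrix)
    p = size(X, 2)
    T = Matrix{Float64}(undef, p, p)
    for j in 1:p, i in 1:j
        # Keep only the rows where both column i and column j are observed.
        keep = (.!ismissing.(X[:, i])) .& (.!ismissing.(X[:, j]))
        count(keep) >= 2 ||
            throw(ArgumentError("columns $i and $j share fewer than two observations"))
        xi = Float64.(X[keep, i])
        xj = Float64.(X[keep, j])
        # Each entry may be computed from a different subset of rows.
        T[i, j] = T[j, i] = cov(xi, xj)
    end
    return T
end
```

When `X` has no missing entries this reduces to `cov(X)`. With missing entries the result is symmetric but, as noted above, it may have negative eigenvalues.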
Then, consider adding the method described here: https://nhigham.com/2013/02/13/the-nearest-correlation-matrix/ (sample code already exists in Matlab/R/Python). With it, T can be "converted" to a positive semidefinite matrix S, which can then be fed into analytical_nonlinear_shrinkage.
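For reference, a nearest-correlation-matrix routine based on alternating projections with Dykstra's correction might be sketched as follows. This is an illustration under assumptions, not a port of the sample code linked above: `project_psd` and `nearest_correlation` are hypothetical names, the input is assumed to be a symmetric approximate correlation matrix, and a covariance matrix T would first have to be rescaled to unit diagonal (and scaled back afterwards).

```julia
using LinearAlgebra

# Project a symmetric matrix onto the positive semidefinite cone
# by setting its negative eigenvalues to zero.
function project_psd(A::AbstractMatrix)
    F = eigen(Symmetric(A))
    B = F.vectors * Diagonal(max.(F.values, 0.0)) * F.vectors'
    return (B + B') / 2  # remove rounding asymmetry
end

# Hypothetical helper: alternate between the PSD cone and the set of
# unit-diagonal matrices, with Dykstra's correction on the PSD step.
function nearest_correlation(A::AbstractMatrix; tol = 1e-8, maxiter = 1000)
    Y = Matrix{Float64}((A + A') / 2)
    ΔS = zeros(size(Y))
    for _ in 1:maxiter
        R = Y - ΔS                       # apply Dykstra's correction
        X = project_psd(R)               # project onto the PSD cone
        ΔS = X - R                       # update the correction
        Ynew = copy(X)
        Ynew[diagind(Ynew)] .= 1.0       # project onto unit-diagonal matrices
        converged = norm(Ynew - Y) <= tol * max(norm(Ynew), 1.0)
        Y = Ynew
        converged && break
    end
    return Y
end
```

The returned matrix has an exact unit diagonal and is positive semidefinite only up to the tolerance, so a tiny negative eigenvalue can remain. Convergence of this scheme is linear, and a production implementation would probably want a faster method.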
This looks like a good approach, and I could review and merge a pull request that adds it. I don't personally need this functionality at the moment, so I'm not going to work on it myself.