CovarianceEstimation.jl

Dealing with missing values

Open tlienart opened this issue 6 years ago • 4 comments

Probably something for a later point:

julia> using Statistics
julia> X = AbstractArray{Union{Float64, Missing}, 2}(randn(5, 7))
julia> X[1, 2] = missing
julia> X[3, 5] = missing
julia> cov(X)
7×7 Array{Union{Missing, Float64},2}:
  0.323781   missing  -0.235777   0.0266937  missing   0.460899   0.345166
   missing   missing    missing    missing   missing    missing    missing
 -0.235777   missing   1.44032   -1.2644     missing   0.39682   -0.442537
  0.0266937  missing  -1.2644     1.69334    missing  -0.367602  -0.374397
   missing   missing    missing    missing   missing    missing    missing
  0.460899   missing   0.39682   -0.367602   missing   1.74075    0.614322
  0.345166   missing  -0.442537  -0.374397   missing   0.614322   2.00857 

I don't think that's ideal (the result is the same with both Statistics and StatsBase). See also the covrob R package, where a function to filter missing values can be provided.

It would seem pretty easy to implement at least two options:

  • fail if there are missing values
  • omit observations (rows) that contain missing values
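The two options above could be sketched roughly as follows. This is only an illustration: `cov_missing` and its `handling` keyword are hypothetical names, not part of CovarianceEstimation.jl, and the sketch assumes observations are stored in rows, as in the example above.

```julia
using Statistics

# Hypothetical helper (not part of CovarianceEstimation.jl).
# handling = :fail throws if X contains any missing value;
# handling = :omit drops every row (observation) with a missing value.
function cov_missing(X::AbstractMatrix; handling::Symbol = :fail)
    if handling === :fail
        any(ismissing, X) && throw(ArgumentError("X contains missing values"))
        return cov(Matrix{Float64}(X))
    elseif handling === :omit
        # Boolean mask of rows without any missing entry
        complete = [!any(ismissing, row) for row in eachrow(X)]
        return cov(Matrix{Float64}(X[complete, :]))
    else
        throw(ArgumentError("unknown handling: $handling"))
    end
end
```

With `handling = :omit`, the result is an ordinary `Matrix{Float64}` computed from the complete rows only, so a few scattered missing entries can discard a large share of the data.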

We could then also suggest imputation, for example via Impute.jl.
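To make the imputation idea concrete, here is a hand-written column-mean imputation in plain Julia. It deliberately does not use the Impute.jl API (which is not shown in this thread); `impute_column_mean` is a hypothetical name, and mean imputation is just the simplest stand-in for whatever method one would actually pick.

```julia
using Statistics

# Hypothetical helper: replace each missing entry with the mean of the
# observed values in the same column. Columns that are entirely missing
# are not handled in this sketch.
function impute_column_mean(X::AbstractMatrix{<:Union{Missing, Real}})
    Y = Matrix{Float64}(undef, size(X))
    for j in axes(X, 2)
        col = X[:, j]
        μ = mean(skipmissing(col))     # mean over observed entries only
        Y[:, j] = coalesce.(col, μ)    # fill the gaps with that mean
    end
    return Y
end
```

The result can be passed to `cov` directly. Note that mean imputation shrinks the estimated variances, since the filled-in values sit exactly at the column mean.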

refs

  • https://arxiv.org/pdf/1201.2577.pdf
  • https://icml.cc/2012/papers/313.pdf

tlienart • Jan 10 '19 03:01

There are also algorithms designed specifically to deal with missing data, for example: https://arxiv.org/pdf/1201.2577.pdf .

mateuszbaran • Jan 10 '19 22:01

Ok so that's a Lasso-type problem on a slightly modified observed covariance (eq (1.5)). I guess that can be added once we've added a (Graphical) Lasso estimator for the covariance.

tlienart • Jan 10 '19 23:01

Consider exporting a shrinkage method that relies only on the matrix S, not on the underlying matrix of samples, X (I note that analytical_nonlinear_shrinkage appears to use only S, and not X). The motivation is that stock data typically has missing samples, so the matrix X cannot be fully constructed. Instead, pairwise covariances can be calculated to form the elements of a matrix, T (though T is not guaranteed to be positive semidefinite, as its elements are computed on inconsistent data sets).
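A minimal sketch of how such a matrix T could be assembled, assuming observations are in rows and each entry (i, j) uses only the rows where both columns are observed. `pairwise_cov` is a hypothetical name, not an existing function of the package.

```julia
using Statistics

# Hypothetical helper: pairwise-complete covariance. Entry (i, j) is the
# covariance of columns i and j over the rows where both are observed.
function pairwise_cov(X::AbstractMatrix{<:Union{Missing, Real}})
    p = size(X, 2)
    T = Matrix{Float64}(undef, p, p)
    for j in 1:p, i in 1:j
        keep = [!ismissing(X[k, i]) && !ismissing(X[k, j]) for k in axes(X, 1)]
        xi = Float64.(X[keep, i])
        xj = Float64.(X[keep, j])
        T[i, j] = T[j, i] = cov(xi, xj)
    end
    return T
end
```

Because each entry may be based on a different subset of rows, the resulting T is symmetric but can have negative eigenvalues, which is the problem described above.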

Then, consider adding the method described here: https://nhigham.com/2013/02/13/the-nearest-correlation-matrix/ (there is already sample code in Matlab/R/Python). With it, T can be "converted" to a positive semidefinite matrix, S, which can then be fed into analytical_nonlinear_shrinkage.
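For illustration, here is a rough Julia sketch of the alternating-projections idea with Dykstra's correction for the nearest correlation matrix. It is not a port of the linked sample code; the function names, the stopping rule, and the rescaling step that turns a covariance matrix into a correlation matrix and back are all assumptions made for this sketch.

```julia
using LinearAlgebra

# Project onto the cone of positive semidefinite matrices by clipping
# negative eigenvalues to zero.
function project_psd(A::AbstractMatrix)
    F = eigen(Symmetric(Matrix(A)))
    return F.vectors * Diagonal(max.(F.values, 0.0)) * F.vectors'
end

# Alternate between the PSD cone and the set of unit-diagonal matrices,
# carrying Dykstra's correction ΔS for the PSD projection.
function nearest_correlation(A::AbstractMatrix; tol = 1e-8, maxiter = 1000)
    Y = Matrix{Float64}((A + A') / 2)
    ΔS = zeros(size(Y))
    for _ in 1:maxiter
        R = Y - ΔS
        X = project_psd(R)
        ΔS = X - R
        Y = copy(X)
        Y[diagind(Y)] .= 1.0                      # unit-diagonal projection
        norm(Y - X) <= tol * norm(Y) && break     # assumed stopping rule
    end
    return (Y + Y') / 2
end

# Hypothetical wrapper: rescale a pairwise covariance T to a correlation
# matrix, repair it, and scale back. Assumes a strictly positive diagonal.
function repair_covariance(T::AbstractMatrix; kwargs...)
    σ = sqrt.(diag(T))
    C = nearest_correlation(T ./ (σ * σ'); kwargs...)
    return C .* (σ * σ')
end
```

The returned matrix has an exact unit diagonal (or the original variances, after `repair_covariance`) and is positive semidefinite only up to the tolerance, so a downstream estimator that needs strict positive semidefiniteness may still want a final eigenvalue clip.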

rumela • Sep 05 '20 18:09

This looks like a good approach; I could review and merge a pull request that adds it. I don't personally need this functionality at the moment, so I'm not going to work on it myself.

mateuszbaran • Sep 06 '20 09:09