LinRegOutliers icon indicating copy to clipboard operation
LinRegOutliers copied to clipboard

Deepest Regression (DR) estimator

Open jbytecode opened this issue 5 years ago • 18 comments

It is good to have deepest regression estimator referenced with

Van Aelst, Stefan, et al. "The deepest regression method." Journal of Multivariate Analysis 81.1 (2002): 138-166.

in the package. Any contributions are welcome.

jbytecode avatar Sep 25 '20 19:09 jbytecode

Since the CATLINE estimator is for the linear model with a single predictor, it is not generalized for the multiple case, but it would be fantastic to have it is implemented in the package. The reference of the estimator is

Hubert, Mia, and Peter J. Rousseeuw. "The catline for deep regression." Journal of Multivariate Analysis 66.2 (1998): 270-296.

Any contributions are welcome.

jbytecode avatar Sep 25 '20 19:09 jbytecode

It is good to have deepest regression estimator referenced with

Van Aelst, Stefan, et al. "The deepest regression method." Journal of Multivariate Analysis 81.1 (2002): 138-166.

in the package. Any contributions are welcome.

Rousseeuw, Peter J., and Stefan Van Aelst. "An algorithm for deepest multiple regression." COMPSTAT. Physica, Heidelberg, 2000.

jbytecode avatar Sep 25 '20 19:09 jbytecode

R already have this functionality and can be used as a reference: https://cran.r-project.org/web/packages/DepthProc/DepthProc.pdf

jbytecode avatar Sep 25 '20 20:09 jbytecode

I @jbytecode, I would like to work on this topic

fmyilmaz avatar Oct 12 '20 16:10 fmyilmaz

Okay @fmyilmaz, very well, so what is your plan? Is it possible to write pure julia code without any dependency of R or C implementations?

jbytecode avatar Oct 12 '20 16:10 jbytecode

The R version of it contains a lot of Graph functions. So we should plan how to implement them using pure julia.

fmyilmaz avatar Oct 12 '20 16:10 fmyilmaz

Deepest regression (DR) have multiple methods implemented. The catline is only for single exploratory variable and it is not that necessary. As I remember, there is an exact algorithm for 2 exploratory variables. There is also a stochastic one for all dimensions. The methods in the corresponding R package calls C and Fortran code at the backend. Since DR algorithms are highly geometric ones, it is hard to implement them.

Lets start from basic:

  1. Calculating regression depth for any regression hyperplanes.
  2. Search regression estimates that maximizes it.

When the first goal is achieved, the second one is just an optimization problem and relatively easy.

If we got a function like

rdept(setting::RegressionSetting) = ...

then we can use a genetic algorithm or any other optimizer to search \hat{\beta_i} for i = 1, 2, ..., p

@fmyilmaz

jbytecode avatar Oct 12 '20 16:10 jbytecode

Thanks! ı will contact with you to select which opt should we use.

fmyilmaz avatar Oct 12 '20 16:10 fmyilmaz

@tantei3 can I consult your opinions on a subject:

Since the regression depth and the deepest regression estimators are difficult to implement and there is a vast amount of legacy code around, today, I examined the Fortran code of the Deepest Regression estimator in the R package mrfDepth hosted in the read-only repository here. I compiled the Fortran codes shared in the src directory using

$ gfortran -shared -fPIC *.f

in the Mac Os terminal. Supposing the library is a.out, it can be callable in Julia using

X = convert(Matrix, stackloss) 
n, p = size(X)
n = Int32(n)
p = Int32(p)
A = zeros(Float64, p)
maxit = Int32(10000)
iter = Int32(1)
MDEPAPPR = Int32(3)
result = ccall((:sweepmedres_, "./a.out"), 
    Cint, 
    (Ref{Float64},      # X
        Ref{Int32},           # n
        Ref{Int32},           # np
        Ref{Float64},   # A
        Ref{Cint},           # maxit 
        Ref{Cint},           # iter
        Ref{Cint}            # MDEPAPPR
        ), X, n, p, A, maxit, iter, MDEPAPPR)

println(A)

the data stackloss is the data set in our package. The output is

[0.8252212389746946, 0.44247787604338223, -0.0796460177080886, -35.37610619462]

and the same output is obtained in the R as

R> rdepthmedian(maxit = 10000, x = stackloss)
$deepest
   intercept slope var. 1 slope var. 2 slope var. 3 
-35.37610619   0.82522124   0.44247788  -0.07964602 

except the intercept is the last term in Julia output whereas it is the first term in R. This is not a problem.

Finally, it seems the Fortran code is directly useful, however the problem is to

  • how to pack and integrate this stuff in the package LinRegOutliers
  • if compiled code is required, what is your opinion for this because it should be compiled for at least three systems Windows, Linux and MacOs

Since Julia package manager does not compile C or Fortran code, precompiled ones must be supported by the package maintainers.

What do you think about this?

jbytecode avatar Oct 16 '20 10:10 jbytecode

the package GLMNet is using fortran code. Here is the link. They put the fortran code in directory /deps and just ccall.

jbytecode avatar Oct 16 '20 10:10 jbytecode

I don't have prior experience with this, but to me it looks like https://julialang.github.io/Pkg.jl/v1/artifacts/ page mentions about downloading packages and binaries. So maybe in the CI during a release, it can also generate the Fortran/C static/dynamic libraries for the different supported Platforms and host it on Github, and during the package install, depending on the platform download the corresponding package and setup?

But I think it will get complicated building all Julia supported platforms with different architectures like Arm, x86, along with different OS. So if the amount of code is not high enough, would be better to port to Julia slowly in my view.

tantei3 avatar Oct 16 '20 10:10 tantei3

Actually GLMNet it looks like you need to generate the glmnet_jll binary package using Yggdrasil BinaryBuilder.jl I think. Here is the corresponding link to the build recipe https://github.com/JuliaPackaging/Yggdrasil/blob/master/G/glmnet/build_tarballs.jl

tantei3 avatar Oct 16 '20 11:10 tantei3

yes, it is too much complicated and translating the code is the most secure method. The most important file is the

RegresDepthDeepest.f

and it depends many other fortran files.

jbytecode avatar Oct 16 '20 11:10 jbytecode

Hi @tantei3 ,

I found a bucket of code in Julia's GitHub repository, in file here:

The code

import Base.llvmcall

function f(x, y)
    llvmcall("""%3 = add i64 %1, %0
                    ret i64 %3""",
            Int64,
            Tuple{Int64,Int64},
            x,
            y)
end


result = f(5, 6)
println(result)

perfectly produces the number 11. Here

%3 = add i64 %1, %0
ret i64 %3

is the LLVM IR code. What do you think about shipping IR code for required Fortran functions using a LLVM Fortran compiler such as FLang or others?

jbytecode avatar Oct 21 '20 17:10 jbytecode

I think if it is done correctly then it should be portable and robust since Julia should take care of handling LLVM IR since it is also built on top of LLVM. Integrating Julia + LLVM IR might take some effort since all the fortran functions would need to be translated to LLVM IR and then wrapped in Julia function. Manually doing it would seem painful for Fortran code with many functions. If there is some tool to generate Julia wrapper functions from LLVM IR, it might be best.

tantei3 avatar Oct 22 '20 04:10 tantei3

@tantei3 how about shipping the fortran code within the package, trying to compile them in the installation process or before loading first time? If the compiler is absent, when the user calls dr() we can print a message "You have not Fortran compiler installed, If you want to make this function active, please install a fortran compiler and then run the installbinaries() function" ?

jbytecode avatar Oct 22 '20 10:10 jbytecode

Somebody suggested JuliaPackaging/Binary Builder on Twitter. Link is here

jbytecode avatar Oct 22 '20 10:10 jbytecode

Requiring Fortran compiler in the use's machine doesn't seem nice to run Julia code. I think the BinaryBuilder method might be easiest at the moment (although still don't know the details). It should take care of making the binary available compatible with all platforms and just need to include the binary as a dependency.

tantei3 avatar Oct 22 '20 12:10 tantei3

The Fortran code in that R package is compiled for all targets: https://github.com/JuliaPackaging/Yggdrasil/issues/7224#event-10161803077

Now mrfDepth_jll is ready for possible implementation of deepestregression().

jbytecode avatar Aug 22 '23 19:08 jbytecode