FSharp.Stats

Goodness of fit functions for OrdinaryLeastSquares.Linear.Multivariable?

Open nhirschey opened this issue 3 years ago • 4 comments

Is your feature request related to a problem? Please describe.

I am following the goodness of fit quality tutorial. I want t-statistics and standard errors for coefficients from regressions with multiple independent variables. Are there functions to do this already?

The multivariate fit function (see https://fslab.org/FSharp.Stats/Fitting.html#Multivariable) has type x:Vector<float> -> float, but GoodnessOfFit.calculateSumOfSquares expects float -> float. There appears to be no "multivariable" version.

nhirschey, Apr 19 '21 19:04

Currently there is no implementation of calculateSumOfSquares for multivariate regression. I think either the function could be generalized to accept multi-dimensional input (calculateSumOfSquares (fitFunc: 'T -> float) (xData : 'U) (yData : 'T)), or a specialized function (e.g. calculateSumOfSquaresMultivariate) that accepts matrices and vectors could be added. The function name should perhaps be shortened.

While the first option may lead to an ambiguous signature that is difficult to interpret, the second option adds a very similar function to the module.
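The first option can be sketched in a few lines. This is a minimal, language-agnostic illustration in plain Python, not FSharp.Stats code: the name sum_of_squares and the returned tuple are hypothetical, and the only assumption is that the fit function accepts whatever a single element of the x data is (a float for one predictor, a sequence of floats for several).

```python
from typing import Callable, Sequence, Tuple, TypeVar

# One observation of the predictors: a float, or a sequence of floats.
X = TypeVar("X")


def sum_of_squares(
    fit_func: Callable[[X], float],
    x_data: Sequence[X],
    y_data: Sequence[float],
) -> Tuple[float, float, float]:
    """Hypothetical helper: return (regression, error, total) sums of squares.

    Nothing here depends on the shape of an x observation, only on
    fit_func being able to turn one observation into a prediction.
    """
    if len(x_data) != len(y_data):
        raise ValueError("x_data and y_data must have the same length")
    mean_y = sum(y_data) / len(y_data)
    predictions = [fit_func(x) for x in x_data]
    ss_regression = sum((p - mean_y) ** 2 for p in predictions)
    ss_error = sum((y - p) ** 2 for y, p in zip(y_data, predictions))
    ss_total = sum((y - mean_y) ** 2 for y in y_data)
    return ss_regression, ss_error, ss_total


# One predictor: each x is a float.
uni = sum_of_squares(lambda x: 1.0 + 2.0 * x,
                     [0.0, 1.0, 2.0, 3.0],
                     [1.0, 3.0, 5.0, 7.0])

# Two predictors: each x is a pair of floats; the same function is reused.
multi = sum_of_squares(lambda x: 1.0 + 2.0 * x[0] + 3.0 * x[1],
                       [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
                       [1.0, 3.0, 4.0, 6.0])
```

The trade-off mentioned above is visible here: the signature says nothing about how an observation is laid out, so matrix orientation has to be settled by the caller before the data reaches the function.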

Do you prefer one of these options or have another idea?

bvenn, Apr 19 '21 20:04

My preference is for there to be one calculateSumOfSquares function that operates on regressions regardless of the number of parameters.

But this is part of a bigger comment (speaking to this https://github.com/fslaborg/FSharp.Stats/issues/94). I haven't understood why there is one API entry point for OLS regressions with 1 feature and another for OLS regressions with > 1 features. The multivariable functions would produce the same results as the univariable ones if you use an Nx1 matrix (N observations × 1 parameter) instead of an N-length vector as x. The math should be the same; is the split there for some computational reason, to allow better performance when there is 1 feature?
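The equivalence claimed above can be checked numerically. The sketch below is plain Python and does not call FSharp.Stats; fit_univariable, fit_multivariable and solve_linear_system are hypothetical helpers, and the multivariable version is assumed to prepend an intercept column and solve the normal equations.

```python
def fit_univariable(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = ss_xy / ss_xx
    return mean_y - slope * mean_x, slope


def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    solution = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_multivariable(x_rows, ys):
    """OLS via the normal equations (X^T X) beta = X^T y.

    x_rows holds one list of predictors per observation; an intercept
    column of ones is prepended. Returns [intercept, coef_1, ...].
    """
    design = [[1.0] + list(row) for row in x_rows]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

uni = fit_univariable(xs, ys)
# The same data passed as an N x 1 matrix (one single-element row per x).
multi = fit_multivariable([[x] for x in xs], ys)
```

Both calls give the same intercept and slope up to floating-point error, which supports the point that the separate entry points differ in signature rather than in the underlying math.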

To me (and perhaps this is just me coming from a different discipline) it overcomplicates the API surface.

nhirschey, Apr 20 '21 17:04

The current structure emerged from our everyday data analysis work and was influenced by the chronological order in which it was implemented. I absolutely agree that a generic function is missing and might be straightforward to implement. Nevertheless, specialized functions with clear signatures are an easier and frustration-free entry point for our students, e.g. to avoid confusing matrix orientations.

Long story short, it would be great to have a generic implementation of calculateSumOfSquares, which could then have specialized functions built on top of it.

For determining the significance of regression coefficients, there are F test statistics available at Testing.TestStatistics.FTestStatistics. In the process of renewing the Fitting module (#94) we aim to reduce the highly branched module structure. A generic function to test the coefficients (as in GoodnessOfFit.ttestIntercept for univariable simple linear regression) could then be introduced.
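For reference, the quantities such a generic coefficient test would need are the textbook OLS ones. The notation below is an assumption introduced here, not taken from the library: X is the n × p design matrix including the intercept column, y the response vector, and the errors are assumed normal and homoskedastic.

```latex
\begin{aligned}
\hat{\beta} &= (X^{\top} X)^{-1} X^{\top} y \\
\hat{\sigma}^{2} &= \frac{\mathrm{SSE}}{n - p},
  \qquad \mathrm{SSE} = \sum_{i=1}^{n} \bigl(y_i - x_i^{\top}\hat{\beta}\bigr)^{2} \\
\mathrm{se}(\hat{\beta}_j) &= \sqrt{\hat{\sigma}^{2}\,\bigl[(X^{\top} X)^{-1}\bigr]_{jj}} \\
t_j &= \frac{\hat{\beta}_j}{\mathrm{se}(\hat{\beta}_j)} \sim t_{n-p}
  \quad \text{under } H_0\colon \beta_j = 0
\end{aligned}
```

With p = 2 (intercept plus one slope) these reduce to the simple linear regression case, so one implementation could serve both the univariable and the multivariable fit.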

bvenn, Apr 21 '21 11:04

Thanks for the context. I fully support your library being optimized for your needs and priorities. And I agree that clear signatures are nice. Maybe calculateSumOfSquares.Multivariable ... Anyway, more big-picture thoughts are in my comment on #94, but keep in mind that my comments there are in the spirit of idea generation. You are the developers, and your needs are the priority.

Thank you for pointing me to the Testing code. I will check that out.

nhirschey, Apr 21 '21 15:04