FSharp.Stats
Renew fitting module structure
Description
The Fitting module is currently fragmented into modules of diverse complexity. The usage of one of the most used regression types (simple linear regression) is hidden under several module layers (Fitting.LinearRegression.OrdinaryLeastSquares.Linear.Univariable). While the naming is correct, a cleaned-up structure would simplify its usage.
Since univariable least squares is the default regression method in linear regression, it may be an option to modify the structure so that linear regression can be found at Fitting.LinearRegression.Linear and Fitting.LinearRegression.Polynomial. Other fitting methods may be modularized as they are. E.g.:
Fitting.LinearRegression.OrdinaryLeastSquares.Linear.RTO -> Fitting.LinearRegression.Linear.RTO
Fitting.LinearRegression.RobustRegression.Linear -> Fitting.LinearRegression.RobustRegression.Linear
I fear, however, that this proposed version isn't perfect, so if someone has a good idea, please let me know.
I completely agree that the module would benefit from an overhaul. On the other hand, I really enjoy the correct naming in this lib. It can be somewhat tedious in other libs to find out what kind of method is used, and sometimes simplification leads to simply incorrect naming.
The nested modules could be accompanied by something like this:
module LinearRegression =
    ...
    module OrdinaryLeastSquares =
        ...
    type FittingMethod =
        | Linear
        | Polynomial of int
    let coefficient fittingMethod =
        match fittingMethod with
        | Linear -> OrdinaryLeastSquares.Linear.Univariable.coefficient
        | Polynomial o -> OrdinaryLeastSquares.Polynomial.coefficient o
This keeps the "correct" structure but instead of navigating down one can perform the most frequently used method by:
FSharp.Stats.Fitting.LinearRegression.coefficient FittingMethod.Linear
One thing I frequently stumble across is the function naming. Maybe we can exchange "coefficient" with "fit" and replace "fit" with "predict". This would be in line with the naming used in other .NET libs, e.g. ML.NET.
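To make the proposed renaming concrete, here is a minimal sketch of how "fit" (estimate coefficients) and "predict" (evaluate the fitted model) could read for simple linear regression. The module name SimpleLinear and the Coefficients record are hypothetical and not part of FSharp.Stats; the math is plain univariable least squares.

```fsharp
// Hypothetical sketch: none of these names are part of FSharp.Stats.
module SimpleLinear =

    /// Coefficients of the line y = Intercept + Slope * x.
    type Coefficients = { Intercept: float; Slope: float }

    /// Estimates the coefficients by ordinary least squares
    /// (the role played by "coefficient" today).
    let fit (xs: float[]) (ys: float[]) : Coefficients =
        let meanX = Array.average xs
        let meanY = Array.average ys
        // Sum of cross deviations and sum of squared x deviations.
        let sxy = Array.fold2 (fun acc x y -> acc + (x - meanX) * (y - meanY)) 0.0 xs ys
        let sxx = xs |> Array.sumBy (fun x -> (x - meanX) ** 2.0)
        let slope = sxy / sxx
        { Intercept = meanY - slope * meanX; Slope = slope }

    /// Evaluates the fitted line at x (the role played by "fit" today).
    let predict (coef: Coefficients) (x: float) : float =
        coef.Intercept + coef.Slope * x

// Data generated from y = 1 + 2x.
let coef = SimpleLinear.fit [| 1.0; 2.0; 3.0; 4.0 |] [| 3.0; 5.0; 7.0; 9.0 |]
let yHat = SimpleLinear.predict coef 5.0
```

With this naming, the call site reads as "fit, then predict", which matches what users coming from other libraries tend to expect.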
Some rough sketches. My comments here are heavily influenced by what I want for my work and what I have found to work well in e.g. R. Overall, my comment is that the API can steal some ideas from the interfaces in R, Python, Stata, MATLAB, Julia, etc. Part of this is simply that the library is new, but in the spirit of sharing ideas, here it goes:
I think it is important to make it easy to:
- Apply multiple models to the same data using different explanatory variables from a dataset. Using model formulas (like R's formula interface) is nice for this, and using F# functions for this could work well.
- Apply different variance/covariance matrix estimates to the same fit to check different standard error calculations. R's sandwich package has a very good, flexible interface for this.
- Produce nice output tables for fsi, html, and latex outputs like msummary. This twitter thread shows impressive capabilities.
Other thoughts:
- I know that points 2 and 3 above are sufficiently big to be independent projects; I am just pointing them out as stretch goals, and to keep in mind making the fit infrastructure easy for people to build libraries adding 2 and 3 on top.
- I think that "ols" is common enough to go with the abbreviation vs. OrdinaryLeastSquares. But this is not so important.
- I like Zimmerd's suggestions of coefficients instead of fit (fit to me seems like coefficients + other stuff) and predict instead of fit for getting predicted/fitted values.
A rough sketch heavily influenced by R lm function, the sandwich standard errors package, and msummary summary package:
open FSharp.Stats.LinearRegression
// formula outputs y and x [] for each array item.
let OrdinaryLeastSquares (data:'T []) (formula: 'T [] -> (float * float []) []) =
    ... fitting code
// for example (each formula maps every record to its y value and its x values)
let formula1 (xs: MyRecord array) = xs |> Array.map (fun x -> x.Y, [| x.Var1; x.Var2; x.Var3**2.0 |]) // multivariable
let formula2 (xs: MyRecord array) = xs |> Array.map (fun x -> x.Y, [| x.Var1 |]) // univariable
let formula3 (xs: MyRecord array) = xs |> Array.map (fun x -> x.Y, polynomial(x.Var1, order=3)) // generic polynomial function
let modelFormulas = [formula1; formula2; formula3]
let vcovFunctions = [vcovHC; vcovHAC] // these have different variance-covariance matrix functions
modelFormulas
|> List.map (OrdinaryLeastSquares myData)
|> List.collect(fun fit -> vcovFunctions |> List.map fit)
|> List.map modelSummary // a customizable report with coefficients, r^2, t-stats, p-values, etc.
Also, given we might need many "options" for OrdinaryLeastSquares maybe a class with optional parameters makes sense? Though a function with pipeable options (like Plotly.NET) could work too.
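The two option styles mentioned here could look roughly like the sketch below. Everything in it (OlsConfig, Ols.Configure, the Config module and both fields) is made up for illustration and is not an existing FSharp.Stats or Plotly.NET API; it only shows how the options would be assembled, not the fitting itself.

```fsharp
// Hypothetical sketch: OlsConfig, Ols and Config are illustrative names only.

/// Settings a fit might need; the two fields are placeholders.
type OlsConfig =
    { FitIntercept: bool
      Weights: float[] option }

/// Style A: a class member with optional parameters.
type Ols() =
    static member Configure(?fitIntercept: bool, ?weights: float[]) : OlsConfig =
        { FitIntercept = defaultArg fitIntercept true
          Weights = weights }

/// Style B: pipeable setters that each return an updated copy.
module Config =
    let defaults = { FitIntercept = true; Weights = None }
    let withIntercept (flag: bool) (cfg: OlsConfig) = { cfg with FitIntercept = flag }
    let withWeights (w: float[]) (cfg: OlsConfig) = { cfg with Weights = Some w }

// Both styles build the same configuration value.
let viaClass = Ols.Configure(fitIntercept = false, weights = [| 1.0; 2.0 |])

let viaPipe =
    Config.defaults
    |> Config.withIntercept false
    |> Config.withWeights [| 1.0; 2.0 |]
```

Since both styles can produce the same record, they could coexist: the pipeable functions as the core, and the class with optional parameters as a thin convenience layer on top.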
I was surprised to realize yesterday that the ML.NET API is not too far from what I'd like, though still a bit clunky and, I suspect, harder to extend (I've been really impressed with FSharp.Stats overall and how easy it was for me to actually contribute a bit of code to the project).
#r "nuget:Microsoft.ML,1.5"
#r "nuget:Microsoft.ML.MKL.Components,1.5"
open Microsoft.ML
open Microsoft.ML.Data
let ctx = new MLContext()
let dta = ctx.Data.LoadFromEnumerable<MyRecord>(myDataArray)
let trainer = ctx.Regression.Trainers.Ols(labelColumnName="Y",featureColumnName="Features")
let model =
    EstimatorChain()
        .Append(ctx.Transforms.Concatenate("Features",[|"X1";"X2"|]))
        .Append(trainer)
let estimatedModel = dta |> model.Fit
estimatedModel.LastTransformer.Model
Just listing some thoughts for remodeling:
- The central element of the linear regression module should be multivariable polynomial regression. From here on simple linear regression as well as univariable linear regression can be derived easily.
- coeff is replaced with fit and fit is replaced with predict (therefore there is no possibility to use [<Obsolete()>] tags).
- The original structure can be preserved and additional fit and predict functions aggregating all linear regressions described above are created.
- Robust regression and constrained simple linear regression have to be integrated or preserved as stand-alone modules.
- GoodnessOfFit functionalities should be extended to multivariate regression.
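As a rough illustration of the first point in this list, the sketch below puts one general least-squares routine at the center and derives polynomial and simple linear regression from it. All names (GeneralOls, fitDesign, fitPolynomial, fitLinear, predict) are hypothetical, and the solver uses the normal equations with plain Gaussian elimination purely to keep the example self-contained; a real implementation would more likely use the library's linear algebra routines.

```fsharp
// Hypothetical sketch: names are illustrative and not part of FSharp.Stats.
module GeneralOls =

    /// Solves the square system a * x = b by Gaussian elimination with partial pivoting.
    let private solve (a: float[][]) (b: float[]) : float[] =
        let n = b.Length
        // Augmented working copy [a | b], so the inputs are not mutated.
        let m = Array.init n (fun i -> Array.append (a.[i]) [| b.[i] |])
        for col in 0 .. n - 1 do
            // Swap in the row with the largest absolute entry in this column.
            let pivot = [ col .. n - 1 ] |> List.maxBy (fun r -> abs (m.[r].[col]))
            let tmp = m.[col]
            m.[col] <- m.[pivot]
            m.[pivot] <- tmp
            for row in col + 1 .. n - 1 do
                let factor = m.[row].[col] / m.[col].[col]
                for k in col .. n do
                    m.[row].[k] <- m.[row].[k] - factor * m.[col].[k]
        // Back substitution on the upper triangular system.
        let x = Array.zeroCreate<float> n
        for i in n - 1 .. -1 .. 0 do
            let mutable s = m.[i].[n]
            for j in i + 1 .. n - 1 do
                s <- s - m.[i].[j] * x.[j]
            x.[i] <- s / m.[i].[i]
        x

    /// Central routine: least-squares coefficients for a design matrix
    /// (one row per observation) via the normal equations (X^T X) beta = X^T y.
    let fitDesign (design: float[][]) (ys: float[]) : float[] =
        let p = design.[0].Length
        let xtx =
            Array.init p (fun i ->
                Array.init p (fun j -> design |> Array.sumBy (fun row -> row.[i] * row.[j])))
        let xty =
            Array.init p (fun i ->
                Array.map2 (fun (row: float[]) y -> row.[i] * y) design ys |> Array.sum)
        solve xtx xty

    /// Univariable polynomial regression: each design row is [1; x; x^2; ...; x^order].
    let fitPolynomial (order: int) (xs: float[]) (ys: float[]) : float[] =
        let design = xs |> Array.map (fun x -> Array.init (order + 1) (fun k -> x ** float k))
        fitDesign design ys

    /// Simple linear regression is the order-1 special case.
    let fitLinear (xs: float[]) (ys: float[]) : float[] = fitPolynomial 1 xs ys

    /// Evaluates the polynomial with the given coefficients at x.
    let predict (coef: float[]) (x: float) : float =
        coef |> Array.mapi (fun k c -> c * x ** float k) |> Array.sum

// Data generated from y = 1 + 2x and from y = 1 + x^2.
let xs = [| 0.0; 1.0; 2.0; 3.0 |]
let linearCoef = GeneralOls.fitLinear xs [| 1.0; 3.0; 5.0; 7.0 |]
let quadraticCoef = GeneralOls.fitPolynomial 2 xs [| 1.0; 2.0; 5.0; 10.0 |]
```

A multivariable fit would call fitDesign directly with one column per explanatory variable, which is why it is the natural place to hang shared functionality such as goodness-of-fit measures.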