GLM.jl Linear separability diagnostics?

One thing I'd really like is for Julia to tell the user when the data are linearly separable under a logistic model. This could be done by having a call to glm for logistic models finish with a call to predict that checks whether any responses are mispredicted. If none are, it would be nice to output a message noting this.
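A minimal sketch of that post-fit check, written in plain Python rather than against GLM.jl so it stays self-contained. Everything here is an assumption for illustration: the names `fit_logistic` and `no_mispredictions` are hypothetical, and the fit uses simple gradient ascent, not the IRLS routine GLM.jl uses. Note that a check like this only catches complete separation; quasi-complete separation can still leave some observations on the boundary mispredicted.

```python
import math

def sigmoid(eta):
    # Numerically stable logistic function.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def fit_logistic(X, y, steps=2000, lr=0.1):
    # Hypothetical helper: gradient ascent on the logistic log-likelihood.
    # X is a list of rows (include a column of ones for the intercept).
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for row, yi in zip(X, y):
            mu = sigmoid(sum(b * x for b, x in zip(beta, row)))
            for j, x in enumerate(row):
                grad[j] += (yi - mu) * x
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta

def no_mispredictions(X, y, beta):
    # True when thresholding the fitted probabilities at 0.5 reproduces
    # every response, i.e. the fitted hyperplane separates the classes.
    return all(
        (sigmoid(sum(b * x for b, x in zip(beta, row))) > 0.5) == (yi == 1)
        for row, yi in zip(X, y)
    )

X = [[1, -2], [1, -1], [1, 1], [1, 2]]
y_separable = [0, 0, 1, 1]
y_overlapping = [0, 1, 0, 1]

for y in (y_separable, y_overlapping):
    beta = fit_logistic(X, y)
    if no_mispredictions(X, y, beta):
        print("warning: no mispredicted responses; data may be linearly separable")
```

With these toy data only the first response vector triggers the message. Running the check after the fit leaves the fitting loop itself untouched.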

Jan 11 '13 23:01 johnmyleswhite

Kjell Konis's 2007 thesis surveys various practical methods and explains the approach taken in R's safeBinaryRegression.

Apr 06 '18 03:04 Nosferican

Thanks for the reference. I think that, ideally, the check could be a post-processing function, potentially run as part of the coeftable function.

Apr 06 '18 07:04 andreasnoack

I thought the main consideration would be to have the detection work during the fitting process and deal with it there (e.g. drop covariates, drop observations, issue a warning, stop the iteration early, etc.). This is the approach Stata uses: it sequentially drops covariates / observations until the separability disappears. If that isn't possible, it issues an error saying there are no valid observations.
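A loose sketch of the drop-and-retry idea, in plain Python. This is not Stata's actual algorithm, only an assumed simplification: it looks at indicator-style columns, and when every row with a nonzero value in a column has the same response it drops that column together with those rows, then repeats. The function name `drop_perfect_predictors` is hypothetical.

```python
def drop_perfect_predictors(X, y):
    # Assumed simplification: a column "perfectly predicts" when all rows
    # where it is nonzero share one response value. Columns that are nonzero
    # in every remaining row (e.g. the intercept) are left alone.
    rows = list(range(len(y)))
    cols = list(range(len(X[0])))
    changed = True
    while changed:
        changed = False
        for j in cols:
            hit = [i for i in rows if X[i][j] != 0]
            if hit and len(hit) < len(rows) and len({y[i] for i in hit}) == 1:
                hit_set = set(hit)
                rows = [i for i in rows if i not in hit_set]
                cols = [c for c in cols if c != j]
                changed = True
                break  # restart the scan on the reduced data
    if len({y[i] for i in rows}) < 2:
        # Nothing left to estimate from once the response stops varying.
        raise ValueError("no valid observations")
    X_kept = [[X[i][c] for c in cols] for i in rows]
    y_kept = [y[i] for i in rows]
    return X_kept, y_kept, cols

# Column 1 is an indicator that is 1 only for rows whose response is 1.
X = [[1, 0, 0.5], [1, 0, 1.5], [1, 1, 0.7], [1, 1, 2.0], [1, 0, 1.0]]
y = [0, 1, 1, 1, 0]
X_kept, y_kept, kept_cols = drop_perfect_predictors(X, y)
print(kept_cols, len(y_kept))
```

Returning the kept column indices lets the caller report exactly what was dropped, which matters if the dropping is to happen automatically.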

Apr 06 '18 07:04 Nosferican

I wouldn't be in favor of too much magic happening automatically. I'd rather provide the tools to diagnose this and let the user adjust the model. I also wouldn't be in favor of slowing down the fitting procedure. You might only be interested in prediction or parameters not affected by the separation.

Apr 06 '18 07:04 andreasnoack

The methods outlined take into consideration the additional computational expense incurred. I recently implemented O’Leary (1990) IRLS QR Newton (which might be one of the DenseQR methods here?) while developing a few routines missing in GLM, and I could use it to verify the computational cost of adding those. The checks would not apply to all models, only to those that are "unsafe", but I agree that warnings in this case might be preferred to an unspecified handling method. Linear separability seems trickier than a non-full-rank matrix, which I am totally fine with automatically making full rank and letting the user know. As for development, I think the safe-binary algorithms could be developed in a separate package and used in GLM. It might be nice to have the IRLS methods moved to a solver package too and called from GLM. Those can be optimized for Dense, Sparse, Mixed, and Distributed cases (see Kane and Lewis working notes). I mention this since StatsModels is moving to allow other tabular data packages with different capabilities from DataFrames (Slack#Data). If this is something to consider, I can move that discussion to a different issue to limit this one to linear separability.

Apr 06 '18 08:04 Nosferican