LinAlgError: Matrix is Singular

Open scardonau94 opened this issue 3 years ago • 15 comments

Hello everyone, I am trying to fit a GWR model. I am following the example code, and each example uses the same pipeline. When I run the "gwr_selector" step, a LinAlgError: Matrix is Singular error appears. I have 2941 polygons and 20 variables to fit the model. The only way I can get the code to work is to fit it with 150 polygons and 5 variables. Do you know what kind of mistake I am making? Best

image

scardonau94 avatar May 19 '22 14:05 scardonau94

Yes. I think that bw_min is probably too small. If there are not enough observations, the least squares procedure will not be able to invert a local g_X. Here, you've defined this to use, at a minimum, two observations in each local model. If this is the case, then XtX will be singular when the two observations have the same value for any of the variables.

So, try increasing bw_min.

Does this not work by fitting to the full dataset of 2941 polygons and a larger bw_min?
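
To make the two-observation case concrete, here is a minimal NumPy sketch (not mgwr code; the values are made up for illustration). When both observations in a local model share the same value for a variable, that column is a multiple of the intercept column, so XtX loses rank:

```python
import numpy as np

# Two observations in the local model, sharing the same value (5.0) for x.
# Columns: intercept, x. The data are invented for illustration.
X_local = np.array([[1.0, 5.0],
                    [1.0, 5.0]])

XtX = X_local.T @ X_local
rank = np.linalg.matrix_rank(XtX)

print(XtX)
print("rank:", rank, "of", XtX.shape[0])  # a full-rank XtX would have rank 2
```

A rank below the number of columns means XtX has no ordinary inverse, which is the condition the error message reports.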

ljwolf avatar May 19 '22 14:05 ljwolf

Thank you for your quick answer. I tried different combinations of bw_min, but none of them solved the problem.

scardonau94 avatar May 19 '22 16:05 scardonau94

Then it may be related to your model specification. Is there any variable that is perfectly collinear?

ljwolf avatar May 19 '22 16:05 ljwolf

I have checked for collinearity, but I did not find any perfectly collinear variables.

scardonau94 avatar May 19 '22 21:05 scardonau94

You checked local collinearity? This can change at each different bandwidth that is explored when you are using the bandwidth search procedure.
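
One rough way to check this outside of mgwr is to look at the condition number of each site's neighbourhood design matrix for a given neighbour count. The sketch below assumes an adaptive (nearest-neighbour) bandwidth and ignores kernel weights, so it only approximates what the search sees; `local_condition_numbers` is a hypothetical helper, not an mgwr function.

```python
import numpy as np

def local_condition_numbers(coords, X, k):
    """Condition number of each site's k-nearest-neighbour design matrix.

    Hypothetical helper, not part of mgwr. It is unweighted, so it only
    approximates what an adaptive kernel with bandwidth k would see.
    """
    coords = np.asarray(coords, dtype=float)
    X = np.asarray(X, dtype=float)
    design = np.column_stack((np.ones(len(X)), X))  # prepend an intercept
    conds = np.empty(len(X))
    for i, site in enumerate(coords):
        dist = np.linalg.norm(coords - site, axis=1)
        nearest = np.argsort(dist)[:k]
        conds[i] = np.linalg.cond(design[nearest])
    return conds

# Synthetic data: one covariate that is flat zero in part of the study area.
rng = np.random.default_rng(0)
coords = rng.random((200, 2))
x = rng.random(200)
x[coords[:, 0] < 0.3] = 0.0

for k in (10, 50, 150):
    conds = local_condition_numbers(coords, x[:, None], k)
    print(k, "sites with cond > 1e12:", int((conds > 1e12).sum()))
```

Sites with a very large or infinite condition number at a given k are the ones likely to trigger the error when the search tries that bandwidth.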

TaylorOshan avatar May 19 '22 21:05 TaylorOshan

I have had this problem before and found a workaround, not sure if it's a "valid" approach or not:

If the variable that is causing you trouble is a floating point value, you might be able to get away with adding a little bit of random "dust" to it. For instance, I had a particular variable that for 80% of my observations was in the range of 1,000-10,000. But for about 20% of the observations, this variable is flat zero. The flat zeros were causing the issue if they happened to be the only ones in a particular bandwidth range, or so I surmised.

So my solution was to add "dust" to all the variable values. A random amount between 0.00-0.99. My final values will all be rounded to the nearest whole number anyways.

Adding the "dust" makes it so that all the troublesome parcels with the zero value are now technically different from one another. And hopefully the amount on them is so small that it won't meaningfully affect the predictions.

larsiusprime avatar Jul 28 '23 04:07 larsiusprime

Hi @larsiusprime! that's a reasonable way to avoid the singularity issue if you can afford that small bit of random noise in your analysis budget. For most, adding a random value somewhere between [0,1e-4] is probably sufficient.
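
A minimal sketch of the "dust" idea in NumPy, assuming a covariate that is flat zero at some sites. The data, the seed, and the variable names are made up for illustration; only the [0, 1e-4] range comes from the suggestion above.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the jitter is reproducible

# Hypothetical covariate: mostly in the thousands, but flat zero for four sites.
x = np.array([0.0, 0.0, 0.0, 0.0, 2500.0, 7300.0, 1200.0, 9800.0])

# "Dust" drawn uniformly from [0, 1e-4).
x_dusted = x + rng.uniform(0.0, 1e-4, size=x.shape)

# The zero-valued sites are now distinct from one another ...
print(np.unique(x[:4]).size, "->", np.unique(x_dusted[:4]).size)

# ... so a local model containing only those sites is no longer rank deficient.
local = np.column_stack((np.ones(4), x_dusted[:4]))
local_rank = np.linalg.matrix_rank(local)
print("local rank:", local_rank)
```

Note that the jittered local matrix is invertible but still badly conditioned, so the local estimates for that covariate should be read with care.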

For any potential developer interested in solving this in our code, the solution would be to swap our current numpy.linalg.inv() to a pseudo-inverse, like pinv_extended() in statsmodels. See, for example, a regression that fits on a perfectly collinear input:

>>> import statsmodels
>>> import numpy
>>> from statsmodels import api as sm
>>> x = numpy.random.random(size=100)
>>> X = numpy.column_stack((numpy.ones_like(x), x, x,)) # perfectly collinear columns 2 & 3
>>> y = X @ numpy.array([[3, -2, 4]]).T
>>> sm.OLS(endog=y, exog=X, hasconst=True).fit().summary()
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.130e+31
Date:                Fri, 28 Jul 2023   Prob (F-statistic):               0.00
Time:                        09:51:05   Log-Likelihood:                 3265.6
No. Observations:                 100   AIC:                            -6527.
Df Residuals:                      98   BIC:                            -6522.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.0000   3.66e-16   8.21e+15      0.000       3.000       3.000
x1             1.0000   2.97e-16   3.36e+15      0.000       1.000       1.000
x2             1.0000   2.97e-16   3.36e+15      0.000       1.000       1.000
==============================================================================
Omnibus:                        6.528   Durbin-Watson:                   0.115
Prob(Omnibus):                  0.038   Jarque-Bera (JB):                8.807
Skew:                           0.267   Prob(JB):                       0.0122
Kurtosis:                       4.352   Cond. No.                     1.36e+17
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 9.07e-33. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
"""

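For comparison, here is a small NumPy-only sketch of the difference between the two approaches on an exactly collinear design. This is not mgwr's internal code; it only illustrates the normal-equations step, and the explicit `rcond` value is my own choice for the example.

```python
import numpy as np

# Integer-valued design so the collinearity is exact: columns 2 and 3 are identical.
x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack((np.ones_like(x), x, x))
y = X @ np.array([3.0, -2.0, 4.0])  # same coefficients as the example above

XtX = X.T @ X
Xty = X.T @ y

try:
    beta_inv = np.linalg.inv(XtX) @ Xty
except np.linalg.LinAlgError as err:
    beta_inv = None
    print("inv failed:", err)

# The pseudo-inverse returns the minimum-norm least-squares solution instead.
# rcond is set explicitly so the near-zero singular value is discarded.
beta_pinv = np.linalg.pinv(XtX, rcond=1e-10) @ Xty
print("pinv coefficients:", beta_pinv)
```

As in the statsmodels output, the pseudo-inverse splits the combined effect of the two identical columns (-2 + 4 = 2) evenly between them.
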
ljwolf avatar Jul 28 '23 08:07 ljwolf

Any idea if there will be a fix to this? The program essentially doesn't work and always results in this error.

I had this error 3 years ago, came back to the same project and the error is still present.

jagreen1 avatar Dec 20 '23 13:12 jagreen1

@jagreen1 there is ongoing work to fix this in https://github.com/pysal/mgwr/pull/134

martinfleis avatar Dec 20 '23 13:12 martinfleis

Good to hear. I have read through that thread, but don't have sufficient expertise myself to contribute. I'm essentially lost as to how to progress with the project unless this can be fixed. Cheers.

jagreen1 avatar Dec 20 '23 13:12 jagreen1

@jagreen1 Can you fit a regular OLS on your data?

from spreg import OLS
OLS(y, x)

If you cannot fit an OLS, then the problem is not with MGWR (#132).

If you can, one way that frequently works is to increase the minimum bandwidth.

This LinAlgError can arise because the "local" model near a given site has all the same values for some feature. Forcing the bandwidth larger prevents this while also preventing overfitting.

You can see that the original poster of this issue is setting bw_min=2, which will pretty much always fail if there's a categorical/one-hot encoded feature.
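
As a rough guide for picking a lower bound, the sketch below finds, for each site, the smallest number of nearest neighbours at which no feature is constant. It assumes an adaptive (neighbour-count) bandwidth and ignores kernel weights; `smallest_safe_k` is a hypothetical diagnostic, not part of mgwr.

```python
import numpy as np

def smallest_safe_k(coords, X):
    """Per site, the smallest neighbour count at which no column of X is constant.

    Hypothetical diagnostic, not part of mgwr. A column that is constant
    inside the local window is collinear with the intercept, so bandwidths
    below this count are candidates for a singular local model.
    """
    coords = np.asarray(coords, dtype=float)
    X = np.asarray(X, dtype=float)
    n = len(X)
    safe_k = np.full(n, n + 1)  # n + 1 marks a column that is constant everywhere
    for i in range(n):
        order = np.argsort(np.linalg.norm(coords - coords[i], axis=1))
        Xs = X[order]
        # varies[j, c] is True once column c has taken more than one value
        # among the j + 1 nearest neighbours of site i.
        varies = np.cumsum(Xs != Xs[0], axis=0) > 0
        ok = np.flatnonzero(varies.all(axis=1))
        if ok.size:
            safe_k[i] = ok[0] + 1
    return safe_k

# Toy data: sites on a line, with a one-hot feature that is 1 only for the last ten.
coords = np.column_stack((np.arange(30.0), np.zeros(30)))
dummy = (np.arange(30) >= 20).astype(float)
k = smallest_safe_k(coords, dummy[:, None])
print("suggested lower bound for bw_min:", k.max())
```

The maximum over sites is a conservative starting point; a value somewhat above it leaves room for the kernel down-weighting the farthest neighbours.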

ljwolf avatar Dec 20 '23 14:12 ljwolf

Yes, there are no problems with the OLS. I'm using a logistic/binomial (0 or 1) dataset of 100k points, and a further subset of just 20k points. It works with some bandwidths and indiscriminately not with others.

jagreen1 avatar Dec 20 '23 16:12 jagreen1

Interesting, OK. And, to confirm, the issue arises in Sel_BW()?

Do you have any categorical/one-hot features, or are they all continuous?

This is something I've long been interested in conceptually... I hope to have the proof of concept linked above completed by early Jan.

ljwolf avatar Dec 20 '23 16:12 ljwolf

@ljwolf Yes, I can confirm that this issue occurs during sel_bw, in my case for a binomial regression model.

The independent variables are continuous (not categorical), however where data wasn't available I had to assign values of zero. Not sure if that causes an issue.

I have primarily been using the MGWR GUI application, which often has the error LinAlgError: Matrix is Singular. This occurs for seemingly random bandwidths. For example, for one dataset I tried running the GWR analysis for bandwidths 1770 to 1780 at intervals of 1. The regression ran for a bandwidth of 1780, but not for 1770 through 1779.

MGWR_Error_1 MGWR_Error_3

I decided to try analyzing the data purely in Python (not using the GUI), and I now receive a slightly different error when calling sel_bw: IndexError: invalid index to scalar variable. This error doesn't occur when using the default 'Gaussian' model; however, given that my dependent variable is binary, that model isn't appropriate.

MGWR_sel_bw_error

jagreen1 avatar Dec 21 '23 17:12 jagreen1

> had to assign a value of zero

Yes, it would. See @larsiusprime's comment. It's a perfectly useful fix here.

What "Matrix is Singular" means is that the weighted least squares matrix (Xt W X) is not invertible. This is often because some variable in X is perfectly collinear with another variable. If you fill all your missing data with zeros and this missing data occurs more commonly in some localities, then it's entirely possible that you're getting all zeros in some local model for some covariate... like, all x values for sites within the bandwidth are zero. When this happens at one site, that x becomes perfectly collinear with the intercept, that local model becomes degenerate, and the error is thrown.
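
As a worked instance of that last point, assume a local model with just an intercept and one covariate that takes the same value c at every site within the bandwidth (c = 0 for the zero-filled case), with kernel weights w_i. The symbols c and w_i are introduced here only for illustration:

```latex
X = \begin{pmatrix} 1 & c \\ \vdots & \vdots \\ 1 & c \end{pmatrix},
\qquad
W = \operatorname{diag}(w_1, \dots, w_n),
\qquad
X^\top W X = \Big(\sum_i w_i\Big) \begin{pmatrix} 1 & c \\ c & c^2 \end{pmatrix},
\qquad
\det\!\big(X^\top W X\big) = \Big(\sum_i w_i\Big)^{2} \,(c^2 - c \cdot c) = 0 .
```

The determinant is zero for any c and any weights, so no choice of kernel rescues that local model; only more variation in the covariate (a larger bandwidth or the jitter workaround) does.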

ljwolf avatar Dec 21 '23 18:12 ljwolf