mgwr
LinAlgError: Matrix is Singular
Hello everyone, I am trying to fit a GWR model. I am following example code, and each example uses the same pipeline. When I run the "gwr_selector" step, a LinAlgError: Matrix is Singular is raised. I have 2941 polygons and 20 variables to fit the model. The only way the code works is to fit it with 150 polygons and 5 variables. Do you know what kind of mistake I am making? Best

Yes. I think that bw_min is probably too small. If there are not enough observations, the least squares procedure will not be able to invert a local g_X. Here, you've defined this to use, at a minimum, two observations in each local model. If this is the case, then XtX will be singular when the two observations have the same value for any of the variables.
So, try increasing bw_min.
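A minimal numpy sketch of why a two-observation local model tends to break down. The matrices below are made-up illustrations, not mgwr's internal code: two observations are enough only when there is a single covariate whose values differ, and never enough once there are more columns than rows.

```python
import numpy as np

def is_invertible(X_local):
    """True when the local X'X has full rank, so least squares has a unique solution."""
    # rank(X'X) equals rank(X), so checking X directly is enough
    return bool(np.linalg.matrix_rank(X_local) == X_local.shape[1])

# intercept + one covariate, two observations with different values
X_distinct = np.array([[1.0, 2.0], [1.0, 3.0]])
# same layout, but both observations share the covariate value
X_repeated = np.array([[1.0, 2.0], [1.0, 2.0]])
# intercept + two covariates, still only two observations
X_too_few = np.array([[1.0, 2.0, 5.0], [1.0, 3.0, 7.0]])

for name, X_local in [("distinct", X_distinct),
                      ("repeated", X_repeated),
                      ("too few rows", X_too_few)]:
    print(name, "->", "invertible" if is_invertible(X_local) else "singular")
```

With 20 variables, any local model that effectively sees fewer than 21 observations falls into the last case.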
Does this not work by fitting to the full dataset of 2941 polygons and a larger bw_min?
Thank you for your quick answer. I tried different values of bw_min, but none of them solved the problem.
Then it may be related to your model specification. Is there any variable that is perfectly collinear?
I have checked collinearity, but I did not find any perfectly collinear variables.
You checked local collinearity? This can change at each different bandwidth that is explored when you are using the bandwidth search procedure.
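One rough way to check local collinearity is to scan every site's nearest neighbours and test the rank of the local design matrix. This is only a sketch: `rank_deficient_sites` is a hypothetical helper, and the k nearest neighbours stand in for an adaptive bandwidth rather than reproducing mgwr's kernel weighting.

```python
import numpy as np

def rank_deficient_sites(coords, X, k):
    """Indices of sites whose k nearest neighbours give a rank-deficient local X.

    coords: (n, 2) array of site coordinates
    X:      (n,) or (n, p) covariates WITHOUT the intercept column
    k:      neighbour count, standing in for an adaptive bandwidth
    """
    n = coords.shape[0]
    X1 = np.column_stack([np.ones(n), X])   # add the intercept
    bad = []
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        nearest = np.argsort(d)[:k]          # includes site i itself
        if np.linalg.matrix_rank(X1[nearest]) < X1.shape[1]:
            bad.append(i)
    return bad

# Made-up data: ten sites on a line, covariate is flat zero at the first four
coords = np.column_stack([np.arange(10.0), np.zeros(10)])
x = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

print(rank_deficient_sites(coords, x, k=3))
print(rank_deficient_sites(coords, x, k=6))
```

Running it for each candidate bandwidth shows which sites would become degenerate during the search, even when the global design matrix looks fine.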
I have had this problem before and found a workaround, not sure if it's a "valid" approach or not:
If the variable that is causing you trouble is a floating point value, you might be able to get away with adding a little bit of random "dust" to it. For instance, I had a particular variable that for 80% of my observations was in the range of 1,000-10,000. But for about 20% of the observations, this variable is flat zero. The flat zeros were causing the issue if they happened to be the only ones in a particular bandwidth range, or so I surmised.
So my solution was to add "dust" to all the variable values. A random amount between 0.00-0.99. My final values will all be rounded to the nearest whole number anyways.
Adding the "dust" makes it so that all the troublesome parcels with the zero value are now technically different from one another. And hopefully the amount on them is so small that it won't meaningfully affect the predictions.
Hi @larsiusprime! that's a reasonable way to avoid the singularity issue if you can afford that small bit of random noise in your analysis budget. For most, adding a random value somewhere between [0,1e-4] is probably sufficient.
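A small sketch of the "dust" idea, assuming the covariate is a float array. `add_dust` and its `scale` and `seed` parameters are hypothetical names for illustration; the default scale follows the [0, 1e-4] range suggested above.

```python
import numpy as np

def add_dust(values, scale=1e-4, seed=0):
    """Return a copy of `values` with uniform noise in [0, scale) added."""
    rng = np.random.default_rng(seed)   # fixed seed keeps the analysis reproducible
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(0.0, scale, size=values.shape)

# Made-up covariate: a block of flat zeros next to large values
x = np.array([0.0, 0.0, 0.0, 1500.0, 9200.0])
x_dusted = add_dust(x)

# The former zeros are now distinct, and rounding recovers the original values
print(len(np.unique(x_dusted[:3])))
print(np.round(x_dusted))
```

Note that the dust only breaks ties within a variable. It will not help when there are simply fewer local observations than parameters.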
For any potential developer interested in solving this in our code, the solution would be to swap our current numpy.linalg.inv() to a pseudo-inverse, like pinv_extended() in statsmodels. See, for example, a regression that fits on a perfectly collinear input:
>>> import statsmodels
>>> import numpy
>>> from statsmodels import api as sm
>>> x = numpy.random.random(size=100)
>>> X = numpy.column_stack((numpy.ones_like(x), x, x,)) # perfectly collinear columns 2 & 3
>>> y = X @ numpy.array([[3, -2, 4]]).T
>>> sm.OLS(endog=y, exog=X, hasconst=True).fit().summary()
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.130e+31
Date:                Fri, 28 Jul 2023   Prob (F-statistic):               0.00
Time:                        09:51:05   Log-Likelihood:                 3265.6
No. Observations:                 100   AIC:                            -6527.
Df Residuals:                      98   BIC:                            -6522.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.0000   3.66e-16   8.21e+15      0.000       3.000       3.000
x1             1.0000   2.97e-16   3.36e+15      0.000       1.000       1.000
x2             1.0000   2.97e-16   3.36e+15      0.000       1.000       1.000
==============================================================================
Omnibus:                        6.528   Durbin-Watson:                   0.115
Prob(Omnibus):                  0.038   Jarque-Bera (JB):                8.807
Skew:                           0.267   Prob(JB):                       0.0122
Kurtosis:                       4.352   Cond. No.                     1.36e+17
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 9.07e-33. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
"""
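For comparison, here is a sketch of the difference between a plain inverse and a pseudo-inverse on an exactly singular X'X. It uses numpy.linalg.pinv rather than statsmodels' pinv_extended(), and the tiny design matrix is made up for illustration; it is not how mgwr is currently implemented.

```python
import numpy as np

# Hypothetical local design: intercept plus a covariate that is zero everywhere
X = np.column_stack([np.ones(4), np.zeros(4)])
y = np.array([1.0, 2.0, 3.0, 4.0])
xtx = X.T @ X   # [[4, 0], [0, 0]], exactly singular

try:
    np.linalg.inv(xtx)
    inv_failed = False
except np.linalg.LinAlgError as err:
    inv_failed = True
    print("inv failed:", err)

# The Moore-Penrose pseudo-inverse returns the minimum-norm least squares solution
beta = np.linalg.pinv(xtx) @ X.T @ y
print("pinv coefficients:", beta)
```

The pseudo-inverse assigns a zero coefficient to the degenerate column and the mean of y to the intercept, so the fit goes through, but the coefficient on the degenerate covariate is not really identified.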
Any idea if there will be a fix to this? The program essentially doesn't work and always results in this error.
I had this error 3 years ago, came back to the same project and the error is still present.
@jagreen1 there is ongoing work to fix this in https://github.com/pysal/mgwr/pull/134
Good to hear. I have read through that thread, but don't have sufficient expertise myself to contribute. I'm essentially lost as how to progress with a project unless this can be fixed. Cheers.
@jagreen1 Can you fit a regular OLS on your data?
from spreg import OLS
OLS(y, x)
If you cannot fit an OLS, then the problem is not with MGWR (#132).
If you can, one way that frequently works is to increase the minimum bandwidth.
This LinAlgError can arise because the "local" model near a given site has all the same values for some feature. Forcing the bandwidth larger prevents this while also preventing overfitting.
You can see that the original poster of this issue is setting bw_min=2, which will pretty much always fail if there's a categorical/one-hot encoded feature.
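The contrast between a healthy global fit and a degenerate local one can be sketched with numpy alone. `design_rank` is a hypothetical helper and the dummy variable is made up; a rank check is only a stand-in for actually fitting the OLS.

```python
import numpy as np

def design_rank(X):
    """Return (rank, n_columns) of X after adding an intercept column."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    X1 = np.column_stack([np.ones(X.shape[0]), X])
    return int(np.linalg.matrix_rank(X1)), X1.shape[1]

# Hypothetical one-hot feature: the first four sites are category 0, the rest 1
dummy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

global_rank = design_rank(dummy)       # whole dataset: full rank
local_rank = design_rank(dummy[:2])    # two neighbouring sites: same category

print("global:", global_rank)
print("local: ", local_rank)
```

Globally the dummy varies, so OLS fits. Within the two-site neighbourhood it is constant, so it duplicates the intercept and the local model is singular.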
Yes, there are no problems with the OLS. I'm using a logistic/binomial (0 or 1) dataset of 100k points, and a further subset of just 20k points. It works with some bandwidths and indiscriminately not with others.
Interesting, OK. And, to confirm, the issue arises in Sel_BW()?
Do you have any categorical/one-hot features, or are they all continuous?
This is something I've long been interested in conceptually... I hope to have the proof of concept linked above completed by early Jan.
@ljwolf Yes, I can confirm that this issue occurs during sel_bw, in my case for a binomial regression model.
The independent variables are continuous (not categorical), however where data wasn't available I had to assign values of zero. Not sure if that causes an issue.
I have primarily been using the MGWR GUI application, which often raises LinAlgError: Matrix is Singular.
This occurs for seemingly random bandwidths. For example, for one dataset I tried running the GWR analysis for bandwidths 1770 to 1780 at intervals of 1. The regression ran for a bandwidth of 1780, but not for 1770 through 1779.
I decided to try analyzing the data purely in Python (not using the GUI), and I now receive a slightly different error when calling sel_bw: IndexError: invalid index to scalar variable. This error doesn't occur when using the default 'Gaussian' model, but that model isn't appropriate because my dependent variable is binary.
Regarding "had to assign a value of zero": yes, it would. See @larsiusprime's comment. It's a perfectly useful fix here.
What "Matrix is Singular" means is that the weighted least squares matrix (Xt W X) is not invertible. This is often because some variable in X is perfectly collinear with another variable. If you fill all your missing data with zeros and this missing data occurs more commonly in some localities, then it's entirely possible that you're getting all zeros in some local model for some covariate... like, all x values for sites within the bandwidth are zero. When this happens at one site, that x becomes perfectly collinear with the intercept, that local model becomes degenerate, and the error is thrown.
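The zero-fill scenario can be reproduced in a few lines. This is a sketch under stated assumptions: a one-dimensional layout, a fixed-distance bisquare kernel written by hand, and made-up data, so it only mimics the kind of weighted matrix a local model inverts.

```python
import numpy as np

def bisquare(d, bw):
    """Bisquare kernel: the weight falls to zero at distance bw and beyond."""
    w = (1.0 - (d / bw) ** 2) ** 2
    return np.where(d < bw, w, 0.0)

# Hypothetical data: missing covariate values were filled with zeros
# at the first five sites
coords = np.arange(10, dtype=float)
x = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 3.1, 2.7, 4.4, 1.9, 3.8])
X = np.column_stack([np.ones_like(x), x])

def local_xtwx(site, bw):
    """Weighted cross-product matrix Xt W X for the local model at `site`."""
    w = bisquare(np.abs(coords - coords[site]), bw)
    return X.T @ (w[:, None] * X)

narrow = local_xtwx(site=1, bw=3.0)   # only zero-filled sites receive weight
wide = local_xtwx(site=1, bw=8.0)     # reaches sites with real values

print(np.linalg.matrix_rank(narrow), np.linalg.matrix_rank(wide))
```

With the narrow bandwidth every weighted x is zero, so the second row and column of Xt W X vanish and the matrix cannot be inverted. The wide bandwidth picks up real values and restores full rank.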