Sparse matrix from formula
Is it currently possible to build a sparse design matrix from a formula? This is desirable for formulas with lot of categorical variables and their interactions, which are naturally represented as a sparse matrix of zeros and a relatively few ones. R sparse.model.matrix produces an sparse matrix but, when the model includes interactions, it fallbacks to the standard model.matrix so a dense matrix is created as an intermediate step. I don't know if patsy includes any sparse matrix support at all; in case it doesn't consider this a request for enhancement.
Hello,
No, at the moment patsy has no support for directly consuming or producing sparse matrices. It's been on the TODO list since the beginning, and there's no particular obstacle to doing so, it's just that neither I nor anyone else have gotten around to implementing it. (If you want to have a go, I'll happily review the patch ;-).)
On Thu, Jun 19, 2014 at 7:02 PM, memeplex [email protected] wrote:
Is it currently possible to build a sparse design matrix from a formula? This is desirable for formulas with lot of categorical variables and their interactions, which are naturally represented as a sparse matrix of zeros and a relatively few ones. R sparse.model.matrix produces an sparse matrix but, when the model includes interactions, it fallbacks to the standard model.matrix so a dense matrix is created as an intermediate step. I don't know if patsy includes any sparse matrix support at all; in case it doesn't consider this a request for enhancement.
— Reply to this email directly or view it on GitHub https://github.com/pydata/patsy/issues/46.
Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
FWIW, it's easy enough to do this in some settings:
X = patsy.dmatrices("y ~ col0 * col1 - 1", data, return_type="dataframe")[1]
X_csr = scipy.sparse.csr_matrix(X.values)
Right, but this requires that you first create the dense matrix in memory, which kinda defeats the purpose in a lot of cases :-) On 11 Jan 2015 20:45, "Kyle Beauchamp" [email protected] wrote:
FWIW, it's easy enough to do this in some settings:
X = patsy.dmatrices("y ~ col0 * col1 - 1", data, return_type="dataframe")[1] X_csr = scipy.sparse.csr_matrix(X.values)
— Reply to this email directly or view it on GitHub https://github.com/pydata/patsy/issues/46#issuecomment-69510599.
Agreed. My particular use case was for doing MCMC with the sparse matrix via pymc2, however, so I was able to get massive speedups when taking products of model parameters and the featurized input data.
@njsmith could you provide some pointers on what would be needed for patsy to support generating sparse matrices. Would be interested in submitting a patch.
I would be happy to help. I just don't have time to head on this.
On Fri, Jan 3, 2020, 12:50 seanv507 [email protected] wrote:
@njsmith https://github.com/njsmith could you provide some pointers on what would be needed for patsy to support generating sparse matrices. Would be interested in submitting a patch.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pydata/patsy/issues/46?email_source=notifications&email_token=ABKTSRP62R445OJC2QFND2TQ34Q73A5CNFSM4AQXKTP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIA62UI#issuecomment-570551633, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABKTSRN7I2GHOZB5DGYUDG3Q34Q73ANCNFSM4AQXKTPQ .
I'd also be interested in this! @njsmith do you have pointers on what would be needed to build this patch? I'd be happy to help
Hi all,
I was interested in this problem for speeding up some of my code. I've been relying on the excellent patsy library, but the datasets I use tend to have a lot of data points and splines can end up being slow to compute. I wrote up a faster b-spline basis implementation that uses scipy.sparse. I'm using it in this (open) pull request on the package I develop (lightkurve).
As written below this provides me with a roughly 2x speed up over the non-sparse patsy implementation, and uses 25% of the memory compared to non-sparse patsy.
I think this could still be improved upon and sped up, but if anyone needs a sparse implementation of b-splines, this is a good place to start.
import numpy as np
from scipy.sparse import csr_matrix, vstack
from numba import jit
# jit provides a massive speed up
@jit(nopython=True)
def basis(x, degree, i, knots):
"""Create a spline basis for an input x, for the ith knot
Parameters
----------
x : np.ndarray
Input x
degree : int
Degree of spline to calculate basis for
i : int
The index of the knot to calculate the basis for
knots : np.ndarray
Array of all knots
"""
if degree == 0:
B = np.zeros(len(x))
B[(x >= knots[i]) & (x <= knots[i+1])] = 1
#B = (B)
else:
# alpha1, alpha2 = 0, 0
da = (knots[degree + i] - knots[i])
db = (knots[i + degree + 1] - knots[i + 1])
if ((knots[degree + i] - knots[i]) != 0):
alpha1 = ((x - knots[i])/da)
else:
alpha1 = np.zeros(len(x))
if ((knots[i+degree+1] - knots[i+1]) != 0):
alpha2 = ((knots[i + degree + 1] - x)/db)
else:
alpha2 = np.zeros(len(x))
B = (basis(x, (degree-1), i, knots)) * (alpha1) + (basis(x, (degree-1), (i+1), knots)) * (alpha2)
return B
def create_sparse_spline_matrix(x, n_knots=20, knots=None, degree=3):
"""Returns a scipy.sparse.csr_matrix which contains the B-spline basis
Parameters
----------
x : np.ndarray
vector to spline
n_knots: int
Number of knots (default: 20).
knots : np.ndarray [optional]
Optional array containing knots
degree: int
Polynomial degree.
Returns
-------
spline_dm: scipy.sparse.csr_matrix
B-spline basis matrix
"""
# To use jit we have to use float64
x = np.asarray(x, np.float64)
if not isinstance(n_knots, int):
raise ValueError('`n_knots` must be an integer.')
if n_knots - degree <= 0:
raise ValueError('n_knots must be greater than degree.')
if (knots is None) and (n_knots is not None):
knots = np.asarray([s[-1] for s in np.array_split(np.argsort(x), n_knots - degree)[:-1]])
knots = [np.mean([x[k], x[k + 1]]) for k in knots]
elif (knots is None) and (n_knots is None):
raise ValueError('Pass either `n_knots` or `knots`.')
knots = np.append(np.append(x.min(), knots), x.max())
knots = np.unique(knots)
knots_wbounds = np.append(np.append([x.min()] * (degree - 1), knots), [x.max()] * (degree))
matrices = [csr_matrix(basis(x, degree, idx, knots_wbounds)) for idx in np.arange(-1, len(knots_wbounds) - degree - 1)]
spline_dm = vstack([m for m in matrices if (m.sum() != 0) ], format='csr').T
return spline_dm
Hi all! I also tackled this problem late last year, and the result was a new Python library called Formulaic. It is orders of magnitude faster than patsy, even beating R in many cases. I'd welcome feedback there :).
Documentation is still lacking, but you can do:
import pandas
from formulaic import Formula
df = pandas.DataFrame({
'y': [0,1,2],
'x': ['A', 'B', 'C'],
'z': [0.3, 0.1, 0.2],
})
y,X = Formula('y ~ x + z').get_model_matrix(df, sparse=True)
which will result in a sparse model matrix.
@matthewwardrop I've tried to use the package and the following script works :)
X = Formula('x + z').get_model_matrix(df, output='sparse')
Yes... Things have evolved! Glad it worked for you!