patsy Sparse matrix from formula

Is it currently possible to build a sparse design matrix from a formula? This is desirable for formulas with lot of categorical variables and their interactions, which are naturally represented as a sparse matrix of zeros and a relatively few ones. R sparse.model.matrix produces an sparse matrix but, when the model includes interactions, it fallbacks to the standard model.matrix so a dense matrix is created as an intermediate step. I don't know if patsy includes any sparse matrix support at all; in case it doesn't consider this a request for enhancement.

Jun 19 '14 18:06 memeplex

Hello,

No, at the moment patsy has no support for directly consuming or producing sparse matrices. It's been on the TODO list since the beginning, and there's no particular obstacle to doing so, it's just that neither I nor anyone else have gotten around to implementing it. (If you want to have a go, I'll happily review the patch ;-).)

On Thu, Jun 19, 2014 at 7:02 PM, memeplex [email protected] wrote:

Is it currently possible to build a sparse design matrix from a formula? This is desirable for formulas with lot of categorical variables and their interactions, which are naturally represented as a sparse matrix of zeros and a relatively few ones. R sparse.model.matrix produces an sparse matrix but, when the model includes interactions, it fallbacks to the standard model.matrix so a dense matrix is created as an intermediate step. I don't know if patsy includes any sparse matrix support at all; in case it doesn't consider this a request for enhancement.

— Reply to this email directly or view it on GitHub https://github.com/pydata/patsy/issues/46.

Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

Jun 19 '14 18:06 njsmith

FWIW, it's easy enough to do this in some settings:

X = patsy.dmatrices("y ~ col0 * col1 - 1", data, return_type="dataframe")[1]
X_csr = scipy.sparse.csr_matrix(X.values)

Jan 11 '15 20:01 kyleabeauchamp

Right, but this requires that you first create the dense matrix in memory, which kinda defeats the purpose in a lot of cases :-) On 11 Jan 2015 20:45, "Kyle Beauchamp" [email protected] wrote:

FWIW, it's easy enough to do this in some settings:

X = patsy.dmatrices("y ~ col0 * col1 - 1", data, return_type="dataframe")[1] X_csr = scipy.sparse.csr_matrix(X.values)

— Reply to this email directly or view it on GitHub https://github.com/pydata/patsy/issues/46#issuecomment-69510599.

Jan 11 '15 21:01 njsmith

Agreed. My particular use case was for doing MCMC with the sparse matrix via pymc2, however, so I was able to get massive speedups when taking products of model parameters and the featurized input data.

Jan 11 '15 21:01 kyleabeauchamp

@njsmith could you provide some pointers on what would be needed for patsy to support generating sparse matrices. Would be interested in submitting a patch.

Jan 03 '20 11:01 seanv507

I would be happy to help. I just don't have time to head on this.

On Fri, Jan 3, 2020, 12:50 seanv507 [email protected] wrote:

@njsmith https://github.com/njsmith could you provide some pointers on what would be needed for patsy to support generating sparse matrices. Would be interested in submitting a patch.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pydata/patsy/issues/46?email_source=notifications&email_token=ABKTSRP62R445OJC2QFND2TQ34Q73A5CNFSM4AQXKTP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIA62UI#issuecomment-570551633, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABKTSRN7I2GHOZB5DGYUDG3Q34Q73ANCNFSM4AQXKTPQ .

Jan 03 '20 13:01 bashtage

I'd also be interested in this! @njsmith do you have pointers on what would be needed to build this patch? I'd be happy to help

Apr 03 '20 23:04 christinahedges

Hi all,

I was interested in this problem for speeding up some of my code. I've been relying on the excellent patsy library, but the datasets I use tend to have a lot of data points and splines can end up being slow to compute. I wrote up a faster b-spline basis implementation that uses scipy.sparse. I'm using it in this (open) pull request on the package I develop (lightkurve).

As written below this provides me with a roughly 2x speed up over the non-sparse patsy implementation, and uses 25% of the memory compared to non-sparse patsy.

I think this could still be improved upon and sped up, but if anyone needs a sparse implementation of b-splines, this is a good place to start.


import numpy as np
from scipy.sparse import csr_matrix, vstack
from numba import jit

# jit provides a massive speed up
@jit(nopython=True)
def basis(x, degree, i, knots):
    """Create a spline basis for an input x, for the ith knot

    Parameters
    ----------
    x : np.ndarray
        Input x
    degree : int
        Degree of spline to calculate basis for
    i : int
        The index of the knot to calculate the basis for
    knots : np.ndarray
        Array of all knots
    """
    if degree == 0:
        B = np.zeros(len(x))
        B[(x >= knots[i]) & (x <= knots[i+1])] = 1
        #B = (B)
    else:
#        alpha1, alpha2 = 0, 0
        da = (knots[degree + i] - knots[i])
        db = (knots[i + degree + 1] - knots[i + 1])
        if ((knots[degree + i] - knots[i]) != 0):
            alpha1 = ((x - knots[i])/da)
        else:
            alpha1 = np.zeros(len(x))
        if ((knots[i+degree+1] - knots[i+1]) != 0):
            alpha2 = ((knots[i + degree + 1] - x)/db)
        else:
            alpha2 = np.zeros(len(x))
        B = (basis(x, (degree-1), i, knots)) * (alpha1) + (basis(x, (degree-1), (i+1), knots)) * (alpha2)
    return B


def create_sparse_spline_matrix(x, n_knots=20, knots=None, degree=3):
    """Returns a scipy.sparse.csr_matrix which contains the B-spline basis

    Parameters
    ----------
    x : np.ndarray
        vector to spline
    n_knots: int
        Number of knots (default: 20).
    knots : np.ndarray [optional]
        Optional array containing knots
    degree: int
        Polynomial degree.

    Returns
    -------
    spline_dm: scipy.sparse.csr_matrix
        B-spline basis matrix
    """

    # To use jit we have to use float64
    x = np.asarray(x, np.float64)

    if not isinstance(n_knots, int):
        raise ValueError('`n_knots` must be an integer.')
    if n_knots - degree <= 0:
        raise ValueError('n_knots must be greater than degree.')

    if (knots is None)  and (n_knots is not None):
        knots = np.asarray([s[-1] for s in np.array_split(np.argsort(x), n_knots - degree)[:-1]])
        knots = [np.mean([x[k], x[k + 1]]) for k in knots]
    elif (knots is None)  and (n_knots is None):
        raise ValueError('Pass either `n_knots` or `knots`.')

    knots = np.append(np.append(x.min(), knots), x.max())
    knots = np.unique(knots)

    knots_wbounds = np.append(np.append([x.min()] * (degree - 1), knots), [x.max()] * (degree))
    matrices = [csr_matrix(basis(x, degree, idx, knots_wbounds)) for idx in np.arange(-1, len(knots_wbounds) - degree - 1)]
    spline_dm = vstack([m for m in matrices if (m.sum() != 0) ], format='csr').T
    return spline_dm

Apr 23 '20 02:04 christinahedges

Hi all! I also tackled this problem late last year, and the result was a new Python library called Formulaic. It is orders of magnitude faster than patsy, even beating R in many cases. I'd welcome feedback there :).

Documentation is still lacking, but you can do:

import pandas
from formulaic import Formula

df = pandas.DataFrame({
    'y': [0,1,2],
    'x': ['A', 'B', 'C'],
    'z': [0.3, 0.1, 0.2],
})

y,X = Formula('y ~ x + z').get_model_matrix(df, sparse=True)

which will result in a sparse model matrix.

Jul 06 '20 17:07 matthewwardrop

@matthewwardrop I've tried to use the package and the following script works :) X = Formula('x + z').get_model_matrix(df, output='sparse')

Apr 12 '22 16:04 jykr

Yes... Things have evolved! Glad it worked for you!

Apr 12 '22 20:04 matthewwardrop