formulaic Draft: Add support for mixed effects.

Draft: Add support for mixed effects.

Open matthewwardrop opened this issue 3 years ago • 7 comments

Hi @tomicapretto,

I stumbled across your cool formulae project recently, and it piqued my interest. One of the primary goals of formulaic was to be easily extensible to support new use-cases, much like that of formulae. As such, I wanted to try my hand at a quick extension to Formulaic that would support your use-case, while keeping it general so that others can plug into it also. I don't necessarily want you to throw away your excellent work on Formulae, nor is this PR immediately ready for merging... but I'd love to get your take on this. Perhaps there is opportunity for synergy.

With this PR, you can do:

import pandas
import numpy as np
n = 1000000
df = pandas.DataFrame(
    dict(
        y=np.random.normal(size=n),
        x1=np.random.normal(size=n),
        x2=np.random.normal(size=n),
        m=np.random.choice(list("abc"), size=n),
        n=np.random.choice(list("abc"), size=n),
        e=np.random.normal(size=n),
    )
)

from formulaic import model_matrix

mm = model_matrix("x1 + (x2|m) + (x1||n)", df)
mm

>         Intercept        x1  m[T.a]:x2  m[T.b]:x2  m[T.c]:x2  n[T.b]:x1  \
> 0             1.0 -0.638156  -0.000000  -0.465160  -0.000000  -0.000000   
> 1             1.0 -1.198965   0.000000   0.375026   0.000000  -1.198965   
> 2             1.0 -0.616979  -0.000000  -0.000000  -0.146729  -0.000000   
> 3             1.0 -0.579646  -0.000000  -0.702864  -0.000000  -0.000000   
> 4             1.0  1.837394   0.455112   0.000000   0.000000   1.837394   
> ...           ...       ...        ...        ...        ...        ...   
> 999995        1.0 -0.925889   0.867100   0.000000   0.000000  -0.925889   
> 999996        1.0  1.256724  -0.000000  -1.150915  -0.000000   1.256724   
> 999997        1.0  0.456776   0.352201   0.000000   0.000000   0.000000   
> 999998        1.0 -0.610327   0.384456   0.000000   0.000000  -0.000000   
> 999999        1.0 -0.840609   0.000000   0.191039   0.000000  -0.000000   
> 
>         n[T.c]:x1  
> 0       -0.000000  
> 1       -0.000000  
> 2       -0.000000  
> 3       -0.000000  
> 4        0.000000  
> ...           ...  
> 999995  -0.000000  
> 999996   0.000000  
> 999997   0.456776  
> 999998  -0.610327  
> 999999  -0.000000  
>
>[1000000 rows x 7 columns]

mm.subset() # Common/fixed effects
>         Intercept        x1
> 0             1.0 -0.638156
> 1             1.0 -1.198965
> 2             1.0 -0.616979
> 3             1.0 -0.579646
> 4             1.0  1.837394
> ...           ...       ...
> 999995        1.0 -0.925889
> 999996        1.0  1.256724
> 999997        1.0  0.456776
> 999998        1.0 -0.610327
> 999999        1.0 -0.840609
> 
> [1000000 rows x 2 columns]

mm.subset(['group']) # Group/Random effects
>         m[T.a]:x2  m[T.b]:x2  m[T.c]:x2
> 0       -0.000000  -0.465160  -0.000000
> 1        0.000000   0.375026   0.000000
> 2       -0.000000  -0.000000  -0.146729
> 3       -0.000000  -0.702864  -0.000000
> 4        0.455112   0.000000   0.000000
> ...           ...        ...        ...
> 999995   0.867100   0.000000   0.000000
> 999996  -0.000000  -1.150915  -0.000000
> 999997   0.352201   0.000000   0.000000
> 999998   0.384456   0.000000   0.000000
> 999999   0.000000   0.191039   0.000000
> 
> [1000000 rows x 3 columns]

mm.subset(['group_independent'])  # Group/Random effects assuming independence (missing one column due to enforcing of structural linear independence; can be disabled)
>         n[T.b]:x1  n[T.c]:x1
> 0       -0.000000  -0.000000
> 1       -1.198965  -0.000000
> 2       -0.000000  -0.000000
> 3       -0.000000  -0.000000
> 4        1.837394   0.000000
> ...           ...        ...
> 999995  -0.925889  -0.000000
> 999996   1.256724   0.000000
> 999997   0.000000   0.456776
> 999998  -0.000000  -0.610327
> 999999  -0.000000  -0.000000
> 
> [1000000 rows x 2 columns]

mm.subset(['group', 'group_independent'], exact=False)  # All Group/Random effects
>         m[T.a]:x2  m[T.b]:x2  m[T.c]:x2  n[T.b]:x1  n[T.c]:x1
> 0       -0.000000  -0.465160  -0.000000  -0.000000  -0.000000
> 1        0.000000   0.375026   0.000000  -1.198965  -0.000000
> 2       -0.000000  -0.000000  -0.146729  -0.000000  -0.000000
> 3       -0.000000  -0.702864  -0.000000  -0.000000  -0.000000
> 4        0.455112   0.000000   0.000000   1.837394   0.000000
> ...           ...        ...        ...        ...        ...
> 999995   0.867100   0.000000   0.000000  -0.925889  -0.000000
> 999996  -0.000000  -1.150915  -0.000000   1.256724   0.000000
> 999997   0.352201   0.000000   0.000000   0.000000   0.456776
> 999998   0.384456   0.000000   0.000000  -0.000000  -0.610327
> 999999   0.000000   0.191039   0.000000  -0.000000  -0.000000
> 
> [1000000 rows x 5 columns]

For something like this to be tenable for your use-cases, what further additions would be desirable? One thing I think could definitely be added pretty easily is aliasing of the terms to match the "|" notation.

Jul 11 '21 21:07 matthewwardrop

Hi @matthewwardrop, it's great news you're adding this to your Formulaic package! It's been an excellent source of inspiration for my formulae library. I'm happy to help you with this new feature. I'm not familiar with the codebase in Formulaic, but I'll try to give you some pointers that were useful to me when implementing this feature.

Compare the results you obtain with the ones from the lme4 package in R. Their interface to generate and access matrices for random effects is not the simplest one, but it's worth checking against their result to ensure you're not missing something. This is an example of what I mean.
I'm not sure how you're computing the matrix of the random effects, but here is a summary of how it is done in formulae (taken from lme4)

Zi = linalg.khatri_rao(Ji.T, Xi.T).T

Suppose you have a term predictor|group, then:

Xi is the design matrix that results when you consider predictor as a fixed effect (i.e. the result of doing model_matrix("0 + predictor"). Note that I'm dropping the intercept. Its corresponding design matrix is created separately, where Xi is a column vector of ones, i.e. the result of model_matrix("1").
Ji is a matrix of indicators. It's the result of doing model_matrix("0 + group"). Note again the intercept is dropped, so each column in Ji corresponds to a group in group.

This is much better explained in section 2.3 of Fitting Linear Mixed Effects Models using lme4.

It's great you're adding the || operator. But I'm not sure why you are dropping a group. Here's something you can do to check what's the meaning of that operator in lme4.

library(lme4)
library(plot.matrix)

# Random effects matrices using `||`
f1 = ~ Days + (Days || Subject)
terms = mkReTrms(findbars(f1), model.frame(subbars(f1), data = sleepstudy))

dim(t(terms$Ztlist$`1 | Subject`))         # 180 18
dim(t(terms$Ztlist$`0 + Days | Subject`))  # 180 18

# Random effects matrices using `|`
f2 = ~ Days + (Days | Subject)
terms2 = mkReTrms(findbars(f2), model.frame(subbars(f2), data = sleepstudy))

dim(t(terms2$Ztlist$`Days | Subject`))     # 180 36

interleave_matrices = function(m1, m2) {
  # m1 and m2 have the same number of columns
  m = matrix(nrow = nrow(m1), ncol = 2 * ncol(m1))
  for (j in seq(ncol(m1))) {
    m[, j * 2 - 1] = m1[, j]
    m[, j * 2] = m2[, j]
  }
  return(m)
}

m1 = interleave_matrices(
  as.matrix(t(terms$Ztlist$`1 | Subject`)),
  as.matrix(t(terms$Ztlist$`0 + Days | Subject`))
)

m2 = as.matrix(t(terms2$Ztlist$`Days | Subject`))
all(m1 == m2)
# TRUE

plot(as.matrix(m1), breaks = 10, col = viridis::viridis, main = "Double pipe ||")
plot(as.matrix(m2), breaks = 10, col = viridis::viridis, main = "Single pipe |")

# Conclusion: 
# Using `||` gives separated matrices for the intercept and the slope. This means
# that b_0 ~ N and b_1 ~ N (the effects for the intercept and the slope are 
# modeled as separated, uncorrelated, multivariate normal distributions). 
# On the other hand, using `|` results in one matrix for both the intercept and 
# the slope. Thus, (b_0, b_1) ~ N and the effects for the intercept and the slope
# have a joint multivariate normal distribution.


# In summary, using `||` does not change the columns of the design matrix, but
# how they are considered in terms of their correlation when fitting the model.

single_pipe double_pipe

Feel free to reach out to discuss further this new feature. It's great you're adding it to Formulaic! :D

Jul 12 '21 14:07 tomicapretto

Thanks @tomicapretto for your prompt reply and discussion in this space!

The Khatri Rao product is equivalent (unless I'm greatly mistaken) to the standard model matrix generation process (Cartesian product of colums generated by each factor vs. column-wise Kronecker product of the columns of the transposed matrices), which is nice because you don't need to do anything special. And yes... the || operator should only change the interpretation during modelling. In this case the dropping of the column in the || case was due to the algorithm used to ensure the entire model matrix remains structurally linearly independent (otherwise OLS algorithms and the like are guaranteed to have extra gauge degrees of freedom that will mess up the computation, and if done by matrix inversion will fail due to singular matrices); and could be fixed here by disabling the structural check: model_matrix(..., ensure_full_rank=False). In this case, I think we we want to "partition" the linear independence algorithm so that the fixed and random effect pieces are independently structurally linearly independent, but not enforce this across the entire set of matrices. Does this sound right to you? Or do you always need all the (linearly dependent) columns in the random effects matrices? I'll probably have to review the EM algorithm to convince myself of this.

Thanks again! And I'll ping you if/when I get something more solid!

Jul 12 '21 16:07 matthewwardrop

In this case, I think we we want to "partition" the linear independence algorithm so that the fixed and random effect pieces are independently structurally linearly independent, but not enforce this across the entire set of matrices. Does this sound right to you? Or do you always need all the (linearly dependent) columns in the random effects matrices? I'll probably have to review the EM algorithm to convince myself of this.

As far as I know, it's not necessary to enforce linear independence for the columns of the entire matrix Z, neither to enforce these to be independent of X. In fact, both lme4 and formulae can result in Z matrices where rank(Z) < ncol(Z). See the following example where there are two random effects for intercepts.

model = lmer(diameter ~ 1 + (1|plate) + (1|sample), Penicillin)
mm = as.matrix(t(model@pp$`.->Zt`))
ncol(mm)       # 30
qr(mm)$rank # 29

Next, we can see that the columns of Z for (1|plate) and the columns of Z for (1|sample) add up to a vector of ones, which is not a surprise since they are both matrices for random intercepts.

f1 = ~ 1 + (1|plate) + (1|sample)
terms = mkReTrms(findbars(f1), model.frame(subbars(f1), data = Penicillin))

Z1 = as.matrix(t(terms$Ztlist$`1 | plate`))   # 144 24
Z2 = as.matrix(t(terms$Ztlist$`1 | sample`))  # 144 6

f1 = ~ 1 + (1|plate) + (1|sample)
terms = mkReTrms(findbars(f1), model.frame(subbars(f1), data = Penicillin))

Z1 = as.matrix(t(terms$Ztlist$`1 | plate`))   # 144 24
Z2 = as.matrix(t(terms$Ztlist$`1 | sample`))  # 144 6

all(apply(Z1, 1, sum) == 1) # TRUE
all(apply(Z1, 1, sum) == apply(Z1, 1, sum)) # TRUE

Also this can be useful.

Note, however, the situation described in the post is not the same as the one described in my examle. In the example, we have two partitions of Z that are independent within (their columns are linearly independent) but dependent between (the sum of the columns of both blocks are equal). I'm sorry if this is not the right technical language, I hope I can transmit the idea. On the other hand, the example in MixedModels.jl docs uses a categorical variable that with full encoding (i.e. one column per level of the variable) without checking if that introduces linear dependencies. This last situation can't happen in formulae because it is prevented the same way as in the matrix of the fixed effects.

Well, I hope this helps with your work. If I'm not very clear, please don't hesitate to ask ping me again :smile:

Jul 13 '21 21:07 tomicapretto

Hi @tomicapretto ,

Finally circling back around to this. Formulaic has come a long way since this original draft PR, and now supports arbitrarily structured formulae. There are perhaps a few gaps, but this now works like:

>>> from formulaic import Formula
>>> Formula("(a|g) + (b*c|g*h) + (d||i)")
root:
    1
.group:
    a|g + b|g + b|h + c|g + c|h + b:c|g + b:c|h + b|g:h + c|g:h + b:c|g:h
.group_independent:
    d||i
>>> Formula("y ~ x + (a+b|g)")
.lhs:
    y
.rhs:
    root:
        1 + x
    .group:
        a|g + b|g

Compared to the prior draft, each of the group terms also preserve their rank under the ensure_full_rank options, which means that the model matrices are complete as they were not in the past.

The only thing left to do, which is perhaps optional given that the information about grouping is now in the model spec, is to cause the model matrices to preserve the '|' in the column names:

>>> df = pandas.DataFrame({"a": [1,2,3,4], "g": list('wxyz')})
>>> mm = Formula("(a|g)").get_model_matrix(df)
>>> mm
root:
       Intercept
    0        1.0
    1        1.0
    2        1.0
    3        1.0
.group:
       g[T.w]:a  g[T.x]:a  g[T.y]:a  g[T.z]:a
    0         1         0         0         0
    1         0         2         0         0
    2         0         0         3         0
    3         0         0         0         4
>>> mm.model_spec.group.terms
[a|g]
>>> mm.model_spec.group.term_indices
OrderedDict([(a|g, (0, 1, 2, 3))])
...

Fixing this would be a little annoying and special-cased, but definitely doable.

Thoughts?

Oct 07 '22 17:10 matthewwardrop

@matthewwardrop thanks for tagging me here! I'll provide feedback within this week! I'm excited about this change :smile:

Oct 11 '22 11:10 tomicapretto

@matthewwardrop first of all, congrats because of all the progress with formulaic, it's definitely a fantastic library and I think I have never seen something even closer to this in the entire Python ecosystem. This s a wonderful solution.

About the group-specific terms, the ~~RHS~~ LHS of | acts as a regular model formula. It is, there's an implicit intercept as well. When doing Formula("(a|g)").get_model_matrix(df) I would expect the same result than when doing Formula("(1 + a| g)").get_model_matrix(df). See:

import pandas as pd
from formulaic import Formula
df = pd.DataFrame({"a": [1, 2, 3, 4], "g": list('wxyz')})
Formula("(a | g)").get_model_matrix(df)

root:
       Intercept
    0        1.0
    1        1.0
    2        1.0
    3        1.0
.group:
       g[T.w]:a  g[T.x]:a  g[T.y]:a  g[T.z]:a
    0         1         0         0         0
    1         0         2         0         0
    2         0         0         3         0
    3         0         0         0         4

Formula("(1 + a | g)").get_model_matrix(df)

root:
       Intercept
    0        1.0
    1        1.0
    2        1.0
    3        1.0
.group:
       g[T.w]  g[T.x]  g[T.y]  g[T.z]  g[T.w]:a  g[T.x]:a  g[T.y]:a  g[T.z]:a
    0       1       0       0       0         1         0         0         0
    1       0       1       0       0         0         2         0         0
    2       0       0       1       0         0         0         3         0
    3       0       0       0       1         0         0         0         4

On top of that, there are a couple of things about names that could be changed

a:g would be better than g:a
a|g would be better than a:g
Removing the T. from the treatment encoding would be better I think. I know you can still say that you're using a treatment encoding, but what's actually happening is that you're defining a design matrix conditional on the levels of the grouping variable g.
1|g for the name would be better than g

I don't think these points would be critical for a developer trying to use formulaic since these are objects within the .group Structure (I'm not sure if that's the right name) so it can be deduced they're terms for a group-specific term.

Another issue. Since the LHS of | is treated as a regular formula, we need to consider the same constrains as when we encode a categorical predictor in the presence of an intercept (e.g. Treatment encoding dropping a level of the categorical predictor).

See the following examples

import numpy as np
import pandas as pd

from formulaic import Formula
from formulae import design_matrices

df = pd.DataFrame({"g": np.repeat(list("ABCD"), 3), "x1": np.tile(list("xyz"), 4)})

Using formulae

dm = design_matrices("(x1 | g)", df)
dm.group

GroupEffectsMatrix with shape (12, 12)
Terms:  
  1|g  
    kind: intercept
    groups: ['A', 'B', 'C', 'D']
    columns: 0:4
  x1|g  
    kind: categoric
    groups: ['A', 'B', 'C', 'D']
    levels: ['y', 'z']
    columns: 4:12

dm.group.terms["x1|g"].labels

['x1[y]|g[A]',
 'x1[z]|g[A]',
 'x1[y]|g[B]',
 'x1[z]|g[B]',
 'x1[y]|g[C]',
 'x1[z]|g[C]',
 'x1[y]|g[D]',
 'x1[z]|g[D]']

If we drop the intercept from the LHS of the group-specific term, then we have all the levels of x1

dm = design_matrices("(0 + x1 | g)", df)
print(dm.group)
print(dm.group.terms["x1|g"].labels)

GroupEffectsMatrix with shape (12, 12)
Terms:  
  x1|g  
    kind: categoric
    groups: ['A', 'B', 'C', 'D']
    levels: ['x', 'y', 'z']
    columns: 0:12

['x1[x]|g[A]', 'x1[y]|g[A]', 'x1[z]|g[A]', 'x1[x]|g[B]', 'x1[y]|g[B]', 'x1[z]|g[B]', 'x1[x]|g[C]', 'x1[y]|g[C]', 'x1[z]|g[C]', 'x1[x]|g[D]', 'x1[y]|g[D]', 'x1[z]|g[D]']

And if we use formulaic, this is still not happening

Formula("(0 + x1 | g)").get_model_matrix(df).group.columns

Index(['g[T.A]:x1[T.x]', 'g[T.A]:x1[T.y]', 'g[T.A]:x1[T.z]', 'g[T.B]:x1[T.x]',
       'g[T.B]:x1[T.y]', 'g[T.B]:x1[T.z]', 'g[T.C]:x1[T.x]', 'g[T.C]:x1[T.y]',
       'g[T.C]:x1[T.z]', 'g[T.D]:x1[T.x]', 'g[T.D]:x1[T.y]', 'g[T.D]:x1[T.z]'],
      dtype='object')

Formula("(1 + x1 | g)").get_model_matrix(df).group.columns

Index(['g[T.A]', 'g[T.B]', 'g[T.C]', 'g[T.D]', 'g[T.A]:x1[T.x]',
       'g[T.A]:x1[T.y]', 'g[T.A]:x1[T.z]', 'g[T.B]:x1[T.x]', 'g[T.B]:x1[T.y]',
       'g[T.B]:x1[T.z]', 'g[T.C]:x1[T.x]', 'g[T.C]:x1[T.y]', 'g[T.C]:x1[T.z]',
       'g[T.D]:x1[T.x]', 'g[T.D]:x1[T.y]', 'g[T.D]:x1[T.z]'],
      dtype='object')

Oct 14 '22 23:10 tomicapretto

This would be the last example using R and the lme4 library

> library(lme4)
> 
> df <- data.frame(
+  g = rep(LETTERS[1:4], each = 3),
+  x1 = rep(letters[24:26], times = 4)
+ )
> 
> print(df)
   g x1
1  A  x
2  A  y
3  A  z
4  B  x
5  B  y
6  B  z
7  C  x
8  C  y
9  C  z
10 D  x
11 D  y
12 D  z
> 
> f1 <- ~ (x1 | g)
> f2 <- ~ (0 + x1 | g)
> 
> lme4_terms_f1 <- mkReTrms(findbars(f1), model.frame(subbars(f1), data = df))
> lme4_terms_f2 <- mkReTrms(findbars(f2), model.frame(subbars(f2), data = df))
> 
> names(lme4_terms_f1$Ztlist)
[1] "x1 | g"
> names(lme4_terms_f2$Ztlist)
[1] "0 + x1 | g"
> 
> x1_g_f1 <- as.matrix(t(lme4_terms_f1$Ztlist$`x1 | g`))
> x1_g_f2 <- as.matrix(t(lme4_terms_f2$Ztlist$`0 + x1 | g`))
> 
> print(x1_g_f1)
   A A A B B B C C C D D D
1  1 0 0 0 0 0 0 0 0 0 0 0
2  1 1 0 0 0 0 0 0 0 0 0 0
3  1 0 1 0 0 0 0 0 0 0 0 0
4  0 0 0 1 0 0 0 0 0 0 0 0
5  0 0 0 1 1 0 0 0 0 0 0 0
6  0 0 0 1 0 1 0 0 0 0 0 0
7  0 0 0 0 0 0 1 0 0 0 0 0
8  0 0 0 0 0 0 1 1 0 0 0 0
9  0 0 0 0 0 0 1 0 1 0 0 0
10 0 0 0 0 0 0 0 0 0 1 0 0
11 0 0 0 0 0 0 0 0 0 1 1 0
12 0 0 0 0 0 0 0 0 0 1 0 1
> print(x1_g_f2)
   A A A B B B C C C D D D
1  1 0 0 0 0 0 0 0 0 0 0 0
2  0 1 0 0 0 0 0 0 0 0 0 0
3  0 0 1 0 0 0 0 0 0 0 0 0
4  0 0 0 1 0 0 0 0 0 0 0 0
5  0 0 0 0 1 0 0 0 0 0 0 0
6  0 0 0 0 0 1 0 0 0 0 0 0
7  0 0 0 0 0 0 1 0 0 0 0 0
8  0 0 0 0 0 0 0 1 0 0 0 0
9  0 0 0 0 0 0 0 0 1 0 0 0
10 0 0 0 0 0 0 0 0 0 1 0 0
11 0 0 0 0 0 0 0 0 0 0 1 0
12 0 0 0 0 0 0 0 0 0 0 0 1
> 
> print(lme4_terms_f1$cnms)
$g
[1] "(Intercept)" "x1y"         "x1z"        

> print(lme4_terms_f2$cnms)
$g
[1] "x1x" "x1y" "x1z"

Edit I'm more than happy to help with whatever I can help :D

Oct 14 '22 23:10 tomicapretto

formulaic formulaic copied to clipboard

Draft: Add support for mixed effects.

formulaic
formulaic copied to clipboard