differentialabundance icon indicating copy to clipboard operation
differentialabundance copied to clipboard

More flexible model and contrast definition

Open grst opened this issue 1 year ago • 12 comments

Description of feature

Currently, models and contrasts to compare are defined in the contrast.csv file as follows:

id,variable,reference,target,blocking

This limits contrasts to pairwise comparisons of a single variable and the model definition is implicit based on the variable and blocking variables. More complex comparisons such as interaction terms or mixed effects are not possible or require workarounds like constructing artificial variables combining multiple variables.

I propose to make the model definition more flexible by

  • explicitly specifying the model using a Wilkinson Formula (e.g. ~ treatment + response (variable + covariate) , ~ treatment * response (interaction model). This would also extend to mixed effects models, should they be added in the future, e.g. ~ treatment * timepoint + (1 | patient_id)
  • Define one or multiple contrasts per model. Multiple tests can be performed for a single model. This is also computationally more efficient since the model only needs to be fitted once.

How to define contrasts

There are different ways to define contrasts beyond simple comparisons. We need to discuss which ones to support and how.

  1. (variable, baseline, target) tuples -> I'd definitely keep this one way or another as this cover a lot of cases and is very intutitive to define. However, we still need alternatives for more complex cases
  2. Specifying coefficient names, e.g. treatmenta:responseresponder
  3. Contrast formula, e.g. responseresponder - responsenon_responder, or cond(response="responder") - cond(response="non_responder")
  4. Contrast vector (the most general case), e.g. (0, -1, 1, 0, 0, 0, 0)

The challenge for 2-3 is to define this in a way that's method-agnostic. We'd likely need a step that converts a coefficient name or contrast formula into a contrast vector. One such way is limma's makeContrasts function. Another way is the cond() method implemented in glmGamPoi:

cond(response="responder", treatment="A") - cond(response="non_responder", treatment="B")

If anyone's aware of other alternatives, I'm happy to consider them.

Configuration format

In principle it would still be possible to use a CSV file, e.g.

id formula contrast comment
treatment_a_vs_b ~ treatment + response treatment;a;b simple comparison
response_responder_vs_non_responder ~ treatment + response response;responder;non_responder same model, different contrast
response_responder_vs_non_responder2 ~ treatment + response responseresponder - responsenon_responder same comparison, different way to specify contrast
treatment_response_interaction ~ treatment * response treatmenta:responseresponder listing a coefficient rather than a comparison

Alternatively, I could see benefits from switching to json/yaml with an appropriate json schema for validation. IMO multiple contrasts per model and different ways to specify contrasts could be better represented with a hierarchical configuration format than csv. Having ;-separated fields within a csv column as it is currently implemented for blocking is a bit of a red flag as it's hard to read and hard to validate.

models:
  - formula: "~ treatment + response"
    contrasts: 
      - id: treatment_a_vs_b
        type: simple
        comparison: ["treatment", "A", "B"]
      - id: response_responder_vs_non_responder2
        type: formula
        comparison:  responseresponder - responsenon_responder
      - id: response_responder_vs_non_responder2
        type: cond
        comparison:  "cond(response="responder") - cond(response="non_responder")"
  - formula: "~ treatment * response"
    contrasts:
      - id: treatment_response_interaction
        type: coefficient
        comparison: treatmenta:responseresponder

LMK what you think

CC @apeltzer @tschwarzl @nschcolnicov @atrigila @alanmmobbs93 FYI @suzannejin

grst avatar Dec 02 '24 08:12 grst