differentialabundance
differentialabundance copied to clipboard
More flexible model and contrast definition
Description of feature
Currently, models and contrasts to compare are defined in the contrast.csv file as follows:
id,variable,reference,target,blocking
This limits contrasts to pairwise comparisons of a single variable and the model definition is implicit based on the variable and blocking variables. More complex comparisons such as interaction terms or mixed effects are not possible or require workarounds like constructing artificial variables combining multiple variables.
I propose to make the model definition more flexible by
- explicitly specifying the model using a Wilkinson Formula (e.g.
~ treatment + response(variable + covariate) ,~ treatment * response(interaction model). This would also extend to mixed effects models, should they be added in the future, e.g.~ treatment * timepoint + (1 | patient_id) - Define one or multiple contrasts per model. Multiple tests can be performed for a single model. This is also computationally more efficient since the model only needs to be fitted once.
How to define contrasts
There are different ways to define contrasts beyond simple comparisons. We need to discuss which ones to support and how.
(variable, baseline, target)tuples -> I'd definitely keep this one way or another as this cover a lot of cases and is very intutitive to define. However, we still need alternatives for more complex cases- Specifying coefficient names, e.g.
treatmenta:responseresponder - Contrast formula, e.g.
responseresponder - responsenon_responder, orcond(response="responder") - cond(response="non_responder") - Contrast vector (the most general case), e.g.
(0, -1, 1, 0, 0, 0, 0)
The challenge for 2-3 is to define this in a way that's method-agnostic. We'd likely need a step that converts a coefficient name or contrast formula into a contrast vector. One such way is limma's makeContrasts function. Another way is the cond() method implemented in glmGamPoi:
cond(response="responder", treatment="A") - cond(response="non_responder", treatment="B")
If anyone's aware of other alternatives, I'm happy to consider them.
Configuration format
In principle it would still be possible to use a CSV file, e.g.
| id | formula | contrast | comment |
|---|---|---|---|
| treatment_a_vs_b | ~ treatment + response | treatment;a;b | simple comparison |
| response_responder_vs_non_responder | ~ treatment + response | response;responder;non_responder | same model, different contrast |
| response_responder_vs_non_responder2 | ~ treatment + response | responseresponder - responsenon_responder | same comparison, different way to specify contrast |
| treatment_response_interaction | ~ treatment * response | treatmenta:responseresponder | listing a coefficient rather than a comparison |
Alternatively, I could see benefits from switching to json/yaml with an appropriate json schema for validation. IMO multiple contrasts per model and different ways to specify contrasts could be better represented with a hierarchical configuration format than csv. Having ;-separated fields within a csv column as it is currently implemented for blocking is a bit of a red flag as it's hard to read and hard to validate.
models:
- formula: "~ treatment + response"
contrasts:
- id: treatment_a_vs_b
type: simple
comparison: ["treatment", "A", "B"]
- id: response_responder_vs_non_responder2
type: formula
comparison: responseresponder - responsenon_responder
- id: response_responder_vs_non_responder2
type: cond
comparison: "cond(response="responder") - cond(response="non_responder")"
- formula: "~ treatment * response"
contrasts:
- id: treatment_response_interaction
type: coefficient
comparison: treatmenta:responseresponder
LMK what you think
CC @apeltzer @tschwarzl @nschcolnicov @atrigila @alanmmobbs93 FYI @suzannejin