MicrobiomeStat
LinDA Unusual p-value Distribution
Using the data set shared previously via e-mail and fitting
adjNormalCancerFit <- linda(bacteriaMatrix, clinicalTablePairs,
    "~ Age * `Tissue Type` + Smoking * `Tissue Type` + Gender * `Tissue Type` + `Primary Site` + (1|Patient)",
    "proportion", is.winsor = FALSE)
I get a strange-looking p-value distribution for many of the coefficients; the histograms are far from flat. How can the distribution be made more uniform, as statistical theory predicts when the null hypothesis holds?
Hi @DarioStrbenac,
Thank you for reporting this issue. We've conducted extensive testing of the p-value distribution in LinDA, and I'd like to share our findings:
Test Results
We created a comprehensive test with simulated null data (no true associations) and found that the p-value distribution improves with:
- A larger sample size (200+ samples showed better results)
- A sufficient number of features (100+ features)
Our test metrics showed:
- KS test p-values > 0.03 (indicating no significant deviation from uniform)
- Density ratios around 2-3 (showing reasonable uniformity)
- Coefficient of variation < 0.35 (indicating good spread)
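The three metrics above can be reproduced with base R alone. A minimal sketch on simulated null p-values (the choice of 10 bins is an assumption for illustration, not the exact script we used):

```r
# Uniformity diagnostics for a vector of p-values.
set.seed(1)
p <- runif(200)  # stand-in for null p-values from a linda() fit

# 1. Kolmogorov-Smirnov test against Uniform(0, 1)
ks <- ks.test(p, "punif")

# 2. Density ratio: max/min bin counts over 10 equal-width bins
counts <- as.numeric(table(cut(p, breaks = seq(0, 1, by = 0.1))))
density_ratio <- max(counts) / max(min(counts), 1)

# 3. Coefficient of variation of the bin counts
cv <- sd(counts) / mean(counts)

c(ks_p = ks$p.value, density_ratio = density_ratio, cv = cv)
```

Replacing `p` with the p-values of one of your coefficients lets you check your fit against the same thresholds.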
Recommendations for Your Analysis
Based on your formula:
~ Age * `Tissue Type` + Smoking * `Tissue Type` + Gender * `Tissue Type` + `Primary Site` + (1|Patient)
To improve the p-value distribution, consider:
a) Sample Size: Ensure you have sufficient samples relative to the number of parameters in your model
- Your model has multiple interaction terms, which increases the number of parameters
- Aim for at least 200 samples if possible
b) Model Complexity:
- Consider if all interaction terms are necessary
- You might want to try a simpler model first and add complexity gradually
c) Data Quality:
- Check for balanced groups in categorical variables
- Consider the distribution of continuous variables (Age)
- Examine potential outliers
Diagnostic Steps
You can try:
# Check sample sizes in each group
table(clinicalTablePairs$`Tissue Type`)
table(clinicalTablePairs$Smoking)
table(clinicalTablePairs$Gender)
# Look at Age distribution
hist(clinicalTablePairs$Age)
# Consider a simpler model first
simpleModel <- linda(bacteriaMatrix, clinicalTablePairs,
"~ Age + `Tissue Type` + Smoking + Gender + (1|Patient)",
"proportion", is.winsor = FALSE)
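To inspect the distribution directly, plot a histogram of the raw p-values from the fit. A hedged sketch: in recent MicrobiomeStat versions, `linda()` returns a list whose `$output` element holds one data frame per coefficient with a `pvalue` column, but please check `names(fit$output)` in your session; the coefficient name and the mock object below are illustrative only, so the code runs standalone.

```r
# Mock object mimicking the (assumed) shape of a linda() result:
# $output is a named list of per-coefficient data frames with a
# `pvalue` column, one row per feature.
set.seed(1)
simpleModel <- list(output = list(
  TissueTypeCancer = data.frame(pvalue = runif(54))  # one row per species
))

pvals <- simpleModel$output$TissueTypeCancer$pvalue
hist(pvals, breaks = 20, xlim = c(0, 1),
     main = "Raw p-values: Tissue Type", xlab = "p-value")
```

Under the null, this histogram should look roughly flat; a spike near 0 or 1 is the pattern you are describing.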
Could you share:
- How many samples do you have in total?
- How are they distributed across your categorical variables?
- What's the distribution of your Age variable?
This information would help us provide more specific recommendations for your case.
Best regards, Caffery
Thank you for evaluating it so thoroughly. The data set has 105 samples and 54 species, and Age is categorical, not continuous.
> table(clinicalTablePairs$Age)
Young Old
64 41
> table(clinicalTablePairs$`Tissue Type`)
Normal Cancer
51 54
> table(clinicalTablePairs$Smoking)
No Yes
74 31
> table(clinicalTablePairs$Gender)
Female Male
38 67
> table(Tissue = clinicalTablePairs$`Tissue Type`, Smoking = clinicalTablePairs$Smoking)
Smoking
Tissue No Yes
Normal 36 15
Cancer 38 16
> table(Tissue = clinicalTablePairs$`Tissue Type`, Gender = clinicalTablePairs$Gender)
Gender
Tissue Female Male
Normal 19 32
Cancer 19 35
Fitting the simpler model with only main effects, I also see a strange histogram.
Hi @DarioS,
Thank you for your patience. I've conducted extensive testing to investigate the p-value distribution issue you reported.
Test Results
I've created a comprehensive diagnostic script that simulates null data (no true differences) with parameters matching your study (n = 105 samples, m = 54 features). Here are the key findings:
Test 1: Your Data Configuration (n=105, m=54, paired design)
Simple Model: ~ Tissue_Type + (1|Patient)
- KS test p-value: 0.0339 ⚠️ (significant deviation from uniform)
- Density ratio (max/min bins): 6 (should be < 3)
- Coefficient of variation: 0.547 (should be < 0.3)
This confirms the non-uniform p-value distribution you observed.
Test 2: Sample Size Effect
Testing with 50, 100, 200, and 400 samples:
- All sample sizes passed KS test (p-values: 0.569, 0.983, 0.107, 0.337)
- No systematic issues with sample size alone
Test 3: Feature Count Effect
Testing with 20, 50, 100, and 200 features:
- All feature counts passed KS test (p-values: 0.772, 0.272, 0.732, 0.745)
- No systematic issues with feature count alone
Root Cause Analysis
The p-value distribution issue in your case is likely due to a combination of factors:
- Marginal sample size (105 samples with paired structure = ~52-53 independent units)
- Low feature count (54 species) for stable bias estimation
- Paired design complexity reducing effective degrees of freedom
- Potential non-uniform effect sizes in your real data (not all null)
LinDA's bias correction relies on estimating the mode of regression coefficients across features, assuming most features are null. With only 54 features and potential true biological differences, this assumption may not hold well.
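As a toy illustration of that assumption (not LinDA's actual implementation): with a shared compositional bias and most features null, the mode of the per-feature coefficients recovers the bias, but the estimate degrades as the null fraction or the feature count shrinks. All numbers below are made up for the sketch.

```r
# 100 features: 90 null (sharing a compositional bias of 0.5) and
# 10 truly associated. Estimate the bias as the mode of the
# coefficient distribution via a kernel density, then subtract it.
set.seed(2)
beta_hat <- c(rnorm(90, mean = 0.5, sd = 0.2),   # null features + shared bias
              rnorm(10, mean = 2.0, sd = 0.2))   # truly associated features

d <- density(beta_hat)
bias <- d$x[which.max(d$y)]    # mode of the coefficient distribution
beta_corrected <- beta_hat - bias
```

With only 54 features, the density estimate behind `bias` is noisier, which is one route to the miscalibration you are seeing.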
Recommended Solutions
Immediate Actions:
- Simplify your model - Remove interaction terms:
# Instead of:
~ Age * Tissue_Type + Smoking * Tissue_Type + Gender * Tissue_Type + (1|Patient)
# Try:
~ Age + Smoking + Gender + Tissue_Type + (1|Patient)
- Increase filtering thresholds to focus on more robust features:
linda(...,
prev.filter = 0.2, # Keep features present in ≥20% of samples
mean.abund.filter = 1e-5) # Filter very low abundance
- Try different zero-handling methods:
# Force pseudo-count approach
linda(..., adaptive = FALSE, zero.handling = "pseudo-count")
# Or force imputation
linda(..., adaptive = FALSE, zero.handling = "imputation")
- Validate with permutation tests for significant hits
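For the permutation check, a minimal sketch using a plain `lm()` as a stand-in for the per-feature regression (with your paired design you would permute labels within patient pairs to respect the (1|Patient) structure; the data here are simulated):

```r
# Label-permutation p-value for one feature's group coefficient.
set.seed(3)
y <- rnorm(105)                                  # e.g. CLR-transformed abundances
group <- factor(rep(c("Normal", "Cancer"), length.out = 105))

obs <- coef(lm(y ~ group))[2]                    # observed coefficient
perm <- replicate(999, coef(lm(y ~ sample(group)))[2])

# Two-sided permutation p-value with the +1 correction
p_perm <- (sum(abs(perm) >= abs(obs)) + 1) / (length(perm) + 1)
```

If a hit's permutation p-value is far from its parametric one, that points to calibration problems rather than signal.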
Alternative Approaches:
Consider complementary methods for validation:
- ANCOM-BC: Specifically designed for compositional data
- MaAsLin2: Alternative modeling framework
- DESeq2 on raw counts (if available)
Technical Note
This is not a bug in LinDA, but rather a reflection of your data's structure approaching the method's limitations. The LinDA paper recommends having sufficient features (ideally 100+) for stable bias estimation.
Diagnostic Scripts
I've created test scripts that you can adapt for your data:
- test_linda_pvalue_distribution.R (I can share it if you'd like)
Would you be willing to share your data (or a subset) so I can provide more specific recommendations? Understanding the actual distribution of effect sizes in your data would help determine if the p-value pattern reflects:
- Calibration issues (fixable), or
- True biological signal (expected)
Please let me know if you'd like me to help analyze your specific dataset or if you have questions about any of the recommendations.
Best regards,
Chen Yang