
LinDA Unusual p-value Distribution

DarioS opened this issue 1 year ago · 3 comments

Using the data set shared previously via e-mail and fitting

adjNormalCancerFit <- linda(bacteriaMatrix, clinicalTablePairs, "~ Age * `Tissue Type` + Smoking * `Tissue Type` +
                            Gender * `Tissue Type` + `Primary Site` + (1|Patient)", "proportion", is.winsor = FALSE)

I get a strange-looking p-value distribution for many of the coefficients. For example,

[Image: histogram of p-values for one coefficient, showing a markedly non-uniform distribution]

How can the p-value distribution be made closer to uniform, as statistical theory predicts under the null hypothesis?

DarioS avatar May 29 '24 23:05 DarioS

Hi @DarioStrbenac,

Thank you for reporting this issue. We've conducted extensive testing of the p-value distribution in LinDA, and I'd like to share our findings:

1. Test Results

We created a comprehensive test with simulated null data (no true associations) and found that the p-value distribution can be improved by:

  • Increasing sample size (200+ samples showed better results)
  • Having a sufficient number of features (100+ features)

Our test metrics showed:

  • KS test p-values > 0.03 (indicating no significant deviation from uniform)
  • Density ratios around 2-3 (showing reasonable uniformity)
  • Coefficient of variation < 0.35 (indicating good spread)
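As a rough illustration, the three metrics above can be computed for any vector of p-values with base R alone. The sketch below uses simulated null p-values as a stand-in; in practice you would substitute the p-values of one coefficient from your linda fit.

```r
# Uniformity diagnostics for a vector of p-values (base R only).
# 'pvals' is simulated null data here; substitute your own p-values.
set.seed(1)
pvals <- runif(1000)

# 1. Kolmogorov-Smirnov test against Uniform(0, 1)
ks_p <- ks.test(pvals, "punif")$p.value

# 2. Density ratio: count in the fullest 5%-wide bin over the emptiest
counts <- hist(pvals, breaks = seq(0, 1, by = 0.05), plot = FALSE)$counts
density_ratio <- max(counts) / max(min(counts), 1)

# 3. Coefficient of variation of the bin counts
cv <- sd(counts) / mean(counts)

round(c(ks_p = ks_p, density_ratio = density_ratio, cv = cv), 3)
```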
2. Recommendations for Your Analysis

Based on your formula:
~ Age * `Tissue Type` + Smoking * `Tissue Type` + Gender * `Tissue Type` + `Primary Site` + (1|Patient)

To improve the p-value distribution, consider:

a) Sample Size: Ensure you have sufficient samples relative to the number of parameters in your model

  • Your model has multiple interaction terms, which increases the number of parameters
  • Aim for at least 200 samples if possible
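One way to see how quickly the interactions inflate the parameter count is to build the fixed-effects design matrix in base R. The toy data frame below is hypothetical (it just mirrors the two-level factors in your formula), and `Primary Site` plus the random effect `(1|Patient)` are omitted for brevity:

```r
# Hypothetical two-level factors mirroring the model's covariates
df <- expand.grid(
  Age        = c("Young", "Old"),
  TissueType = c("Normal", "Cancer"),
  Smoking    = c("No", "Yes"),
  Gender     = c("Female", "Male"),
  stringsAsFactors = TRUE
)

# Fixed-effects part of the formula only
mm <- model.matrix(~ Age * TissueType + Smoking * TissueType +
                     Gender * TissueType, df)
ncol(mm)  # 8 coefficients: intercept, 4 main effects, 3 interactions
```

Each of those 8 coefficients must be estimated from the effective (paired) sample, so every extra interaction costs degrees of freedom.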

b) Model Complexity:

  • Consider if all interaction terms are necessary
  • You might want to try a simpler model first and add complexity gradually

c) Data Quality:

  • Check for balanced groups in categorical variables
  • Consider the distribution of continuous variables (Age)
  • Examine potential outliers
3. Diagnostic Steps

You can try:
# Check sample sizes in each group
table(clinicalTablePairs$`Tissue Type`)
table(clinicalTablePairs$Smoking)
table(clinicalTablePairs$Gender)

# Look at Age distribution
hist(clinicalTablePairs$Age)

# Consider a simpler model first
simpleModel <- linda(bacteriaMatrix, clinicalTablePairs, 
                    "~ Age + `Tissue Type` + Smoking + Gender + (1|Patient)", 
                    "proportion", is.winsor = FALSE)

Could you share:

  1. How many samples do you have in total?
  2. How are they distributed across your categorical variables?
  3. What's the distribution of your Age variable?

This information would help us provide more specific recommendations for your case.

Best regards, Caffery

cafferychen777 avatar Nov 04 '24 20:11 cafferychen777

Thank you for evaluating it so thoroughly. The data set has 105 samples and 54 species. Age is categorical, not continuous.

> table(clinicalTablePairs$Age)
Young   Old 
   64    41
> table(clinicalTablePairs$`Tissue Type`)
Normal Cancer 
    51     54
> table(clinicalTablePairs$Smoking)
 No Yes 
 74  31
> table(clinicalTablePairs$Gender)
Female   Male 
    38     67
> table(Tissue = clinicalTablePairs$`Tissue Type`, Smoking = clinicalTablePairs$Smoking)
        Smoking
Tissue   No Yes
  Normal 36  15
  Cancer 38  16
> table(Tissue = clinicalTablePairs$`Tissue Type`, Gender = clinicalTablePairs$Gender)
        Gender
Tissue   Female Male
  Normal     19   32
  Cancer     19   35      

Fitting the simpler model with only main effects, I also see a strange histogram.

[Image: p-value histogram from the main-effects-only model, still showing a non-uniform distribution]

DarioS avatar Nov 06 '24 10:11 DarioS

Hi @DarioS,

Thank you for your patience. I've conducted extensive testing to investigate the p-value distribution issue you reported.

Test Results

I've created a comprehensive diagnostic script that simulates NULL data (no true differences) with parameters matching your study (n=105 samples, m=54 features). Here are the key findings:

Test 1: Your Data Configuration (n=105, m=54, paired design)

Simple Model: ~ Tissue_Type + (1|Patient)

  • KS test p-value: 0.0339 ⚠️ (significant deviation from uniform)
  • Density ratio (max/min bins): 6 (should be < 3)
  • Coefficient of variation: 0.547 (should be < 0.3)

This confirms the non-uniform p-value distribution you observed.

Test 2: Sample Size Effect

Testing with 50, 100, 200, and 400 samples:

  • All sample sizes passed KS test (p-values: 0.569, 0.983, 0.107, 0.337)
  • No systematic issues with sample size alone

Test 3: Feature Count Effect

Testing with 20, 50, 100, and 200 features:

  • All feature counts passed KS test (p-values: 0.772, 0.272, 0.732, 0.745)
  • No systematic issues with feature count alone

Root Cause Analysis

The p-value distribution issue in your case is likely due to a combination of factors:

  1. Marginal sample size (105 samples with paired structure = ~52-53 independent units)
  2. Low feature count (54 species) for stable bias estimation
  3. Paired design complexity reducing effective degrees of freedom
  4. Potential non-uniform effect sizes in your real data (not all null)

LinDA's bias correction relies on estimating the mode of regression coefficients across features, assuming most features are null. With only 54 features and potential true biological differences, this assumption may not hold well.
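The idea behind that bias correction can be sketched in a few lines of base R. This is only an illustration of the mode assumption, not LinDA's actual implementation; the effect sizes and feature counts are made up:

```r
# Illustrative only: mode-based debiasing under the "most features are
# null" assumption. 90 of 100 features share the same compositional
# bias (0.5); 10 carry a true effect.
set.seed(2)
beta_hat <- c(rnorm(90, mean = 0.5, sd = 0.1),  # null features + shared bias
              rnorm(10, mean = 2.0, sd = 0.1))  # truly associated features

d    <- density(beta_hat)
bias <- d$x[which.max(d$y)]   # mode of the coefficient estimates
beta_corrected <- beta_hat - bias

round(bias, 2)  # recovers a value near the true shared bias of 0.5
```

With only 54 features, or with a larger fraction of truly associated features, this mode estimate becomes noisy, and the correction (and hence the p-value calibration) degrades.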

Recommended Solutions

Immediate Actions:

  1. Simplify your model - Remove interaction terms:
# Instead of:
~ Age * Tissue_Type + Smoking * Tissue_Type + Gender * Tissue_Type + (1|Patient)

# Try:
~ Age + Smoking + Gender + Tissue_Type + (1|Patient)
  2. Increase filtering thresholds to focus on more robust features:
linda(..., 
      prev.filter = 0.2,  # Keep features present in ≥20% of samples
      mean.abund.filter = 1e-5)  # Filter very low abundance
  3. Try different zero-handling methods:
# Force pseudo-count approach
linda(..., adaptive = FALSE, zero.handling = "pseudo-count")

# Or force imputation
linda(..., adaptive = FALSE, zero.handling = "imputation")
  4. Validate with permutation tests for significant hits
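For the permutation check, a minimal single-feature sketch in base R might look like this. The variable names are hypothetical, and note that for your paired design you would permute labels within patients rather than freely:

```r
# Permutation test for one feature (illustrative; unpaired shuffling)
set.seed(3)
abundance <- rnorm(40)                        # stand-in for one species' transformed abundances
group     <- rep(c("Normal", "Cancer"), each = 20)

obs <- abs(t.test(abundance ~ group)$statistic)

# Empirical null: recompute the statistic under shuffled labels
perm <- replicate(999, {
  g <- sample(group)
  abs(t.test(abundance ~ g)$statistic)
})

p_perm <- (1 + sum(perm >= obs)) / (1 + length(perm))
p_perm  # permutation p-value for this feature
```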

Alternative Approaches:

Consider complementary methods for validation:

  • ANCOM-BC: Specifically designed for compositional data
  • MaAsLin2: Alternative modeling framework
  • DESeq2 on raw counts (if available)

Technical Note

This is not a bug in LinDA, but rather a reflection of your data's structure approaching the method's limitations. The LinDA paper recommends having sufficient features (ideally 100+) for stable bias estimation.
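A quick simulation makes the feature-count point concrete. The mode estimator below is a simplified stand-in for LinDA's bias estimate, and all numbers are illustrative:

```r
# How stable is a mode-style bias estimate as the feature count grows?
set.seed(4)
mode_sd <- sapply(c(54, 100, 200), function(m) {
  est <- replicate(200, {
    beta <- rnorm(m, mean = 0.5, sd = 0.3)  # all-null coefficients, true bias 0.5
    d <- density(beta)
    d$x[which.max(d$y)]                     # mode = bias estimate
  })
  sd(est)                                   # variability of the estimate
})
round(mode_sd, 3)  # variability shrinks as the feature count grows
```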

Diagnostic Scripts

I've created test scripts that you can adapt for your data.

Would you be willing to share your data (or a subset) so I can provide more specific recommendations? Understanding the actual distribution of effect sizes in your data would help determine if the p-value pattern reflects:

  1. Calibration issues (fixable), or
  2. True biological signal (expected)

Please let me know if you'd like me to help analyze your specific dataset or if you have questions about any of the recommendations.

Best regards,
Chen Yang

cafferychen777 avatar Nov 10 '25 06:11 cafferychen777