
LinDA Unusual p-value Distribution

DarioS opened this issue 1 year ago · 3 comments

Using the data set shared previously via e-mail and fitting

adjNormalCancerFit <- linda(bacteriaMatrix, clinicalTablePairs, "~ Age * `Tissue Type` + Smoking * `Tissue Type` +
                            Gender * `Tissue Type` + `Primary Site` + (1|Patient)", "proportion", is.winsor = FALSE)

I get a strange-looking p-value distribution for many of the coefficients. For example,

[Image: histogram of p-values for one coefficient, showing a markedly non-uniform distribution]

How can the p-value distribution be made closer to uniform, as statistical theory predicts under the null hypothesis?

DarioS avatar May 29 '24 23:05 DarioS

Hi @DarioStrbenac,

Thank you for reporting this issue. We've conducted extensive testing of the p-value distribution in LinDA, and I'd like to share our findings:

1. Test Results

We created a comprehensive test with simulated null data (no true associations) and found that the p-value distribution can be improved by:

  • Increasing sample size (200+ samples showed better results)
  • Having a sufficient number of features (100+ features)

Our test metrics showed:

  • KS test p-values > 0.03 (indicating no significant deviation from uniform)
  • Density ratios around 2-3 (showing reasonable uniformity)
  • Coefficient of variation < 0.35 (indicating good spread)
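As a rough illustration, the three metrics above can be computed for any vector of p-values with base R alone. The sketch below uses simulated null p-values as a stand-in; in practice you would substitute the p-values of one coefficient from your linda fit.

```r
# Uniformity diagnostics for a vector of p-values (base R only).
# 'pvals' is simulated null data here; substitute your own p-values.
set.seed(1)
pvals <- runif(1000)

# 1. Kolmogorov-Smirnov test against Uniform(0, 1)
ks_p <- ks.test(pvals, "punif")$p.value

# 2. Density ratio: count in the fullest 5%-wide bin over the emptiest
counts <- hist(pvals, breaks = seq(0, 1, by = 0.05), plot = FALSE)$counts
density_ratio <- max(counts) / max(min(counts), 1)

# 3. Coefficient of variation of the bin counts
cv <- sd(counts) / mean(counts)

round(c(ks_p = ks_p, density_ratio = density_ratio, cv = cv), 3)
```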
2. Recommendations for Your Analysis

Based on your formula:
~ Age * `Tissue Type` + Smoking * `Tissue Type` + Gender * `Tissue Type` + `Primary Site` + (1|Patient)

To improve the p-value distribution, consider:

a) Sample Size: Ensure you have sufficient samples relative to the number of parameters in your model

  • Your model has multiple interaction terms, which increases the number of parameters
  • Aim for at least 200 samples if possible
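One way to see how quickly the interactions inflate the parameter count is to build the fixed-effects design matrix in base R. The toy data frame below is hypothetical (it just mirrors the two-level factors in your formula), and `Primary Site` plus the random effect `(1|Patient)` are omitted for brevity:

```r
# Hypothetical two-level factors mirroring the model's covariates
df <- expand.grid(
  Age        = c("Young", "Old"),
  TissueType = c("Normal", "Cancer"),
  Smoking    = c("No", "Yes"),
  Gender     = c("Female", "Male"),
  stringsAsFactors = TRUE
)

# Fixed-effects part of the formula only
mm <- model.matrix(~ Age * TissueType + Smoking * TissueType +
                     Gender * TissueType, df)
ncol(mm)  # 8 coefficients: intercept, 4 main effects, 3 interactions
```

Each of those 8 coefficients must be estimated from the effective (paired) sample, so every extra interaction costs degrees of freedom.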

b) Model Complexity:

  • Consider if all interaction terms are necessary
  • You might want to try a simpler model first and add complexity gradually

c) Data Quality:

  • Check for balanced groups in categorical variables
  • Consider the distribution of continuous variables (Age)
  • Examine potential outliers
3. Diagnostic Steps

You can try:
# Check sample sizes in each group
table(clinicalTablePairs$`Tissue Type`)
table(clinicalTablePairs$Smoking)
table(clinicalTablePairs$Gender)

# Look at Age distribution
hist(clinicalTablePairs$Age)

# Consider a simpler model first
simpleModel <- linda(bacteriaMatrix, clinicalTablePairs, 
                    "~ Age + `Tissue Type` + Smoking + Gender + (1|Patient)", 
                    "proportion", is.winsor = FALSE)

Could you share:

  1. How many samples do you have in total?
  2. How are they distributed across your categorical variables?
  3. What's the distribution of your Age variable?

This information would help us provide more specific recommendations for your case.

Best regards, Caffery

cafferychen777 avatar Nov 04 '24 20:11 cafferychen777

Thank you for evaluating it so thoroughly. The data set has 105 samples and 54 species. Age is categorical, not continuous.

> table(clinicalTablePairs$Age)
Young   Old 
   64    41
> table(clinicalTablePairs$`Tissue Type`)
Normal Cancer 
    51     54
> table(clinicalTablePairs$Smoking)
 No Yes 
 74  31
> table(clinicalTablePairs$Gender)
Female   Male 
    38     67
> table(Tissue = clinicalTablePairs$`Tissue Type`, Smoking = clinicalTablePairs$Smoking)
        Smoking
Tissue   No Yes
  Normal 36  15
  Cancer 38  16
> table(Tissue = clinicalTablePairs$`Tissue Type`, Gender = clinicalTablePairs$Gender)
        Gender
Tissue   Female Male
  Normal     19   32
  Cancer     19   35      

Fitting the simpler model with only main effects, I also see a strange histogram.

[Image: p-value histogram from the main-effects-only model, still showing a non-uniform distribution]

DarioS avatar Nov 06 '24 10:11 DarioS

Hi @DarioS,

Thank you for your patience. I've conducted extensive testing to investigate the p-value distribution issue you reported.

Test Results

I've created a comprehensive diagnostic script that simulates NULL data (no true differences) with parameters matching your study (n=105 samples, m=54 features). Here are the key findings:

Test 1: Your Data Configuration (n=105, m=54, paired design)

Simple Model: ~ Tissue_Type + (1|Patient)

  • KS test p-value: 0.0339 ⚠️ (significant deviation from uniform)
  • Density ratio (max/min bins): 6 (should be < 3)
  • Coefficient of variation: 0.547 (should be < 0.3)

This confirms the non-uniform p-value distribution you observed.

Test 2: Sample Size Effect

Testing with 50, 100, 200, and 400 samples:

  • All sample sizes passed KS test (p-values: 0.569, 0.983, 0.107, 0.337)
  • No systematic issues with sample size alone

Test 3: Feature Count Effect

Testing with 20, 50, 100, and 200 features:

  • All feature counts passed KS test (p-values: 0.772, 0.272, 0.732, 0.745)
  • No systematic issues with feature count alone

Root Cause Analysis

The p-value distribution issue in your case is likely due to a combination of factors:

  1. Marginal sample size (105 samples with paired structure = ~52-53 independent units)
  2. Low feature count (54 species) for stable bias estimation
  3. Paired design complexity reducing effective degrees of freedom
  4. Potential non-uniform effect sizes in your real data (not all null)

LinDA's bias correction relies on estimating the mode of regression coefficients across features, assuming most features are null. With only 54 features and potential true biological differences, this assumption may not hold well.
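The idea behind that bias correction can be sketched in a few lines of base R. This is only an illustration of the mode assumption, not LinDA's actual implementation; the effect sizes and feature counts are made up:

```r
# Illustrative only: mode-based debiasing under the "most features are
# null" assumption. 90 of 100 features share the same compositional
# bias (0.5); 10 carry a true effect.
set.seed(2)
beta_hat <- c(rnorm(90, mean = 0.5, sd = 0.1),  # null features + shared bias
              rnorm(10, mean = 2.0, sd = 0.1))  # truly associated features

d    <- density(beta_hat)
bias <- d$x[which.max(d$y)]   # mode of the coefficient estimates
beta_corrected <- beta_hat - bias

round(bias, 2)  # recovers a value near the true shared bias of 0.5
```

With only 54 features, or with a larger fraction of truly associated features, this mode estimate becomes noisy, and the correction (and hence the p-value calibration) degrades.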

Recommended Solutions

Immediate Actions:

  1. Simplify your model - Remove interaction terms:
# Instead of:
~ Age * Tissue_Type + Smoking * Tissue_Type + Gender * Tissue_Type + (1|Patient)

# Try:
~ Age + Smoking + Gender + Tissue_Type + (1|Patient)
  2. Increase filtering thresholds to focus on more robust features:
linda(..., 
      prev.filter = 0.2,  # Keep features present in ≥20% of samples
      mean.abund.filter = 1e-5)  # Filter very low abundance
  3. Try different zero-handling methods:
# Force pseudo-count approach
linda(..., adaptive = FALSE, zero.handling = "pseudo-count")

# Or force imputation
linda(..., adaptive = FALSE, zero.handling = "imputation")
  4. Validate with permutation tests for significant hits
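For the permutation check, a minimal single-feature sketch in base R might look like this. The variable names are hypothetical, and note that for your paired design you would permute labels within patients rather than freely:

```r
# Permutation test for one feature (illustrative; unpaired shuffling)
set.seed(3)
abundance <- rnorm(40)                        # stand-in for one species' transformed abundances
group     <- rep(c("Normal", "Cancer"), each = 20)

obs <- abs(t.test(abundance ~ group)$statistic)

# Empirical null: recompute the statistic under shuffled labels
perm <- replicate(999, {
  g <- sample(group)
  abs(t.test(abundance ~ g)$statistic)
})

p_perm <- (1 + sum(perm >= obs)) / (1 + length(perm))
p_perm  # permutation p-value for this feature
```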

Alternative Approaches:

Consider complementary methods for validation:

  • ANCOM-BC: Specifically designed for compositional data
  • MaAsLin2: Alternative modeling framework
  • DESeq2 on raw counts (if available)

Technical Note

This is not a bug in LinDA, but rather a reflection of your data's structure approaching the method's limitations. The LinDA paper recommends having sufficient features (ideally 100+) for stable bias estimation.
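A quick simulation makes the feature-count point concrete. The mode estimator below is a simplified stand-in for LinDA's bias estimate, and all numbers are illustrative:

```r
# How stable is a mode-style bias estimate as the feature count grows?
set.seed(4)
mode_sd <- sapply(c(54, 100, 200), function(m) {
  est <- replicate(200, {
    beta <- rnorm(m, mean = 0.5, sd = 0.3)  # all-null coefficients, true bias 0.5
    d <- density(beta)
    d$x[which.max(d$y)]                     # mode = bias estimate
  })
  sd(est)                                   # variability of the estimate
})
round(mode_sd, 3)  # variability shrinks as the feature count grows
```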

Diagnostic Scripts

I've created test scripts that you can adapt for your data.

Would you be willing to share your data (or a subset) so I can provide more specific recommendations? Understanding the actual distribution of effect sizes in your data would help determine if the p-value pattern reflects:

  1. Calibration issues (fixable), or
  2. True biological signal (expected)

Please let me know if you'd like me to help analyze your specific dataset or if you have questions about any of the recommendations.

Best regards,
Chen Yang

cafferychen777 avatar Nov 10 '25 06:11 cafferychen777