Issue with unclassified taxa inflating DA analysis at specified taxonomic levels in ANCOM-BC2
It seems that when performing differential abundance (DA) analysis at the species level, for example, using ANCOM-BC2, ASVs/OTUs not classified at this level are aggregated into higher taxonomic ranks (e.g., genus or family) and included in the analysis. This significantly increases the number of taxa tested, leading to inflated adjusted p-values due to the multiple comparisons, ultimately reducing sensitivity to detect true differences. Additionally, since ANCOM-BC2 uses each taxon as a potential denominator for comparisons, the inclusion of higher-level taxa distorts the analysis at the species level. This problem is not limited to species; it also affects genus-level analysis and any specified taxonomic level where ASVs/OTUs lack proper classification.
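To make the multiple-testing point concrete, here is a minimal sketch in plain Python. It uses the Benjamini-Hochberg procedure purely as an illustration (the correction ANCOM-BC2 actually applies depends on the adjustment method you configure), and all p-values are made-up numbers, not results from any real dataset.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for three taxa classified at species level.
species_p = [0.004, 0.012, 0.030]
# Hypothetical extra features created when unclassified ASVs are
# aggregated at higher ranks and tested alongside the species.
extra_p = [0.40, 0.55, 0.70, 0.85, 0.95, 0.60, 0.75]

adj_species_only = bh_adjust(species_p)
adj_with_extra = bh_adjust(species_p + extra_p)[: len(species_p)]

for raw, a, b in zip(species_p, adj_species_only, adj_with_extra):
    print(f"raw={raw:.3f}  species only={a:.3f}  with extra taxa={b:.3f}")
```

The raw p-values of the three species never change, but their adjusted values grow once the extra features enter the test set, so fewer of them survive a 0.05 cutoff.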
Perhaps filtering out ASVs/OTUs without the target taxonomic classification before running ANCOM-BC2 would address this issue? Importantly, we have observed that removing these unclassified taxa also alters the log-fold changes (LFC) of the remaining taxa, suggesting that their inclusion can significantly impact the DA results. A built-in option to automatically exclude taxa lacking classification at the specified level would help ensure more accurate and meaningful DA analysis.
I am facing the same situation. In my ANCOM-BC2 test, many of my differentially abundant taxa are unknown at the genus level. If these unassigned taxa are filtered out, my results change and, as expected, I obtain a new result containing only known taxa. I have attached the results of my analysis. I was thinking of keeping all my taxa for the alpha and beta diversity metrics but filtering out these unassigned sequences for my ANCOM-BC2 analysis. Any suggestions? @FrederickHuangLin
I was thinking of keeping all my taxa for the alpha and beta diversity metrics but filtering out these unassigned sequences for my ANCOM-BC2 analysis.
@DavidRR24 The problem with removing unassigned sequences is that it might distort the sampling fraction estimates, which are a function of the total read counts and the total number of taxa. In fact, the unknown sampling fractions are estimated by taking the log-transformed feature table and centering it by row means. Grouping together the sequences that are unassigned at your desired taxonomic level is a better option than filtering. This approach prevents inflated adjusted p-values, preserves total read counts, and may provide a more accurate estimate of the sampling fractions. However, it could deflate the number of taxa, because unclassified taxa that are potentially phylogenetically distinct are collapsed into one group. Perhaps the reason ANCOM-BC2 aggregates unclassified taxa into higher taxonomic ranks is to obtain a more accurate estimate of the total number of taxa.
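A small sketch of the difference between filtering and pooling, in plain Python. The count table, the ASV names, and the `collapse` helper are all hypothetical and only illustrate the bookkeeping; this is not how ANCOM-BC2 itself aggregates taxa.

```python
# Hypothetical toy data: counts[sample][asv], plus the genus assigned to
# each ASV (None means the ASV is not classified at genus level).
counts = {
    "S1": {"asv1": 120, "asv2": 30, "asv3": 50, "asv4": 10},
    "S2": {"asv1": 80, "asv2": 60, "asv3": 5, "asv4": 55},
}
genus_of = {"asv1": "Bacteroides", "asv2": "Prevotella", "asv3": None, "asv4": None}


def collapse(counts, genus_of, unassigned="drop"):
    """Sum ASV counts per genus; unassigned ASVs are dropped or pooled."""
    out = {}
    for sample, row in counts.items():
        agg = {}
        for asv, n in row.items():
            genus = genus_of[asv]
            if genus is None:
                if unassigned == "drop":
                    continue  # filtering: these reads vanish from the table
                genus = "Unassigned"  # pooling: one shared bucket
            agg[genus] = agg.get(genus, 0) + n
        out[sample] = agg
    return out


def library_sizes(table):
    """Total read count per sample."""
    return {sample: sum(row.values()) for sample, row in table.items()}


filtered = collapse(counts, genus_of, unassigned="drop")
pooled = collapse(counts, genus_of, unassigned="pool")

print("original:", library_sizes(counts))
print("filtered:", library_sizes(filtered))
print("pooled:  ", library_sizes(pooled))
```

Filtering shrinks each sample's total by a different amount, while pooling keeps the totals unchanged and adds only a single extra feature to the test set.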
I'm sorry for the delay in my response. I agree with your argument that taxa which have not been classified to a specific taxonomic rank should not be deleted, and I would add the following ideas that came up after I raised this same topic in the QIIME forum. I decided to keep all ASVs, whether or not they are classified to genus level. I manually relabeled the unclassified ones as “Unclassified” followed by the most specific taxonomic rank to which they could be classified, and where I had several such ASVs I numbered them consecutively. This keeps the taxonomic resolution intact and avoids grouping several unclassified ASVs together in an artificial taxonomic rank. For example, I had 20 ASVs from the phylum Zixibacteria with no classification below phylum. By default, the ANCOM-BC2 analysis at the genus level grouped them all together as if they were a single taxon, “Zixibacteria”, at the genus level, which is an artifact. To solve this, I labeled the 20 ASVs in the phylum Zixibacteria as “Unclassified_Zixibacteria_1”, “Unclassified_Zixibacteria_2”, ..., “Unclassified_Zixibacteria_20”. This way I don't delete them, I avoid their artificial grouping into a single genus (we don't know whether they belong to the same genus or not), and I keep the resolution of my dataset intact.
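The manual relabeling described above could be automated along these lines. This is only a sketch in plain Python: the `genus_labels` helper, the rank list, and the dictionary layout of the taxonomy are assumptions for illustration, so adapt them to however your taxonomy table is actually stored.

```python
from collections import defaultdict

# Assumed rank order, from least to most specific.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus"]


def genus_labels(taxonomy):
    """Map ASV id -> genus-level label, numbering ASVs that lack a genus."""
    counters = defaultdict(int)
    labels = {}
    for asv_id, lineage in taxonomy.items():
        genus = lineage.get("genus")
        if genus:
            labels[asv_id] = genus
            continue
        # Most specific rank that is actually assigned for this ASV.
        deepest = next(
            (lineage[r] for r in reversed(RANKS) if lineage.get(r)), "Unknown"
        )
        counters[deepest] += 1
        labels[asv_id] = f"Unclassified_{deepest}_{counters[deepest]}"
    return labels


# Hypothetical taxonomy: two ASVs stop at phylum, one reaches genus.
taxonomy = {
    "asv_a": {"kingdom": "Bacteria", "phylum": "Zixibacteria"},
    "asv_b": {"kingdom": "Bacteria", "phylum": "Zixibacteria"},
    "asv_c": {"kingdom": "Bacteria", "phylum": "Bacteroidota", "genus": "Bacteroides"},
}

labels = genus_labels(taxonomy)
for asv_id, label in labels.items():
    print(asv_id, "->", label)
```

Because every unclassified ASV gets its own label, aggregating at genus level afterwards leaves them as separate features instead of merging them under the phylum name.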