unite-train

Taxonomy curation

Open colinbrislawn opened this issue 2 months ago • 0 comments

UNITE also contains quite some unannotated accessions that probably do not make much sense to keep in for a classifier, so some optimization is in order, as we showed here: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009581

I should have read the RESCRIPt paper sooner!

These results indicate that, even in curated release versions of some public databases, some additional curation is beneficial to remove sequences with missing or uninformative taxonomic labels.

filtered to contain only Fungi with at least order-level taxonomic annotation ("Fungi Order")

+10% to f-score at some levels, which is nuts

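# Keep only taxa with at least order-level annotation.
# NOTE: the exclude pattern is a guess. UNITE rank prefixes use a double
# underscore (e.g. 'o__unidentified'), so check the actual labels in taxonomy.qza.
qiime rescript filter-taxa \
  --i-taxonomy taxonomy.qza \
  --p-exclude 'o__unidentified' \
  --o-filtered-taxonomy taxonomy-filtered.qza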
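Before settling on the exclude pattern above, it may help to see which order-level labels the taxonomy actually contains. The sketch below is a rough helper, not part of RESCRIPt: `count_order_labels` is a hypothetical name, and it assumes the taxonomy has been exported to a two-column TSV (`Feature ID`, `Taxon`) with semicolon-separated ranks and an `o__` prefix for order.

```shell
# Tally order-level labels in an exported taxonomy TSV (Feature ID <tab> Taxon).
# Assumption: ranks are ';'-separated and the order rank starts with "o__".
count_order_labels() {
  awk -F'\t' 'NR > 1 {                      # skip the header row
    order = "(no order field)"
    n = split($2, ranks, ";")
    for (i = 1; i <= n; i++) {
      r = ranks[i]
      gsub(/^ +| +$/, "", r)                # tolerate "; " style separators
      if (r ~ /^o__/) order = r
    }
    counts[order]++
  }
  END { for (o in counts) printf "%d\t%s\n", counts[o], o }' "$1" |
    sort -t "$(printf '\t')" -k1,1nr -k2,2  # most frequent labels first
}

# Usage sketch (the export directory name is arbitrary):
#   qiime tools export --input-path taxonomy.qza --output-path exported-taxonomy
#   count_order_labels exported-taxonomy/taxonomy.tsv | head
```

Whatever label dominates the "unannotated" rows in that tally is the string to pass to `--p-exclude`.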

# Keep every entry, but strip the trailing Species Hypothesis (sh__) suffix
qiime rescript edit-taxonomy \
    --i-taxonomy taxonomy.qza \
    --o-edited-taxonomy taxonomy-no-SH.qza \
    --p-search-strings ';sh__.*' \
    --p-replacement-strings '' \
    --p-use-regex
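To preview what that regex would strip without building a new artifact, the same pattern can be tried with `sed` on an exported taxonomy TSV. This is only a sketch: `strip_sh_suffix` is a hypothetical helper, and the file path in the usage comment assumes the taxonomy was exported with `qiime tools export`.

```shell
# Remove everything from ";sh__" to the end of each line, mirroring the
# search string ';sh__.*' with an empty replacement.
strip_sh_suffix() {
  sed -E 's/;sh__.*$//'
}

# Usage sketch (path is an assumption):
#   strip_sh_suffix < exported-taxonomy/taxonomy.tsv | head
```

Lines without an `;sh__` rank, including the header row, pass through unchanged.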

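# Compare the three taxonomies side by side.
# NOTE (unverified): merging may fail if the artifacts share a column name
# such as "Taxon"; tabulate them separately if so.
qiime metadata tabulate \
    --m-input-file taxonomy.qza \
    --m-input-file taxonomy-filtered.qza \
    --m-input-file taxonomy-no-SH.qza \
    --o-visualization taxonomy-compare.qzv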

colinbrislawn, Oct 29 '25 20:10