unite-train
Taxonomy curation
UNITE also contains quite a few unannotated accessions that are probably not worth keeping in a classifier's reference set, so some curation is in order, as we showed here: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009581
I should have read the RESCRIPt paper sooner!
These results indicate that, even in curated release versions of some public databases, some additional curation is beneficial to remove sequences with missing or uninformative taxonomic labels.
filtered to contain only Fungi with at least order-level taxonomic annotation ("Fungi Order")
That is +10% to F-score at some taxonomic levels, which is nuts.
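# Keep only taxa annotated to at least order level.
# Assumes this UNITE release labels unknown orders as 'o__unidentified'; check your taxonomy file.
qiime rescript filter-taxa \
  --i-taxonomy taxonomy.qza \
  --p-exclude 'o__unidentified' \
  --o-filtered-taxonomy taxonomy-filtered.qza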
# Keep every sequence, but strip the trailing Species Hypothesis (sh__) rank from each label.
qiime rescript edit-taxonomy \
  --i-taxonomy taxonomy.qza \
  --o-edited-taxonomy taxonomy-no-SH.qza \
  --p-search-strings ';sh__.*' \
  --p-replacement-strings '' \
  --p-use-regex
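# Compare the original, filtered, and SH-stripped taxonomies side by side.
# Caveat: all three artifacts expose a 'Taxon' column, and QIIME 2 may refuse to merge
# metadata with overlapping column names; if so, tabulate each file separately.
qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --m-input-file taxonomy-filtered.qza \
  --m-input-file taxonomy-no-SH.qza \
  --o-visualization taxonomy-compare.qzv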
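To sanity-check what the two curation steps do to individual labels without opening a visualization, the same logic can be roughly approximated with awk and sed on a plain TSV. This is only a sketch: the three rows and their SH codes are made up, the file stands in for a taxonomy table exported from `taxonomy.qza`, and it assumes unknown orders are labelled `o__unidentified`. It does not reproduce RESCRIPt's own matching rules exactly.

```shell
#!/bin/sh
# Toy stand-in for an exported taxonomy.tsv; all rows and SH codes are made up.
tax_tsv="$(mktemp)"
{
  printf 'Feature ID\tTaxon\n'
  printf 'a1\tk__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;f__Nectriaceae;g__Fusarium;s__Fusarium_poae;sh__SH0000001.00FU\n'
  printf 'a2\tk__Fungi;p__Ascomycota;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified;sh__SH0000002.00FU\n'
  printf 'a3\tk__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Agaricales;f__unidentified;g__unidentified;s__unidentified;sh__SH0000003.00FU\n'
} > "$tax_tsv"

# Count all labels (header excluded) and those with an identified order.
total=$(awk -F'\t' 'NR > 1 { n++ } END { print n + 0 }' "$tax_tsv")
kept=$(awk -F'\t' 'NR > 1 && $2 !~ /;o__unidentified(;|$)/ { n++ } END { print n + 0 }' "$tax_tsv")
echo "kept $kept of $total labels after the order-level filter"

# Same regex as the edit-taxonomy call: drop ';sh__' and everything after it.
# Only the first data row is printed, as a spot check.
no_sh=$(sed -E 's/;sh__.*$//' "$tax_tsv" | awk -F'\t' 'NR == 2 { print $2 }')
echo "$no_sh"

rm -f "$tax_tsv"
```

On a real release, comparing `kept` against `total` gives a quick feel for how many accessions the order-level filter would drop before training a classifier.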