tourmaline
Add generic taxonomy assignment using GenBank nt
Basic workflow:
- BLASTN with -outfmt 11 (BLAST archive, ASN.1)
- blast_formatter to get desired format
- Ignore "uncultured" hits (by keyword or taxid)
- Report NCBI taxonomy
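The keyword variant of the "uncultured" filter could look roughly like the sketch below. It assumes a tab-separated hit table with the subject title in column 3. That layout, the file names, and the example rows are made up for illustration and are not the actual output of parse_blast_xml.py.

```shell
# Hypothetical hit table: query ID, subject accession, subject title (tab-separated).
# Rows and accessions are placeholders, not real BLAST results.
printf 'seq1\tACC000001.1\tUncultured bacterium clone X 16S ribosomal RNA gene\n' > hits_example.tsv
printf 'seq1\tACC000002.1\tVibrio cholerae strain Y 16S ribosomal RNA gene\n' >> hits_example.tsv
printf 'seq2\tACC000003.1\tuncultured marine eukaryote 18S ribosomal RNA gene\n' >> hits_example.tsv

# Keep only hits whose title (column 3) does not contain "uncultured",
# compared case-insensitively.
awk -F'\t' 'tolower($3) !~ /uncultured/' hits_example.tsv > hits_filtered.tsv
```

Filtering by taxid instead would need a list of taxids to exclude. The keyword match is simpler, but it depends on how the subject titles are worded.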
Code:
cd tourmaline/02-denoised/dada2-pe
# megablast and store as ASN.1 (convertible to any format)
blastn \
-query representative_sequences.fasta \
-db $EXTDRIVE/databases/blast/nt/nt \
-negative_gilist $DB/blast/nt/env_metag_unclassified.gi \
-task megablast \
-max_target_seqs 5 \
-max_hsps 1 \
-num_threads 24 \
-outfmt 11 \
-out representative_sequences_vs_nt.asn
# convert ASN.1 to XML
blast_formatter \
-archive representative_sequences_vs_nt.asn \
-outfmt 5 \
-out representative_sequences_vs_nt.xml
# parse XML to TSV (requires Biopython)
parse_blast_xml.py \
representative_sequences_vs_nt.xml 1 > \
representative_sequences_vs_nt.tsv
# map accessions to taxids (from A to Z),
# lookup lineages, and
# format as qiime2 taxonomy
# RUN: blastn_tsv_to_taxonomy_tsv.ipynb
# OUTPUT: representative_sequences_vs_nt_tax.tsv, representative_sequences_taxonomy.tsv
# import as qiime2 taxonomy artifact
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format TSVTaxonomyFormat \
--input-path representative_sequences_taxonomy.tsv \
--output-path representative_sequences_taxonomy.qza
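For the "format as qiime2 taxonomy" step, the import above expects a tab-separated file whose header line is `Feature ID` and `Taxon`. A minimal sketch of that reshaping, done outside the notebook, is shown below. It assumes a two-column lineage table (feature ID, semicolon-joined lineage) with the best hit listed first for each feature. The file names and rows are hypothetical and may not match what blastn_tsv_to_taxonomy_tsv.ipynb produces.

```shell
# Hypothetical lineage table: feature ID, lineage (tab-separated), best hit first.
# Lineages are illustrative placeholders.
printf 'seq1\tBacteria; Proteobacteria; Gammaproteobacteria\n' > lineages_example.tsv
printf 'seq1\tBacteria; Proteobacteria\n' >> lineages_example.tsv
printf 'seq2\tEukaryota; Stramenopiles\n' >> lineages_example.tsv

# Write the header, then keep only the first (top) hit for each feature ID.
awk -F'\t' 'BEGIN { OFS = "\t"; print "Feature ID", "Taxon" }
            !seen[$1]++ { print $1, $2 }' lineages_example.tsv > taxonomy_example.tsv
```

Keeping only the first row per feature is one possible rule. A consensus across the top hits would be an alternative, at the cost of more logic.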