Add generic taxonomy assignment using GenBank nt

Open lukenoaa opened this issue 3 years ago • 0 comments

Basic workflow:

  • BLASTN with -outfmt 11 (ASN.1 archive)
  • blast_formatter to get desired format
  • Ignore "uncultured" hits (by keyword or taxid)
  • Report NCBI taxonomy
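
A rough sketch of the "ignore uncultured hits" step, filtering by keyword only. The column layout (`query`, `accession`, `description`, `bitscore`) is an assumption, since the real output of parse_blast_xml.py may differ, and the example rows and accessions are made-up placeholders:

```python
import csv
import io

# Assumed column layout; adjust to whatever parse_blast_xml.py actually emits.
COLUMNS = ["query", "accession", "description", "bitscore"]


def is_uncultured(description, keywords=("uncultured",)):
    """Return True if any keyword occurs in the hit description (case-insensitive)."""
    text = description.lower()
    return any(keyword in text for keyword in keywords)


def best_cultured_hits(rows):
    """Drop 'uncultured' hits, then keep the highest-bitscore hit per query."""
    best = {}
    for row in rows:
        if is_uncultured(row["description"]):
            continue
        current = best.get(row["query"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["query"]] = row
    return best


# Placeholder rows for illustration only (not real accessions or scores).
example = (
    "asv1\tACC0001\tUncultured bacterium clone X\t500\n"
    "asv1\tACC0002\tEscherichia coli strain K-12\t480\n"
    "asv2\tACC0003\tuncultured marine eukaryote\t300\n"
)
rows = list(csv.DictReader(io.StringIO(example), fieldnames=COLUMNS, delimiter="\t"))
hits = best_cultured_hits(rows)
for query, row in hits.items():
    print(query, row["accession"], row["description"])
```

Queries whose hits are all "uncultured" (asv2 here) drop out entirely and would need to be reported as unassigned downstream. Filtering by taxid instead would need a list of taxids to exclude, which is not shown here.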

Code:

cd tourmaline/02-denoised/dada2-pe

# megablast and store as ASN.1 (convertible to any format)
blastn \
-query representative_sequences.fasta \
-db $EXTDRIVE/databases/blast/nt/nt \
-negative_gilist $DB/blast/nt/env_metag_unclassified.gi \
-task megablast \
-max_target_seqs 5 \
-max_hsps 1 \
-num_threads 24 \
-outfmt 11 \
-out representative_sequences_vs_nt.asn

# convert ASN.1 to XML
blast_formatter \
-archive representative_sequences_vs_nt.asn \
-outfmt 5 \
-out representative_sequences_vs_nt.xml

# parse XML to TSV (requires BioPython)
parse_blast_xml.py \
representative_sequences_vs_nt.xml 1 > \
representative_sequences_vs_nt.tsv

# map accessions to taxids (from A to Z),
# lookup lineages, and
# format as qiime2 taxonomy
# RUN: blastn_tsv_to_taxonomy_tsv.ipynb
# OUTPUT: representative_sequences_vs_nt_tax.tsv, representative_sequences_taxonomy.tsv

# import as qiime2 taxonomy artifact
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format TSVTaxonomyFormat \
--input-path representative_sequences_taxonomy.tsv \
--output-path representative_sequences_taxonomy.qza
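
For reference, the notebook step (accession → taxid → lineage → QIIME 2 taxonomy TSV) could look roughly like the sketch below. This is not the contents of blastn_tsv_to_taxonomy_tsv.ipynb. The function names and the in-memory lookup dicts are hypothetical stand-ins for however the accession and lineage tables are really loaded, and the accession is a placeholder. It assumes the `Feature ID` / `Taxon` header that TSVTaxonomyFormat expects:

```python
import csv
import io


def build_taxonomy_rows(best_hits, acc_to_taxid, taxid_to_lineage):
    """Map each feature's best-hit accession to a '; '-joined lineage string.

    Features with no hit, no taxid, or no lineage are labelled 'Unassigned'.
    """
    rows = []
    for feature_id, accession in best_hits.items():
        taxid = acc_to_taxid.get(accession)
        lineage = taxid_to_lineage.get(taxid)
        taxon = "; ".join(lineage) if lineage else "Unassigned"
        rows.append((feature_id, taxon))
    return rows


def write_taxonomy_tsv(rows, handle):
    """Write rows as a two-column TSV with a 'Feature ID' / 'Taxon' header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Feature ID", "Taxon"])
    writer.writerows(rows)


# Hypothetical inputs: placeholder accession, NCBI taxid 562 (Escherichia coli).
best_hits = {"asv1": "ACC0002", "asv2": None}
acc_to_taxid = {"ACC0002": 562}
taxid_to_lineage = {
    562: ["Bacteria", "Pseudomonadota", "Gammaproteobacteria", "Enterobacterales",
          "Enterobacteriaceae", "Escherichia", "Escherichia coli"],
}

rows = build_taxonomy_rows(best_hits, acc_to_taxid, taxid_to_lineage)
buffer = io.StringIO()
write_taxonomy_tsv(rows, buffer)
print(buffer.getvalue(), end="")
```

Writing to an open file handle instead of the `StringIO` buffer would produce a file like representative_sequences_taxonomy.tsv for the import command above.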

lukenoaa · Mar 12 '21 22:03