ampliseq
ampliseq copied to clipboard
Phylogenetic placement of ASVs
Phylogenetic placement, using Pplacer, is a better way of getting a phylogenetic tree of short amplicons than making a phylogeny from an alignment of the ASVs on their own.
There's a QIIME2 module called q2-fragment-insertion for the purpose, or we can roll our own (see below). The builtin phylogeny in the QIIME2 module is "the 99% OTU Greengenes 13.8 tree with 203,452 tips", but I think one can specify ones own tree. Greengenes is, as you know, outdated.
Rolling our own would allow users to provide an alignment or hmm plus a phylogeny. The builtin would, I suggest, use GTDB species representative genome data. A problem here is that GTDB provides SSU sequences, but not a phylogeny built on them. We would therefore either have to estimate the phylogeny or assume that the protein alignment-based phylogeny is correct and build a reference SSU alignment from the sequences they provide.
Processing outline for GTDB:
- Decide which ASVs are archaeal and bacterial respectively. Can be done based on the taxonomic classification I think.
- Fetch SSU representative genome SSU sequences for archaea and bacteria respectively.
- Fetch the GTDB phylogenies. As far as I can see, only the full phylos are available, so we need to subset to the genomes present in the SSU sequences; this can be done with R.
- Align GTDB SSU sequences and ASVs. I suggest we use
hmmalign
to align to the hmms used e.g. by Barrnap and output only positions that match the profile. *) - Run Pplacer on the subset GTDB phylo and the alignment. *)
- Extract a phylogeny and subset to only ASVs
*) Candidates for nf-core modules.
I've done some testing that I'm summarizing here.
- Stage the following files: https://raw.githubusercontent.com/tseemann/barrnap/master/db/arc.hmm https://raw.githubusercontent.com/tseemann/barrnap/master/db/bac.hmm https://data.gtdb.ecogenomic.org/releases/latest/genomic_files_all/ssu_all.tar.gz (or the representatives, though quite few archaeal SSU sequences from representative genomes are complete enough). https://data.gtdb.ecogenomic.org/releases/latest/ar122.tree.gz https://data.gtdb.ecogenomic.org/releases/latest/bac120.tree.gz
- Extract the 16S hmm from the barrnap multi-hmms and extract archaeal and bacterial respectively from the
ssu_all.fna
hmmfetch arc.hmm 16S_rRNA > arc.16S_rRNA.hmm
hmmfetch bac.hmm 16S_rRNA > bac.16S_rRNA.hmm
grep -A 1 'd__Archaea' ssu_all_r95.fna | grep -v '^--' | sed 's/ .*//' > ar122_full.fna
grep -A 1 'd__Bacteria' ssu_all_r95.fna | grep -v '^--' | sed 's/ .*//' > bac120_full.fna
- Extract archaeal and bacterial sequences from
sequences.fasta
usingtaxonomy.tsv
toarcq.fna
andbacq.fna
respectively (R) - Align the GTDB sequences and the query sequences to the Barrnap hmms:
hmmalign arc.16S_rRNA.hmm ar122_full.fna | esl-alimask --rf-is-mask - | esl-alimanip --rffract 0.5 - | esl-reformat -n afa - > ar122_full.alnfna
hmmalign bac.16S_rRNA.hmm bac120_full.fna | esl-alimask --rf-is-mask - | esl-alimanip --rffract 0.5 - | esl-reformat -n afa - > bac120_full.alnfna
hmmalign arc.16S_rRNA.hmm arcq.fna | esl-alimask --rf-is-mask - | esl-reformat -n afa - > arcq.alnfna
hmmalign bac.16S_rRNA.hmm bacq.fna | esl-alimask --rf-is-mask - | esl-reformat -n afa - > bacq.alnfna
- Subset trees to the taxa present in alignments
#!/usr/bin/env Rscript
library(Biostrings)
library(ape)
s <- readDNAStringSet('ar122_full.alnfna')
t <- read.tree('ar122.tree.gz')
dl <- t$tip.label[!t$tip.label %in% names(s)]
write.tree(drop.tip(t, dl), 'ar122_full.newick')
# Repeat for bacteria (separate process)
- Run EPA-ng to place queries. (Supposedly faster, and I could get it to work.)
epa-ng -t ar122_full.newick -s ar122_full.alnfna -q arcq.alnfna -m GTR+G
# Repeat for bacteria (separate process)
# The above results in a epa_result.jplace file
- Gappa can be used to create a tree with both reference and query sequences:
gappa examine graft --jplace-path epa_result.jplace
Gappa can also be used to create Unifrac, aka Kantorovich-Rubinstein (KR), distances and many other things, so the epa_result.jplace
should be output to results
.
Phylogenetic placement is now implemented in nf-core subworkflow fasta_newick_epang_gappa
and in https://nf-co.re/phyloplace. Would you like to attempt to integrate it into ampliseq?
Absolutely. There are a couple of things we should discuss regarding the implementation, particularly with regards to reference trees and alignments. I thought we could discuss that when we meet next month.
Alright!
This issue was broken into https://github.com/nf-core/ampliseq/issues/561 & https://github.com/nf-core/ampliseq/issues/562, so I close this one here.