ampliseq icon indicating copy to clipboard operation
ampliseq copied to clipboard

Phylogenetic placement of ASVs

Open erikrikarddaniel opened this issue 3 years ago • 1 comments

Phylogenetic placement, using Pplacer, is a better way of getting a phylogenetic tree of short amplicons than making a phylogeny from an alignment of the ASVs on their own.

There's a QIIME2 module called q2-fragment-insertion for the purpose, or we can roll our own (see below). The builtin phylogeny in the QIIME2 module is "the 99% OTU Greengenes 13.8 tree with 203,452 tips", but I think one can specify ones own tree. Greengenes is, as you know, outdated.

Rolling our own would allow users to provide an alignment or hmm plus a phylogeny. The builtin would, I suggest, use GTDB species representative genome data. A problem here is that GTDB provides SSU sequences, but not a phylogeny built on them. We would therefore either have to estimate the phylogeny or assume that the protein alignment-based phylogeny is correct and build a reference SSU alignment from the sequences they provide.

Processing outline for GTDB:

  1. Decide which ASVs are archaeal and bacterial respectively. Can be done based on the taxonomic classification I think.
  2. Fetch SSU representative genome SSU sequences for archaea and bacteria respectively.
  3. Fetch the GTDB phylogenies. As far as I can see, only the full phylos are available, so we need to subset to the genomes present in the SSU sequences; this can be done with R.
  4. Align GTDB SSU sequences and ASVs. I suggest we use hmmalign to align to the hmms used e.g. by Barrnap and output only positions that match the profile. *)
  5. Run Pplacer on the subset GTDB phylo and the alignment. *)
  6. Extract a phylogeny and subset to only ASVs

*) Candidates for nf-core modules.

erikrikarddaniel avatar Mar 23 '21 09:03 erikrikarddaniel

I've done some testing that I'm summarizing here.

  1. Stage the following files: https://raw.githubusercontent.com/tseemann/barrnap/master/db/arc.hmm https://raw.githubusercontent.com/tseemann/barrnap/master/db/bac.hmm https://data.gtdb.ecogenomic.org/releases/latest/genomic_files_all/ssu_all.tar.gz (or the representatives, though quite few archaeal SSU sequences from representative genomes are complete enough). https://data.gtdb.ecogenomic.org/releases/latest/ar122.tree.gz https://data.gtdb.ecogenomic.org/releases/latest/bac120.tree.gz
  2. Extract the 16S hmm from the barrnap multi-hmms and extract archaeal and bacterial respectively from the ssu_all.fna
hmmfetch arc.hmm 16S_rRNA > arc.16S_rRNA.hmm
hmmfetch bac.hmm 16S_rRNA > bac.16S_rRNA.hmm
grep -A 1 'd__Archaea' ssu_all_r95.fna | grep -v '^--' | sed 's/ .*//' > ar122_full.fna
grep -A 1 'd__Bacteria' ssu_all_r95.fna | grep -v '^--' | sed 's/ .*//' > bac120_full.fna
  1. Extract archaeal and bacterial sequences from sequences.fasta using taxonomy.tsv to arcq.fna and bacq.fna respectively (R)
  2. Align the GTDB sequences and the query sequences to the Barrnap hmms:
hmmalign arc.16S_rRNA.hmm ar122_full.fna  | esl-alimask --rf-is-mask - | esl-alimanip --rffract 0.5 - | esl-reformat -n afa - > ar122_full.alnfna
hmmalign bac.16S_rRNA.hmm bac120_full.fna | esl-alimask --rf-is-mask - | esl-alimanip --rffract 0.5 - | esl-reformat -n afa - > bac120_full.alnfna
hmmalign arc.16S_rRNA.hmm arcq.fna | esl-alimask --rf-is-mask - | esl-reformat -n afa - > arcq.alnfna
hmmalign bac.16S_rRNA.hmm bacq.fna | esl-alimask --rf-is-mask - | esl-reformat -n afa - > bacq.alnfna
  1. Subset trees to the taxa present in alignments
#!/usr/bin/env Rscript

library(Biostrings)
library(ape)

s <- readDNAStringSet('ar122_full.alnfna')
t <- read.tree('ar122.tree.gz')

dl <- t$tip.label[!t$tip.label %in% names(s)]

write.tree(drop.tip(t, dl), 'ar122_full.newick')
# Repeat for bacteria (separate process)
  1. Run EPA-ng to place queries. (Supposedly faster, and I could get it to work.)
epa-ng -t ar122_full.newick -s ar122_full.alnfna -q arcq.alnfna -m GTR+G
# Repeat for bacteria (separate process)
# The above results in a epa_result.jplace file
  1. Gappa can be used to create a tree with both reference and query sequences:
gappa examine graft --jplace-path epa_result.jplace

Gappa can also be used to create Unifrac, aka Kantorovich-Rubinstein (KR), distances and many other things, so the epa_result.jplace should be output to results.

erikrikarddaniel avatar Mar 25 '21 20:03 erikrikarddaniel

Phylogenetic placement is now implemented in nf-core subworkflow fasta_newick_epang_gappa and in https://nf-co.re/phyloplace. Would you like to attempt to integrate it into ampliseq?

d4straub avatar Feb 17 '23 12:02 d4straub

Absolutely. There are a couple of things we should discuss regarding the implementation, particularly with regards to reference trees and alignments. I thought we could discuss that when we meet next month.

erikrikarddaniel avatar Feb 17 '23 12:02 erikrikarddaniel

Alright!

d4straub avatar Feb 17 '23 12:02 d4straub

This issue was broken into https://github.com/nf-core/ampliseq/issues/561 & https://github.com/nf-core/ampliseq/issues/562, so I close this one here.

d4straub avatar Mar 27 '23 11:03 d4straub