get_phylomarkers icon indicating copy to clipboard operation
get_phylomarkers copied to clipboard

A pipeline to select optimal markers for microbial phylogenomics and species tree estimation using the multispecies coalescent and concatenation approaches

GET_PHYLOMARKERS

Build Status Publication GPLv3-like license DockerHub

GET_PHYLOMARKERS (Vinuesa et al., 2018) is primarily an open-source software package to select optimal markers for microbial phylogenomics and species tree estimation using the multispecies coalescent and concatenation approaches. It implements a bioinformatics pipeline to filter orthologous gene clusters computed by the companion package GET_HOMOLOGUES to select those with optimal attributes for phylogenetic inference using maximum likelihood (ML). Multiple sequence alignments of loci passing the filters are concatenated into a supermatrix to estimate a species tree under the ML criterion using the state-of-the-art fast ML tree searching algorithms FastTree or IQ-TREE. We have also tested it with plant coding sequences (details here).

Starting with release 2.0.0 (2022-11-20), a concatenation-independent species tree is computed from the ML gene trees estimated from top-scoring alignments using ASTRAL-III.

Release 2.1.0 (2024-03-31) implemented the maximal matched-pairs tests to asses violations of the data to the Stationarity, Reversibility, and Homogeneity (SRH) assumptions made by maximum-likelihood phylogenetic models, as implemented in IQ-TREE. In addition, ASTRAL-IV is used to estimate the species tree directly from the filtered ML source gene trees, which computes terminal and internal branch lengths in substitution-per-site units.

Release 2.2.0 (v2.2.0, 2024-04-14) contains new features, significant code improvements, binary updates, and some bug fixes. Among the new features, we would like to highlight the following ones:

-significant extension of run mode 2 (run_get_phylomarkers_pipeline.sh -R 2) for population genetics (multiple sequences from the same species):

  • run_get_phylomarkers_pipeline.sh now also calls the C binary WEIGHTED-ASTRAL to estimate a species tree using as input the filtered gene trees estimated by iqtree2 or FastTree from the core-genome clusters computed by get_homologues.

  • The main script run_get_phylomarkers_pipeline.sh added complex protein mixture models for concatenated protein alignments, prints script invocation arguments to STDOU at the beginning of the run, and prints a summary results of the different filtering steps before exiting. Read the Release v2.2.0) notes for further details.

  • Users of this release interested in run mode 2 (-R 2) to compute a population tree, will need to install snp-sites, as explained in the current INSTALL.md notes.

  • Linux users that cloned the the GET_PHYLOMARKERS GitHub repository on 2024-04-15, or later, will not have to install snp-sites manually, because a static binary is provided (bin/linux/snp-sites-static), which is called by run_get_phylomarkers_pipeline.sh v2.8.1.0_2024-04-15. Thanks to Alfredo Hernández @ccg_unam for compiling snp-sites-static.

  • Release 2.2.0 (2024-04-14) was used to build the latest Docker image ready to pull from Docker Hub. On Dockerhub, you will find detailed instructions on installing and configuring the Docker client on your machine, pulling the latest image, and running the containerized instance of the GET_PHYLOMARKERS pipeline.

GET_PHYLOMARKERS can also estimate ML and parsimony trees from the pan-genome matrix, including unsupervised learning methods to determine the optimal number of clusters from the pan-genome and average genomic distance matrices. A detailed manual and step-by-step tutorials document the software and help the user to get up and running quickly. For convenience, html and markdown versions of the documentation material are available.

Installation, dependencies and Docker image

For detailed instructions and dependencies please check INSTALL.md.

A GET_PHYLOMARKERS Docker image is available, as well as an image bundling GET_PHYLOMARKERS + GET_HOMOLOGUES, ready to use. Detailed instructions for setting up the Docker environment are provided in INSTALL.md. How to run container instances with the test sequences distributed with GET_PHYLOMARKERS is described in the tutorial.

Aim

GET_PHYLOMARKERS (Vinuesa et al. 2018) implements a series of sequential filters (detailed below) to selects markers from the homologous gene clusters produced by GET_HOMOLOGUES with optimal attributes for phylogenomic inference. It estimates gene-trees and species-trees under the maximum likelihood (ML) optimality criterion using state-of-the-art fast ML tree searching algorithms. The species tree is estimated from the supermatrix of concatenated, top-scoring alignments that passed the quality filters outlined in the figures below and explained in detail in the manual and publication.

flowchart Filtering actions. GET_PHYLOMARKERS

Figure 1A. Simplified flow-chart of the GET_PHYLOMARKERS pipeline showing only those parts used and described in this work. The left branch, starting at the top of the diagram, is fully under control of the master script run_get_phylomarkes_pipeline.sh. The names of the worker scripts called by the master program are indicated on the relevant points along the flow, as detailed in the manual. The image corresponds to Fig. 1 of Vinuesa et al. 2018.

Figure 1B. Combined filtering actions performed by GET_HOMOLOGUES and GET_PHYLOMARKERS to select top-ranking phylogenetic markers to be concatenated for phylogenomic analyses, and benchmark results of the performance of the FastTree (FT) and IQ-TREE (IQT) maximum-likelihood (ML) phylogeny inference programs. The image corresponds to Fig. 3 of Vinuesa et al. 2018.

GET_HOMOLOGUES is a genome-analysis software package for microbial pan-genomics and comparative genomics originally described in the following publications:

More recently we developed GET_HOMOLOGUES-EST, which can be used to cluster eukaryotic genes and transcripts, as described in Contreras-Moreira et al, Front. Plant Sci. 2017.

If GET_HOMOLOGUES_EST is fed both .fna and .faa files of CDS sequences it will produce identical output to that of GET_HOMOLOGUES and thus can be analyzed with GET_PHYLOMARKERS all the same.


GET_PHYLOMARKERS is primarily tailored towards selecting CDSs (gene markers) to infer DNA-level phylogenies of different species of the same genus or family. It can also select optimal markers for population genetics, when the source genomes belong to the same species (Vinuesa et al. 2018). For more divergent genome sequences, classified in different genera, families, orders or higher taxa, the pipeline should be run using protein instead of DNA sequences.

ML core-genome phylogeny of Stenotrophomonas ML core-genome phylogeny of Stenotrophomonas

Figure 2A. Best maximum-likelihood core-genome phylogeny for the genus Stenotrophomonas found in the IQ-TREE search, based on the supermatrix obtained by concatenation of 55 top-ranking alignments. The image corresponds to Fig. 5 of Vinuesa et al. 2018.

Figure 2B. Maximum-likelihood pan-genome phylogeny estimated with IQ-TREE from the consensus pan-genome clusters displayed in the Venn diagram. Clades of lineages belonging to the S. maltophilia complex are collapsed and are labeled as in Figure 2A. Numbers on the internal nodes represent the approximate Bayesian posterior probability/UFBoot2 bipartition support values (see methods). The tabular inset shows the results of fitting either the binary (GTR2) or morphological (MK) models implemented in IQ-TREE, indicating that the former has an overwhelmingly better fit. The scale bar represents the number of expected substitutions per site under the binary GTR2+F0+R4 substitution model. The image corresponds to Fig. 6 of Vinuesa et al. 2018.


Manual and tutorials

Please, follow the links for a detailed manual and tutorials, including a graphical flowchart of the pipeline and explanations of the implementation details. See also our plant tutorial.

Citation.

Pablo Vinuesa, Luz-Edith Ochoa-Sanchez and Bruno Contreras-Moreira (2018). GET_PHYLOMARKERS, a software package to select optimal orthologous clusters for phylogenomics and inferring pan-genome phylogenies, used for a critical geno-taxonomic revision of the genus Stenotrophomonas. Front. Microbiol. | doi: 10.3389/fmicb.2018.00771

Published in the Research Topic on "Microbial Taxonomy, Phylogeny and Biodiversity" http://journal.frontiersin.org/researchtopic/5493/microbial-taxonomy-phylogeny-and-biodiversity

Code

Developers

The code is developed and maintained by Pablo Vinuesa at CCG-UNAM, Mexico and Bruno Contreras-Moreira at EEAD-CSIC, Spain.

Acknowledgements

We thank Alfredo J. Hernández and Víctor del Moral at CCG-UNAM for technical support with server administration.

Funding

We gratefully acknowledge the funding provided over the years by DGAPA-PAPIIT/UNAM (grants IN201806-2, IN211814, IN206318, and IN216424) and CONAHCyT-Mexico (grants P1-60071, 179133 and A1-S-11242) to Pablo Vinuesa, as well as the Fundación ARAID,Consejo Superior de Investigaciones Científicas (grant 200720I038 and Spanish MINECO (AGL2013-48756-R) to Bruno Contreras-Moreira.