http://www.metapiga.org/welcome.html
MetaPIGA 2 is a robust implementation of several stochastic heuristics for large phylogeny inference (under maximum likelihood), including a random-restart hill climbing, a simulated annealing algorithm, a classical genetic algorithm, and the metapopulation genetic algorithm (metaGA) together with complex substitution models, discrete Gamma rate heterogeneity, and the possibility to partition data. MetaPIGA 2 handles nucleic-acid and protein datasets as well as morphological (presence/absence) data. The benefits of the metaGA (Lemmon & Milinkovitch 2002; PNAS, 99: 10516-10521) are as follows: (i) it resolves the major problem inherent to classical Genetic Algorithms (i.e., the need to choose between strong selection, hence, speed, and weak selection, hence, accuracy) by maintaining high inter-population variation even under strong intra-population selection, and (ii) it can generate branch support values that approximate posterior probabilities.
The software
MetaPIGA 2 also implements:
- Simple dataset quality control (testing for the presence of identical sequences as well as for excessively ambiguous or excessively divergent sequences);
- Automated trimming of poorly aligned regions using the trimAl algorithm;
- The Likelihood Ratio Test, the Akaike Information Criterion, and the Bayesian Information Criterion for the easy selection of nucleotide and amino-acid substitution models that best fit the data;
- Ancestral-state reconstruction of all nodes in the tree.
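As a rough illustration of the information criteria named in the list above (this is not MetaPIGA's own code), the sketch below ranks three candidate substitution models by AIC and BIC. The log-likelihood values, parameter counts and alignment length are hypothetical numbers chosen for the example. Only free substitution-model parameters are counted, on the assumption that branch lengths are shared by all candidates.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits: model name -> (maximized log-likelihood, free parameters)
candidates = {
    "JC69": (-5230.4, 0),
    "HKY85": (-5101.7, 4),
    "GTR": (-5098.9, 8),
}
n_sites = 1200  # hypothetical alignment length

best_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_bic = min(candidates, key=lambda m: bic(*candidates[m], n_sites))
print("best by AIC:", best_aic)
print("best by BIC:", best_bic)
```

BIC penalizes each extra parameter by ln(n) rather than 2, so on long alignments it tends to favor simpler models than AIC does.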
MetaPIGA 2 provides high customization of heuristics' and models' parameters, manual batch file and command line processing. However, it also offers an extensive and ergonomic graphical user interface and functionalities assisting the user for dataset quality testing, parameters setting, generating and running batch files, following run progress, and manipulating result trees.
MetaPIGA 2 uses standard formats for data sets and trees, is platform independent, runs on 32- and 64-bit systems, and takes advantage of multiprocessor and/or multicore computers. A version for Grid computing is in development.
Citing MetaPIGA 2
MetaPIGA v2.0: maximum likelihood large phylogeny estimation using the metapopulation genetic algorithm and other stochastic heuristics
Raphaël Helaers & Michel C. Milinkovitch
BMC Bioinformatics 2010, 11:379
http://bioinformatics.oxfordjournals.org/content/25/2/197.full
Phylogenetic inference under recombination using Bayesian stochastic topology selection
Abstract
Motivation: Conventional phylogenetic analysis for characterizing the relatedness between taxa typically assumes that a single relationship exists between species at every site along the genome. This assumption fails to take into account recombination which is a fundamental process for generating diversity and can lead to spurious results. Recombination induces a localized phylogenetic structure which may vary along the genome. Here, we generalize a hidden Markov model (HMM) to infer changes in phylogeny along multiple sequence alignments while accounting for rate heterogeneity; the hidden states refer to the unobserved phylogenetic topology underlying the relatedness at a genomic location. The dimensionality of the number of hidden states (topologies) and their structure are random (not known a priori) and are sampled using Markov chain Monte Carlo algorithms. The HMM structure allows us to analytically integrate out over all possible changepoints in topologies as well as all the unknown branch lengths.
Results: We demonstrate our approach on simulated data and also to the genome of a suspected HIV recombinant strain as well as to an investigation of recombination in the sequences of 15 laboratory mouse strains sequenced by Perlegen Sciences. Our findings indicate that our method allows us to distinguish between rate heterogeneity and variation in phylogeny caused by recombination without being restricted to 4-taxa data.
Availability: The method has been implemented in JAVA and is available, along with data studied here, from http://www.stats.ox.ac.uk/~webb.
Contact: cholmes@stats.ox.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
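To make the HMM idea in the abstract above concrete, here is a minimal sketch of the forward recursion that sums over every possible sequence of topologies (and therefore over all changepoints) along an alignment. It is a simplification under stated assumptions, not the authors' method: the set of candidate topologies is fixed, whereas the paper samples it by MCMC, and the per-site likelihoods are taken as given numbers instead of being computed with branch lengths integrated out. All numeric values are hypothetical.

```python
import math


def forward_loglik(site_liks, trans, init):
    """Log-likelihood of an alignment under a topology HMM.

    site_liks[t][k]: likelihood of alignment column t under topology k
    trans[j][k]:     probability of switching from topology j to k
    init[k]:         prior probability of topology k at the first column
    """
    n_states = len(init)
    alpha = [init[k] * site_liks[0][k] for k in range(n_states)]
    scale = sum(alpha)
    loglik = math.log(scale)
    alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    for col in site_liks[1:]:
        alpha = [
            col[k] * sum(alpha[j] * trans[j][k] for j in range(n_states))
            for k in range(n_states)
        ]
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik


# Hypothetical example: two topologies, four alignment columns
site_liks = [[0.20, 0.05], [0.18, 0.06], [0.04, 0.22], [0.05, 0.25]]
trans = [[0.95, 0.05], [0.05, 0.95]]  # topology changes are rare
init = [0.5, 0.5]
print(forward_loglik(site_liks, trans, init))
```

The cost is linear in the number of columns and quadratic in the number of topologies, which is why the number of hidden states matters so much in this kind of model.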
http://www.stats.ox.ac.uk/__data/assets/pdf_file/0005/4010/large_pedigrees.pdf
http://www.cs.cmu.edu/~guestrin/Class/10701-S07/Handouts/recitations/HMM-inference.pdf
Probabilistic Phylogenetic Inference with Insertions and Deletions
Abstract
A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models only account for residue substitution events. We describe a probabilistic model of a multiple sequence alignment that accounts for insertion and deletion events in addition to substitutions, given a phylogenetic tree, using a rate matrix augmented by the gap character. Starting from a continuous Markov process, we construct a non-reversible generative (birth–death) evolutionary model for insertions and deletions. The model assumes that insertion and deletion events occur one residue at a time. We apply this model to phylogenetic tree inference by extending the program DNAML in PHYLIP. Using standard benchmarking methods on simulated data and a new "concordance test" benchmark on real ribosomal RNA alignments, we show that the extended program DNAMLε improves accuracy relative to the usual approach of ignoring gaps, while retaining the computational efficiency of the Felsenstein peeling algorithm.
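The following sketch shows how the Felsenstein peeling (pruning) recursion mentioned above carries over unchanged when the gap is treated as a fifth character. It is a toy under loud assumptions: the transition probabilities come from a symmetric, reversible equal-rates model on five states, which is not the non-reversible birth-death model of the paper, and the tree encoding, function names and branch lengths are hypothetical.

```python
import math

ALPHABET = "ACGT-"  # the gap is treated as a fifth state


def transition_prob(i, j, t):
    """P(state j after time t | state i) under an equal-rates 5-state model.

    Assumption: a Jukes-Cantor-style toy model, not the paper's rate matrix.
    """
    k = len(ALPHABET)
    decay = math.exp(-k / (k - 1) * t)
    if i == j:
        return 1 / k + (k - 1) / k * decay
    return (1 - decay) / k


def partial_likelihoods(node, column):
    """Peeling step. A node is a leaf name or (left, t_left, right, t_right)."""
    if isinstance(node, str):
        return [1.0 if s == column[node] else 0.0 for s in ALPHABET]
    left, t_left, right, t_right = node
    l_left = partial_likelihoods(left, column)
    l_right = partial_likelihoods(right, column)
    k = len(ALPHABET)
    result = []
    for i in range(k):
        from_left = sum(transition_prob(i, j, t_left) * l_left[j] for j in range(k))
        from_right = sum(transition_prob(i, j, t_right) * l_right[j] for j in range(k))
        result.append(from_left * from_right)
    return result


def column_likelihood(tree, column):
    """Likelihood of one alignment column with a uniform root distribution."""
    root = partial_likelihoods(tree, column)
    return sum(p / len(ALPHABET) for p in root)


# Hypothetical three-taxon tree and one gapped alignment column
tree = (("a", 0.1, "b", 0.1), 0.05, "c", 0.2)
column = {"a": "A", "b": "-", "c": "A"}
print(column_likelihood(tree, column))
```

Because the recursion only needs a transition probability for each pair of states, enlarging the alphabet from four to five states changes the constant factor but not the overall cost of the algorithm.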
Author Summary
We describe a computationally efficient method to use insertion and deletion events, in addition to substitutions, in phylogenetic inference. To date, many evolutionary models in probabilistic phylogenetic inference methods have only accounted for substitution events, not for insertions and deletions. As a result, not only do tree inference methods use less sequence information than they could, but also it has remained difficult to integrate phylogenetic modeling into sequence alignment methods (such as profiles and profile-hidden Markov models) that inherently require a model of insertion and deletion events. Therefore an important goal in the field has been to develop tractable evolutionary models of insertion/deletion events over time of sufficient accuracy to increase the resolution of phylogenetic inference methods and to increase the power of profile-based sequence homology searches. Our model offers a partial answer to this problem. We show that our model generally improves inference power in both simulated and real data and that it is easily implemented in the framework of standard inference packages with little effect on computational efficiency (we extended DNAML, in Felsenstein's popular PHYLIP package).
Citation: Rivas E, Eddy SR (2008) Probabilistic Phylogenetic Inference with Insertions and Deletions. PLoS Comput Biol 4(9): e1000172. doi:10.1371/journal.pcbi.1000172
Editor: David Haussler, University of California Santa Cruz, United States of America
Received: October 24, 2007;
Accepted: July 31, 2008;
Published: September 19, 2008
Copyright: © 2008 Rivas, Eddy. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was funded by the Howard Hughes Medical Institute.
Competing interests: The authors have declared that no competing interests exist.
* E-mail: rivase@janelia.hhmi.org
Materials and Methods
The C source code for the modified PHYLIP 3.66 package [14] that contains the program DNAMLε, the C source code for evolving sequences with the generative model (εRATE), the modified ROSE package (version 1.3) [76], as well as all the Perl scripts and datasets used to generate the results presented in this paper are provided as a tarball in Dataset S1. The program DNAMLε uses the EASEL sequence analysis library (SRE, unpublished) which is also provided.
Roland F. Schwarz, William Fletcher, Frank Förster, Benjamin Merget, Matthias Wolf, Jörg Schultz, and Florian Markowetz
PLoS One. 2010; 5(12): e15788. Published online 2010 December 31. doi: 10.1371/journal.pone.0015788
PMCID: PMC3013127
Bhakti Dwivedi and Sudhindra R Gadagkar
BMC Evol Biol. 2009; 9: 211. Published online 2009 August 23. doi: 10.1186/1471-2148-9-211
PMCID: PMC2746219
Title: Genomes as documents of evolutionary history
Author(s): Boussau B, Daubin V
Source: TRENDS IN ECOLOGY & EVOLUTION Volume: 25 Issue: 4 Pages: 224-232 Published: APR 2010
Times Cited: 2
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2746219/?tool=pmcentrez
Phylogenetic inference under varying proportions of indel-induced alignment gaps
Bhakti Dwivedi (1) and Sudhindra R Gadagkar (1,2)
(1) Department of Biology, University of Dayton, 300 College Park, Dayton, OH 46469-2320, USA
(2) Department of Natural Sciences, PO Box 1004, 1400 Brush Row Rd, Wilberforce, Ohio 45384, USA
Received May 11, 2009; Accepted August 23, 2009.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
The effect of alignment gaps on phylogenetic accuracy has been the subject of numerous studies. In this study, we investigated the relationship between the total number of gapped sites and phylogenetic accuracy, when the gaps were introduced (by means of computer simulation) to reflect indel (insertion/deletion) events during the evolution of DNA sequences. The resulting (true) alignments were subjected to commonly used gap treatment and phylogenetic inference methods.
Results
(1) In general, there was a strong – almost deterministic – relationship between the amount of gap in the data and the level of phylogenetic accuracy when the alignments were very "gappy", (2) gaps resulting from deletions (as opposed to insertions) contributed more to the inaccuracy of phylogenetic inference, (3) the probabilistic methods (Bayesian, PhyML & "MLε," a method implemented in DNAML in PHYLIP) performed better at most levels of gap percentage when compared to parsimony (MP) and distance (NJ) methods, with Bayesian analysis being clearly the best, (4) methods that treat gapped sites as missing data yielded less accurate trees when compared to those that attribute phylogenetic signal to the gapped sites (by coding them as binary character data – presence/absence, or as in the MLε method), and (5) in general, the accuracy of phylogenetic inference depended upon the amount of available data when the gaps resulted from mainly deletion events, and the amount of missing data when insertion events were equally likely to have caused the alignment gaps.
Conclusion
When gaps in an alignment are a consequence of indel events in the evolution of the sequences, the accuracy of phylogenetic analysis is likely to improve if: (1) alignment gaps are categorized as arising from insertion events or deletion events and then treated separately in the analysis, (2) the evolutionary signal provided by indels is harnessed in the phylogenetic analysis, and (3) methods that utilize the phylogenetic signal in indels are developed for distance methods too. When the true homology is known and the amount of gaps is 20 percent of the alignment length or less, the methods used in this study are likely to yield trees with 90–100 percent accuracy.
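Two quantities from the abstract above are easy to compute for any alignment: the share of gapped sites, and a binary presence/absence coding of the gapped columns. The sketch below is one plausible reading, not the authors' scripts. It assumes that a "gapped site" is any column containing at least one gap character and that gaps are written as "-"; the function names and the tiny alignment are hypothetical.

```python
def gapped_columns(alignment):
    """Indices of alignment columns that contain at least one gap."""
    names = sorted(alignment)
    length = len(alignment[names[0]])
    return [c for c in range(length) if any(alignment[n][c] == "-" for n in names)]


def gapped_site_fraction(alignment):
    """Share of columns with a gap (assumed meaning of 'amount of gaps')."""
    length = len(next(iter(alignment.values())))
    return len(gapped_columns(alignment)) / length


def binary_gap_matrix(alignment):
    """Code each gapped column as presence (1) or absence (0) of a residue."""
    cols = gapped_columns(alignment)
    return {
        name: "".join("0" if seq[c] == "-" else "1" for c in cols)
        for name, seq in alignment.items()
    }


# Hypothetical toy alignment
alignment = {"a": "ACG-T", "b": "AC--T", "c": "ACGTT"}
print(gapped_site_fraction(alignment))
print(binary_gap_matrix(alignment))
```

The binary matrix can be appended to the nucleotide data as an extra partition, which is how presence/absence coding lets a tree search use the signal in the gaps instead of treating them as missing data.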
PICS-Ord: unlimited coding of ambiguous regions by pairwise identity and cost scores ordination
Robert Lücking, Brendan P Hodkinson, Alexandros Stamatakis, and Reed A Cartwright
BMC Bioinformatics. 2011; 12: 10. Published online 2011 January 7. doi: 10.1186/1471-2105-12-10. PMCID: PMC3024941
Phylogenetic assessment of alignments reveals neglected tree signal in gaps
Christophe Dessimoz and Manuel Gil
Genome Biol. 2010; 11(4): R37. Published online 2010 April 6. doi: 10.1186/gb-2010-11-4-r37. PMCID: PMC2884540
Enhancing the quality of phylogenetic analysis using fuzzy hidden Markov model alignments.
Source
Lab of Medical Informatics, Faculty of Medicine, Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, Greece.
Abstract
Any effective phylogeny inference based on molecular data begins by performing efficient multiple sequence alignments. So far, the Hidden Markov Model (HMM) method for multiple sequence alignment has been proved competitive to the classical deterministic algorithms with respect to phylogenetic analysis; nevertheless, its stochastic nature does not help it cope with the existing dependence among the sequence elements. This paper deals with phylogenetic analysis of protein and gene data using multiple sequence alignments produced by fuzzy profile Hidden Markov Models. Fuzzy profile HMMs are a novel type of profile HMMs based on fuzzy sets and fuzzy integrals, which generalize the classical stochastic HMM by relaxing its independence assumptions. In this paper, alignments produced by the fuzzy HMM model are used in phylogenetic analysis of protein data, enhancing the quality of phylogenetic trees. The new methodology is implemented in HPV virus phylogenetic inference. The results of the analysis are compared against those obtained by the classical profile HMM model and depict the superiority of the fuzzy profile HMM in this field.
Discriminating between rate heterogeneity and interspecific recombination in DNA sequence alignments with phylogenetic factorial hidden Markov models.
Source
Biomathematics and Statistics Scotland, Edinburgh, UK. dirk@bioss.ac.uk
Abstract
MOTIVATION:
A recently proposed method for detecting recombination in DNA sequence alignments is based on the combination of hidden Markov models (HMMs) with phylogenetic trees. Although this method was found to detect breakpoints of recombinant regions more accurately than most existing techniques, it inherently fails to distinguish between recombination and rate variation. In the present paper, we propose to marry the phylogenetic tree to a factorial HMM (FHMM). The states of the first hidden chain represent tree topologies, whereas the states of the second independent hidden chain represent different global scaling factors of the branch lengths. Inference is done in terms of a hierarchical Bayesian model, where parameters and hidden states are sampled from the posterior distribution with Gibbs sampling.
RESULTS:
We have tested the proposed model on various synthetic and real-world DNA sequence alignments. The simulation results suggest that as opposed to the standard phylogenetic HMM, the phylogenetic FHMM clearly distinguishes between recombination and rate heterogeneity and thereby avoids the prediction of spurious recombinant regions.
AVAILABILITY:
The proposed method has been implemented in a MATLAB package that extends Kevin Murphy's HMM toolbox. Software and data used in our study are available from http://www.bioss.sari.ac.uk/~dirk/Supplements
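The factorial structure described in the abstract above can be sketched by building the joint state space of the two hidden chains. Each joint state is a (topology, rate scaling) pair, and because the chains are independent, the joint transition probability is the product of the two per-chain probabilities. This only illustrates the state-space factorization; it is not the MATLAB package, it does not perform Gibbs sampling, and all transition probabilities are hypothetical.

```python
from itertools import product


def joint_transition(topo_trans, rate_trans):
    """Joint transition matrix of two independent hidden chains.

    topo_trans[i][j]: probability of moving from topology i to j
    rate_trans[a][b]: probability of moving from rate scaling a to b
    Returns the list of (topology, rate) states and the joint matrix.
    """
    states = list(product(range(len(topo_trans)), range(len(rate_trans))))
    matrix = [
        [topo_trans[t0][t1] * rate_trans[r0][r1] for (t1, r1) in states]
        for (t0, r0) in states
    ]
    return states, matrix


# Hypothetical chains: three topologies, two branch-length scaling factors
topo_trans = [
    [0.98, 0.01, 0.01],
    [0.01, 0.98, 0.01],
    [0.01, 0.01, 0.98],
]
rate_trans = [[0.9, 0.1], [0.2, 0.8]]

states, joint = joint_transition(topo_trans, rate_trans)
print(len(states), "joint states")
print(joint[0])
```

A change in the rate chain alone leaves the topology untouched, which is what lets this kind of model explain a fast-evolving region without reporting a recombination breakpoint.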