Monday, 4 July 2011

Novel Phylogenetic Inference Software!

 http://www.metapiga.org/welcome.html
MetaPIGA 2 is a robust implementation of several stochastic heuristics for large phylogeny inference (under maximum likelihood), including random-restart hill climbing, a simulated annealing algorithm, a classical genetic algorithm, and the metapopulation genetic algorithm (metaGA), together with complex substitution models, discrete Gamma rate heterogeneity, and the possibility to partition data. MetaPIGA 2 handles nucleic-acid and protein datasets as well as morphological (presence/absence) data. The benefits of the metaGA (Lemmon & Milinkovitch 2002; PNAS, 99: 10516-10521) are as follows: (i) it resolves the major problem inherent to classical genetic algorithms (i.e., the need to choose between strong selection, hence speed, and weak selection, hence accuracy) by maintaining high inter-population variation even under strong intra-population selection, and (ii) it can generate branch support values that approximate posterior probabilities.
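To make the simulated annealing heuristic mentioned above more concrete, here is a minimal sketch of the Metropolis acceptance rule that such searches typically rely on. It is not MetaPIGA's implementation: the function name `accept_move`, its parameters, and the idea of scoring trees by log-likelihood difference alone are assumptions made for illustration.

```python
import math
import random

def accept_move(current_lnl, proposed_lnl, temperature, rng=random):
    """Metropolis rule (illustrative, not MetaPIGA's code).

    Improvements are always accepted. A worse tree is accepted with
    probability exp(delta / temperature), so a high temperature lets
    the search escape local optima and a low one makes it greedy.
    """
    delta = proposed_lnl - current_lnl
    if delta >= 0:
        return True
    if temperature <= 0:
        return False  # zero temperature degenerates to hill climbing
    return rng.random() < math.exp(delta / temperature)
```

With the temperature fixed at zero, this rule reduces to plain hill climbing, which is one way to see how the heuristics in the list relate to each other.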
The software MetaPIGA 2 also implements:

  • Simple dataset quality control (testing for the presence of identical sequences as well as for excessively ambiguous or excessively divergent sequences);
  • Automated trimming of poorly aligned regions using the trimAl algorithm;
  • The Likelihood Ratio Test, the Akaike Information Criterion, and the Bayesian Information Criterion for the easy selection of nucleotide and amino-acid substitution models that best fit the data;
  • Ancestral-state reconstruction of all nodes in the tree.
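The model-selection criteria named in the list above follow standard formulas, which can be sketched as below. This is only an illustration of the criteria themselves, not of how MetaPIGA computes them: the function names, the log-likelihood values, and the use of the number of alignment sites as the sample size are all assumptions. The parameter counts (0 for JC69, 4 for HKY85, 8 for GTR) cover substitution-model parameters only and leave out branch lengths.

```python
import math

def aic(lnl, k):
    """Akaike Information Criterion: 2k - 2 lnL (lower is better)."""
    return 2 * k - 2 * lnl

def bic(lnl, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 lnL (lower is better)."""
    return k * math.log(n) - 2 * lnl

def lrt_statistic(lnl_null, lnl_alt):
    """Likelihood ratio test statistic for two nested models."""
    return 2 * (lnl_alt - lnl_null)

# Hypothetical (lnL, free parameters) pairs for one alignment.
candidates = {"JC69": (-5230.4, 0), "HKY85": (-5105.9, 4), "GTR": (-5101.2, 8)}
n_sites = 1200  # assumed sample size: number of alignment columns

best_by_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_by_bic = min(candidates, key=lambda m: bic(*candidates[m], n_sites))
```

With these made-up numbers the two criteria disagree: AIC prefers GTR, while BIC, which penalizes extra parameters more heavily, prefers HKY85.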
MetaPIGA 2 allows extensive customization of heuristic and model parameters, as well as manual batch-file and command-line processing. It also offers an extensive, ergonomic graphical user interface with functions that help the user test dataset quality, set parameters, generate and run batch files, follow run progress, and manipulate result trees.
MetaPIGA 2 uses standard formats for data sets and trees, is platform independent, runs on 32- and 64-bit systems, and takes advantage of multiprocessor and/or multicore computers. A version for Grid computing is in development.
 

Citing MetaPIGA 2

MetaPIGA v2.0: maximum likelihood large phylogeny estimation using the metapopulation genetic algorithm and other stochastic heuristics
Raphaël Helaers & Michel C. Milinkovitch
BMC Bioinformatics 2010, 11:379



http://bioinformatics.oxfordjournals.org/content/25/2/197.full

Phylogenetic inference under recombination using Bayesian stochastic topology selection


Abstract

Motivation: Conventional phylogenetic analysis for characterizing the relatedness between taxa typically assumes that a single relationship exists between species at every site along the genome. This assumption fails to take into account recombination, which is a fundamental process for generating diversity and can lead to spurious results. Recombination induces a localized phylogenetic structure which may vary along the genome. Here, we generalize a hidden Markov model (HMM) to infer changes in phylogeny along multiple sequence alignments while accounting for rate heterogeneity; the hidden states refer to the unobserved phylogenetic topology underlying the relatedness at a genomic location. The dimensionality of the number of hidden states (topologies) and their structure are random (not known a priori) and are sampled using Markov chain Monte Carlo algorithms. The HMM structure allows us to analytically integrate over all possible changepoints in topologies as well as all the unknown branch lengths.
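The idea of summing over every possible changepoint can be illustrated with the standard HMM forward recursion, where each hidden state is a candidate topology. This is a simplified sketch and not the authors' Java implementation: the function name, the single `stay_prob` parameter, the uniform prior over topologies, and the precomputed per-column likelihoods are assumptions. Unlike the paper, it neither integrates out branch lengths nor samples the set of topologies by MCMC.

```python
def forward_likelihood(site_liks, stay_prob):
    """Total alignment likelihood summed over all topology paths.

    site_liks[i][k] is P(column i | topology k), assumed precomputed.
    Between adjacent columns the topology stays the same with
    probability stay_prob and otherwise switches uniformly.
    """
    n_states = len(site_liks[0])
    if n_states == 1:
        stay_prob, switch = 1.0, 0.0
    else:
        switch = (1.0 - stay_prob) / (n_states - 1)
    # Uniform prior over topologies at the first column.
    alpha = [lik / n_states for lik in site_liks[0]]
    for column in site_liks[1:]:
        total = sum(alpha)
        alpha = [
            column[k] * (stay_prob * alpha[k] + switch * (total - alpha[k]))
            for k in range(n_states)
        ]
    return sum(alpha)
```

For alignments of realistic length the recursion would need rescaling or log-space arithmetic to avoid numerical underflow; that is omitted here for brevity.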

Results: We demonstrate our approach on simulated data and apply it to the genome of a suspected HIV recombinant strain, as well as to an investigation of recombination in the sequences of 15 laboratory mouse strains sequenced by Perlegen Sciences. Our findings indicate that our method allows us to distinguish between rate heterogeneity and variation in phylogeny caused by recombination without being restricted to 4-taxa data.

Availability: The method has been implemented in JAVA and is available, along with data studied here, from http://www.stats.ox.ac.uk/~webb.

Contact: cholmes@stats.ox.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.



http://www.stats.ox.ac.uk/__data/assets/pdf_file/0005/4010/large_pedigrees.pdf


http://www.cs.cmu.edu/~guestrin/Class/10701-S07/Handouts/recitations/HMM-inference.pdf


Probabilistic Phylogenetic Inference with Insertions and Deletions

Abstract

A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models only account for residue substitution events. We describe a probabilistic model of a multiple sequence alignment that accounts for insertion and deletion events in addition to substitutions, given a phylogenetic tree, using a rate matrix augmented by the gap character. Starting from a continuous Markov process, we construct a non-reversible generative (birth–death) evolutionary model for insertions and deletions. The model assumes that insertion and deletion events occur one residue at a time. We apply this model to phylogenetic tree inference by extending the program DNAML in PHYLIP. Using standard benchmarking methods on simulated data and a new "concordance test" benchmark on real ribosomal RNA alignments, we show that the extended program DNAMLε improves accuracy relative to the usual approach of ignoring gaps, while retaining the computational efficiency of the Felsenstein peeling algorithm.
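As a rough illustration of treating the gap as an extra character inside the Felsenstein peeling (pruning) recursion, the sketch below computes the likelihood of one alignment column on a small rooted tree. It is not the DNAMLε model: the paper uses a non-reversible birth-death process, whereas this stand-in assumes a symmetric five-state model in which every change, including to and from the gap, occurs at the same rate. The tree encoding, function names, and uniform root frequencies are likewise assumptions.

```python
import math

STATES = "ACGT-"  # the gap is treated as a fifth character

def transition_prob(i, j, t, rate=1.0):
    """P(state j after time t | state i) under an equal-rates model."""
    k = len(STATES)
    decay = math.exp(-k * rate * t)
    return (1 + (k - 1) * decay) / k if i == j else (1 - decay) / k

def partial_likelihoods(node, column):
    """Peeling recursion.

    node is a leaf name (str) or a tuple
    (left_child, left_branch, right_child, right_branch);
    column maps leaf names to observed characters.
    """
    if isinstance(node, str):
        return [1.0 if s == column[node] else 0.0 for s in STATES]
    left, t_left, right, t_right = node
    l_part = partial_likelihoods(left, column)
    r_part = partial_likelihoods(right, column)
    n = len(STATES)
    result = []
    for i in range(n):
        l_sum = sum(transition_prob(i, j, t_left) * l_part[j] for j in range(n))
        r_sum = sum(transition_prob(i, j, t_right) * r_part[j] for j in range(n))
        result.append(l_sum * r_sum)
    return result

def site_likelihood(tree, column):
    """Likelihood of one column, with uniform frequencies at the root."""
    return sum(p / len(STATES) for p in partial_likelihoods(tree, column))

# Hypothetical three-taxon tree and one column containing a gap.
tree = (("a", 0.1, "b", 0.1), 0.2, "c", 0.3)
example = site_likelihood(tree, {"a": "A", "b": "A", "c": "-"})
```

Because the gap is just another state here, the cost per column is the same order as ordinary peeling, which is the efficiency property the abstract emphasizes.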

Author Summary

We describe a computationally efficient method to use insertion and deletion events, in addition to substitutions, in phylogenetic inference. To date, many evolutionary models in probabilistic phylogenetic inference methods have only accounted for substitution events, not for insertions and deletions. As a result, not only do tree inference methods use less sequence information than they could, but also it has remained difficult to integrate phylogenetic modeling into sequence alignment methods (such as profiles and profile-hidden Markov models) that inherently require a model of insertion and deletion events. Therefore an important goal in the field has been to develop tractable evolutionary models of insertion/deletion events over time of sufficient accuracy to increase the resolution of phylogenetic inference methods and to increase the power of profile-based sequence homology searches. Our model offers a partial answer to this problem. We show that our model generally improves inference power in both simulated and real data and that it is easily implemented in the framework of standard inference packages with little effect on computational efficiency (we extended DNAML, in Felsenstein's popular PHYLIP package).



Materials and Methods

The C source code for the modified PHYLIP 3.66 package [14] that contains the program DNAMLε, the C source code for evolving sequences with the generative model (εRATE), the modified ROSE package (version 1.3) [76], as well as all the Perl scripts and datasets used to generate the results presented in this paper are provided as a tarball in Dataset S1. The program DNAMLε uses the EASEL sequence analysis library (SRE, unpublished), which is also provided.

Roland F. Schwarz, William Fletcher, Frank Förster, Benjamin Merget, Matthias Wolf, Jörg Schultz, and Florian Markowetz
PLoS One. 2010; 5(12): e15788. Published online 2010 December 31. doi: 10.1371/journal.pone.0015788
PMCID: PMC3013127

Bhakti Dwivedi and Sudhindra R Gadagkar
BMC Evol Biol. 2009; 9: 211. Published online 2009 August 23. doi: 10.1186/1471-2148-9-211
PMCID: PMC2746219


Title: A stochastic evolution model for residue Insertion-Deletion Independent from Substitution
Author(s): Lebre S, Michel CJ
Source: COMPUTATIONAL BIOLOGY AND CHEMISTRY   Volume: 34   Issue: 5-6   Pages: 259-267   Published: DEC 2010

Title: Genomes as documents of evolutionary history
Author(s): Boussau B, Daubin V
Source: TRENDS IN ECOLOGY & EVOLUTION   Volume: 25   Issue: 4   Pages: 224-232   Published: APR 2010


http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2746219/?tool=pmcentrez
Phylogenetic inference under varying proportions of indel-induced alignment gaps
Bhakti Dwivedi (1) and Sudhindra R Gadagkar (1, 2; corresponding author)
(1) Department of Biology, University of Dayton, 300 College Park, Dayton, OH 46469-2320, USA
(2) Department of Natural Sciences, PO Box 1004, 1400 Brush Row Rd, Wilberforce, Ohio 45384, USA
Bhakti Dwivedi: dwivedbz@notes.udayton.edu; Sudhindra R Gadagkar: sgadagkar@centralstate.edu
Received May 11, 2009; Accepted August 23, 2009.
Background
The effect of alignment gaps on phylogenetic accuracy has been the subject of numerous studies. In this study, we investigated the relationship between the total number of gapped sites and phylogenetic accuracy, when the gaps were introduced (by means of computer simulation) to reflect indel (insertion/deletion) events during the evolution of DNA sequences. The resulting (true) alignments were subjected to commonly used gap treatment and phylogenetic inference methods.
Results
(1) In general, there was a strong – almost deterministic – relationship between the amount of gap in the data and the level of phylogenetic accuracy when the alignments were very "gappy", (2) gaps resulting from deletions (as opposed to insertions) contributed more to the inaccuracy of phylogenetic inference, (3) the probabilistic methods (Bayesian, PhyML & "MLε", a method implemented in DNAML in PHYLIP) performed better at most levels of gap percentage when compared to parsimony (MP) and distance (NJ) methods, with Bayesian analysis being clearly the best, (4) methods that treat gapped sites as missing data yielded less accurate trees when compared to those that attribute phylogenetic signal to the gapped sites (by coding them as binary character data – presence/absence, or as in the MLε method), and (5) in general, the accuracy of phylogenetic inference depended upon the amount of available data when the gaps resulted from mainly deletion events, and the amount of missing data when insertion events were equally likely to have caused the alignment gaps.
Conclusion
When gaps in an alignment are a consequence of indel events in the evolution of the sequences, the accuracy of phylogenetic analysis is likely to improve if: (1) alignment gaps are categorized as arising from insertion events or deletion events and then treated separately in the analysis, (2) the evolutionary signal provided by indels is harnessed in the phylogenetic analysis, and (3) methods that utilize the phylogenetic signal in indels are developed for distance methods too. When the true homology is known and the amount of gaps is 20 percent of the alignment length or less, the methods used in this study are likely to yield trees with 90–100 percent accuracy.
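The binary (presence/absence) coding of gapped sites mentioned in the results can be sketched as follows. This is one simple reading of that coding, not the procedure used in the study: the function name, the dictionary-based alignment, and the choice to emit one binary character per gapped column are assumptions.

```python
def code_gaps_as_binary(alignment):
    """Recode gapped columns as presence/absence characters.

    alignment maps sequence names to equal-length aligned strings.
    Returns, per sequence, a string with one character for each
    column that contains at least one gap: "1" where the sequence
    has a residue and "0" where it has a gap.
    """
    sequences = list(alignment.values())
    length = len(sequences[0])
    gapped_cols = [
        i for i in range(length) if any(seq[i] == "-" for seq in sequences)
    ]
    return {
        name: "".join("0" if seq[i] == "-" else "1" for i in gapped_cols)
        for name, seq in alignment.items()
    }

# Hypothetical three-sequence alignment.
toy = {"s1": "AC-GT", "s2": "ACTGT", "s3": "A--GT"}
coded = code_gaps_as_binary(toy)
```

The resulting 0/1 matrix could then be appended to the nucleotide data as an extra partition, so that the indel signal contributes to the tree instead of being treated as missing data.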
 
PICS-Ord: unlimited coding of ambiguous regions by pairwise identity and cost scores ordination
Robert Lücking, Brendan P Hodkinson, Alexandros Stamatakis, and Reed A Cartwright
BMC Bioinformatics. 2011; 12: 10. Published online 2011 January 7. doi: 10.1186/1471-2105-12-10.
PMCID: PMC3024941
Phylogenetic assessment of alignments reveals neglected tree signal in gaps
Christophe Dessimoz and Manuel Gil
Genome Biol. 2010; 11(4): R37. Published online 2010 April 6. doi: 10.1186/gb-2010-11-4-r37.
PMCID: PMC2884540



Stud Health Technol Inform. 2007;129(Pt 2):1245-9.

Enhancing the quality of phylogenetic analysis using fuzzy hidden Markov model alignments.

Source

Lab of Medical Informatics, Faculty of Medicine, Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, Greece.

Abstract

Any effective phylogeny inference based on molecular data begins by performing efficient multiple sequence alignments. So far, the Hidden Markov Model (HMM) method for multiple sequence alignment has proved competitive with the classical deterministic algorithms with respect to phylogenetic analysis; nevertheless, its stochastic nature does not help it cope with the existing dependence among the sequence elements. This paper deals with phylogenetic analysis of protein and gene data using multiple sequence alignments produced by fuzzy profile Hidden Markov Models. Fuzzy profile HMMs are a novel type of profile HMM based on fuzzy sets and fuzzy integrals, which generalize the classical stochastic HMM by relaxing its independence assumptions. In this paper, alignments produced by the fuzzy HMM model are used in phylogenetic analysis of protein data, enhancing the quality of phylogenetic trees. The new methodology is applied to HPV phylogenetic inference. The results of the analysis are compared against those obtained by the classical profile HMM model and demonstrate the superiority of the fuzzy profile HMM in this field.


Bioinformatics. 2005 Sep 1;21 Suppl 2:ii166-72.

Discriminating between rate heterogeneity and interspecific recombination in DNA sequence alignments with phylogenetic factorial hidden Markov models.

Source

Biomathematics and Statistics Scotland, Edinburgh, UK. dirk@bioss.ac.uk

Abstract

MOTIVATION:

A recently proposed method for detecting recombination in DNA sequence alignments is based on the combination of hidden Markov models (HMMs) with phylogenetic trees. Although this method was found to detect breakpoints of recombinant regions more accurately than most existing techniques, it inherently fails to distinguish between recombination and rate variation. In the present paper, we propose to marry the phylogenetic tree to a factorial HMM (FHMM). The states of the first hidden chain represent tree topologies, whereas the states of the second independent hidden chain represent different global scaling factors of the branch lengths. Inference is done in terms of a hierarchical Bayesian model, where parameters and hidden states are sampled from the posterior distribution with Gibbs sampling.
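Because the two hidden chains are independent, their joint transition matrix over (topology, rate scaling) pairs is the Kronecker product of the two individual matrices. The sketch below shows only that structural point; it is not the authors' MATLAB package, and the function name and the numerical transition probabilities are assumptions chosen for illustration.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in row_a for b_kl in row_b]
        for row_a in a
        for row_b in b
    ]

# Hypothetical transition matrices for the two independent chains.
topology_chain = [[0.9, 0.1],
                  [0.1, 0.9]]   # two tree topologies
rate_chain = [[0.8, 0.2],
              [0.3, 0.7]]       # two branch-length scaling factors

# Joint chain over pairs, ordered (topology 0, rate 0), (topology 0, rate 1), ...
joint = kron(topology_chain, rate_chain)
```

A change in only the second coordinate of the joint state then reads as rate variation, while a change in the first reads as a change of topology, which is what lets the factorial model separate the two effects.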

RESULTS:

We have tested the proposed model on various synthetic and real-world DNA sequence alignments. The simulation results suggest that as opposed to the standard phylogenetic HMM, the phylogenetic FHMM clearly distinguishes between recombination and rate heterogeneity and thereby avoids the prediction of spurious recombinant regions.

AVAILABILITY:

The proposed method has been implemented in a MATLAB package that extends Kevin Murphy's HMM toolbox. Software and data used in our study are available from http://www.bioss.sari.ac.uk/~dirk/Supplements