Friday, November 20, 2015

Statistical Tests for Clonality

SUMMARY

Cancer investigators frequently conduct studies to examine tumor samples from pairs of apparently independent primary tumors with a view to determining if they share a “clonal” origin. The genetic fingerprints of the tumors are compared using a panel of markers, often representing loss of heterozygosity (LOH) at distinct genetic loci. In this article we evaluate candidate significance tests for this purpose. The relevant information derives from the observed correlation of the tumors with respect to the occurrence of LOH at individual loci, a phenomenon that can be evaluated using Fisher’s Exact Test. Information is also available from the extent to which losses at the same locus occur on the same parental allele. Data from these combined sources of information can be evaluated using a simple adaptation of Fisher’s Exact Test. The test statistic is the total number of loci at which concordant mutations occur on the same parental allele, with higher values providing more evidence in favor of a clonal origin for the two tumors. The test is shown to have high power for detecting clonality for plausible models of the alternative (clonal) hypothesis, and for reasonable numbers of informative loci, preferably located on distinct chromosomal arms. The method is illustrated using studies to identify clonality in contralateral breast cancer. Interpretation of the results of these tests requires caution due to simplifying assumptions regarding the possible variability in mutation probabilities between loci, and possible imbalances in the mutation probabilities between parental alleles. Nonetheless, we conclude that the method represents a simple, powerful strategy for distinguishing independent tumors from those of clonal origin.
Keywords: Clonality, Permutation test, Second primary cancers
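
As a minimal sketch of the first ingredient named in the summary (the correlation of LOH occurrence across loci, evaluated with Fisher's Exact Test), the Python below computes a one-sided p-value from a 2x2 table. The function names and the 0/1 encoding of LOH calls are assumptions made for illustration. This is the standard test only; it does not implement the allele-specific adaptation that the article proposes.

```python
from math import comb

def loh_table(tumor1, tumor2):
    """Cross-tabulate LOH calls (1 = LOH, 0 = retained) made at the same loci.

    Returns (a, b, c, d): LOH in both, tumor 1 only, tumor 2 only, neither.
    """
    pairs = list(zip(tumor1, tumor2))
    a = sum(1 for x, y in pairs if x == 1 and y == 1)
    b = sum(1 for x, y in pairs if x == 1 and y == 0)
    c = sum(1 for x, y in pairs if x == 0 and y == 1)
    d = sum(1 for x, y in pairs if x == 0 and y == 0)
    return a, b, c, d

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value: P(X >= a) with both margins fixed.

    X follows a hypergeometric distribution under the null hypothesis that
    LOH occurs independently in the two tumors.
    """
    row1 = a + b          # loci with LOH in tumor 1
    col1 = a + c          # loci with LOH in tumor 2
    n = a + b + c + d     # informative loci
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)

# Hypothetical LOH calls at six loci, invented for this example.
tumor1 = [1, 1, 1, 0, 0, 0]
tumor2 = [1, 1, 1, 0, 0, 0]
p_value = fisher_exact_greater(*loh_table(tumor1, tumor2))
print(p_value)
```

A small p-value here says only that LOH tends to hit the same loci in both tumors; which parental allele is lost is ignored in this sketch.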

Clonality: A Package for Clonality testing
Statistical Challenges in Testing Clonal





Molecular Evolution

Inference of Population Splits and Mixtures from Genome-Wide Allele Frequency Data

Abstract

Many aspects of the historical relationships between populations in a species are reflected in genetic data. Inferring these relationships from genetic data, however, remains a challenging task. In this paper, we present a statistical model for inferring the patterns of population splits and mixtures in multiple populations. In our model, the sampled populations in a species are related to their common ancestor through a graph of ancestral populations. Using genome-wide allele frequency data and a Gaussian approximation to genetic drift, we infer the structure of this graph. We applied this method to a set of 55 human populations and a set of 82 dog breeds and wild canids. In both species, we show that a simple bifurcating tree does not fully describe the data; in contrast, we infer many migration events. While some of the migration events that we find have been detected previously, many have not. For example, in the human data, we infer that Cambodians trace approximately 16% of their ancestry to a population ancestral to other extant East Asian populations. In the dog data, we infer that both the boxer and basenji trace a considerable fraction of their ancestry (9% and 25%, respectively) to wolves subsequent to domestication and that East Asian toy breeds (the Shih Tzu and the Pekingese) result from admixture between modern toy breeds and “ancient” Asian breeds. Software implementing the model described here, called TreeMix, is available at http://treemix.googlecode.com.
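
The Gaussian approximation to drift mentioned in the abstract can be illustrated roughly as follows. The sketch assumes the common form in which a descendant allele frequency is drawn from a normal distribution centred on the ancestral frequency x with variance c·x(1−x), where c is a drift parameter. The function names, and the clipping of the result to [0, 1], are assumptions for illustration and are not part of TreeMix.

```python
import random

def drift_variance(x, c):
    """Variance of the descendant frequency under the Gaussian approximation."""
    return c * x * (1.0 - x)

def drift_step(x, c, rng):
    """Draw one descendant allele frequency from N(x, c * x * (1 - x)).

    The draw is clipped to [0, 1]; the clipping is an assumption of this
    sketch, since a normal distribution can leave the unit interval.
    """
    sd = drift_variance(x, c) ** 0.5
    y = rng.gauss(x, sd)
    return min(1.0, max(0.0, y))

rng = random.Random(0)  # seeded only to make the example repeatable
ancestral = 0.3
descendants = [drift_step(ancestral, 0.05, rng) for _ in range(5)]
print(descendants)
```

Larger values of c spread the descendant frequencies further from the ancestral value, which is how more drift along a branch shows up in this approximation.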

Abstract

Phylogenies of highly genetically variable viruses such as HIV-1 are potentially informative of epidemiological dynamics. Several studies have demonstrated the presence of clusters of highly related HIV-1 sequences, particularly among recently HIV-infected individuals, which have been used to argue for a high transmission rate during acute infection. Using a large set of HIV-1 subtype B pol sequences collected from men who have sex with men, we demonstrate that viruses from recent infections tend to be phylogenetically clustered at a greater rate than viruses from patients with chronic infection (‘excess clustering’) and also tend to cluster with other recent HIV infections rather than chronic, established infections (‘excess co-clustering’), consistent with previous reports. To determine the role that a higher infectivity during acute infection may play in excess clustering and co-clustering, we developed a simple model of HIV infection that incorporates an early period of intensified transmission, and explicitly considers the dynamics of phylogenetic clusters alongside the dynamics of acute and chronic infected cases. We explored the potential for clustering statistics to be used for inference of acute stage transmission rates and found that no single statistic explains very much variance in parameters controlling acute stage transmission rates. We demonstrate that high transmission rates during the acute stage are not the main cause of excess clustering of virus from patients with early/acute infection compared to chronic infection, which may simply reflect the shorter time since transmission in acute infection. Higher transmission during acute infection can result in excess co-clustering of sequences, while the extent of clustering observed is most sensitive to the fraction of infections sampled.

A general linear model-based approach for inferring selection to climate


Estimation of Population Genetic Structure Software and Methods

Useful articles for estimating population structure:

On Identifying the Optimal Number of Population Clusters via the Deviance Information Criterion

Detecting correlation between allele frequencies and environmental variables as a signature of selection. A fast computational approach for genome-wide studies

Detecting and measuring selection from gene frequency data












GenClone 2.0

GenClone: a computer program to analyze genotypic data, test for clonality and describe spatial clonal organization
Arnaud-Haond Sophie and Belkhir Khalid
«Team MAREE» - CCMAR, Algarve University, FCMA, Gambelas, 8005-139 Faro, PORTUGAL
«Génome, Populations, Interactions »-Université Montpellier II, Place Eugène Bataillon ; 34090 Montpellier Cedex, FRANCE

Link: GenClone

Monday, July 4, 2011

Novel Phylogenetic Inference Software!

 http://www.metapiga.org/welcome.html
MetaPIGA 2 is a robust implementation of several stochastic heuristics for large phylogeny inference (under maximum likelihood), including a random-restart hill climbing, a simulated annealing algorithm, a classical genetic algorithm, and the metapopulation genetic algorithm (metaGA) together with complex substitution models, discrete Gamma rate heterogeneity, and the possibility to partition data. MetaPIGA 2 handles nucleic-acid and protein datasets as well as morphological (presence/absence) data. The benefits of the metaGA (Lemmon & Milinkovitch 2002; PNAS, 99: 10516-10521) are as follows: (i) it resolves the major problem inherent to classical Genetic Algorithms (i.e., the need to choose between strong selection, hence, speed, and weak selection, hence, accuracy) by maintaining high inter-population variation even under strong intra-population selection, and (ii) it can generate branch support values that approximate posterior probabilities.
The software MetaPIGA 2 also implements:

  • Simple dataset quality control (testing for the presence of identical sequences as well as for excessively ambiguous or excessively divergent sequences);
  • Automated trimming of poorly aligned regions using the trimAl algorithm;
  • The Likelihood Ratio Test, the Akaike Information Criterion, and the Bayesian Information Criterion for the easy selection of nucleotide and amino-acid substitution models that best fit the data;
  • Ancestral-state reconstruction of all nodes in the tree.
MetaPIGA 2 provides high customization of heuristics' and models' parameters, manual batch file and command line processing. However, it also offers an extensive and ergonomic graphical user interface and functionalities assisting the user for dataset quality testing, parameters setting, generating and running batch files, following run progress, and manipulating result trees.
MetaPIGA 2 uses standard formats for data sets and trees, is platform independent, runs on 32- and 64-bit systems, and takes advantage of multiprocessor and/or multicore computers. A version for Grid computing is in development.
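
To make the model-selection criteria listed above concrete, here is a small sketch of how AIC and BIC rank candidate substitution models. It uses the textbook definitions (AIC = 2k − 2 ln L, BIC = k ln n − 2 ln L) and is not MetaPIGA code. The log-likelihoods are invented for the example, the parameter counts exclude branch lengths, and n is taken as the number of alignment sites, which is one common convention and an assumption here.

```python
from math import log

def aic(log_lik, k):
    """Akaike Information Criterion for a model with k free parameters."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian Information Criterion; n is the sample size (here: sites)."""
    return k * log(n) - 2 * log_lik

def best_model(fits, n_sites=None, criterion="aic"):
    """Return (name of the lowest-scoring model, all scores).

    fits maps a model name to (maximised log-likelihood, free parameters).
    """
    scores = {}
    for name, (log_lik, k) in fits.items():
        if criterion == "aic":
            scores[name] = aic(log_lik, k)
        elif criterion == "bic":
            scores[name] = bic(log_lik, k, n_sites)
        else:
            raise ValueError("criterion must be 'aic' or 'bic'")
    return min(scores, key=scores.get), scores

# Invented log-likelihoods; parameter counts exclude branch lengths.
fits = {"JC69": (-1050.0, 0), "HKY85": (-1020.0, 4), "GTR": (-1018.5, 8)}
aic_choice, aic_scores = best_model(fits, criterion="aic")
bic_choice, bic_scores = best_model(fits, n_sites=500, criterion="bic")
print(aic_choice, bic_choice)
```

Both criteria reward a higher likelihood and penalise extra parameters; BIC penalises them more heavily as the alignment grows.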
 

Citing MetaPIGA 2

MetaPIGA v2.0: maximum likelihood large phylogeny estimation using the metapopulation genetic algorithm and other stochastic heuristics
Raphaël Helaers & Michel C. Milinkovitch
BMC Bioinformatics 2010, 11:379



http://bioinformatics.oxfordjournals.org/content/25/2/197.full

Phylogenetic inference under recombination using Bayesian stochastic topology selection


Abstract

Motivation: Conventional phylogenetic analysis for characterizing the relatedness between taxa typically assumes that a single relationship exists between species at every site along the genome. This assumption fails to take into account recombination which is a fundamental process for generating diversity and can lead to spurious results. Recombination induces a localized phylogenetic structure which may vary along the genome. Here, we generalize a hidden Markov model (HMM) to infer changes in phylogeny along multiple sequence alignments while accounting for rate heterogeneity; the hidden states refer to the unobserved phylogenetic topology underlying the relatedness at a genomic location. The dimensionality of the number of hidden states (topologies) and their structure are random (not known a priori) and are sampled using Markov chain Monte Carlo algorithms. The HMM structure allows us to analytically integrate out over all possible changepoints in topologies as well as all the unknown branch lengths.

Results: We demonstrate our approach on simulated data and also apply it to the genome of a suspected HIV recombinant strain and to an investigation of recombination in the sequences of 15 laboratory mouse strains sequenced by Perlegen Sciences. Our findings indicate that our method allows us to distinguish between rate heterogeneity and variation in phylogeny caused by recombination without being restricted to 4-taxa data.

Availability: The method has been implemented in JAVA and is available, along with data studied here, from http://www.stats.ox.ac.uk/~webb.

Contact: cholmes@stats.ox.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.
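
The idea of summing over all possible changepoints in topology can be illustrated with the ordinary HMM forward algorithm. The sketch below assumes a fixed, known set of topologies and takes the per-site likelihood of each alignment column under each topology as given; it is not the stochastic topology selection sampler described in the abstract, and all names and numbers are hypothetical.

```python
def forward_likelihood(site_lik, trans, init):
    """Likelihood of an alignment, summed over all hidden topology paths.

    site_lik[t][k]: likelihood of alignment column t under topology k
    trans[i][j]:    probability of moving from topology i to j between sites
    init[k]:        probability of topology k at the first site

    No rescaling is done, so very long alignments would underflow.
    """
    n_states = len(init)
    alpha = [init[k] * site_lik[0][k] for k in range(n_states)]
    for t in range(1, len(site_lik)):
        alpha = [
            site_lik[t][j] * sum(alpha[i] * trans[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return sum(alpha)

# Two candidate topologies and three alignment columns (invented numbers).
site_lik = [[0.20, 0.40], [0.50, 0.10], [0.30, 0.30]]
trans = [[0.95, 0.05], [0.05, 0.95]]
init = [0.5, 0.5]
print(forward_likelihood(site_lik, trans, init))
```

Each off-diagonal entry of trans plays the role of a recombination breakpoint between neighbouring sites; the forward recursion sums over every placement of such breakpoints without enumerating them.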



http://www.stats.ox.ac.uk/__data/assets/pdf_file/0005/4010/large_pedigrees.pdf


http://www.cs.cmu.edu/~guestrin/Class/10701-S07/Handouts/recitations/HMM-inference.pdf


Probabilistic Phylogenetic Inference with Insertions and Deletions

Abstract

A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models only account for residue substitution events. We describe a probabilistic model of a multiple sequence alignment that accounts for insertion and deletion events in addition to substitutions, given a phylogenetic tree, using a rate matrix augmented by the gap character. Starting from a continuous Markov process, we construct a non-reversible generative (birth–death) evolutionary model for insertions and deletions. The model assumes that insertion and deletion events occur one residue at a time. We apply this model to phylogenetic tree inference by extending the program DNAML in PHYLIP. Using standard benchmarking methods on simulated data and a new "concordance test" benchmark on real ribosomal RNA alignments, we show that the extended program DNAMLε improves accuracy relative to the usual approach of ignoring gaps, while retaining the computational efficiency of the Felsenstein peeling algorithm.
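
A rate matrix augmented by the gap character can be pictured as a 5x5 matrix over A, C, G, T and "-". The sketch below is only an illustration of that shape: it assumes one equal substitution rate between residues (Jukes-Cantor style), a single deletion rate (residue to gap) and a single insertion rate (gap to residue, split equally among the four residues). This parametrisation and the function name are assumptions and are not taken from the paper.

```python
STATES = ["A", "C", "G", "T", "-"]

def augmented_rate_matrix(sub_rate, del_rate, ins_rate):
    """Build a 5x5 continuous-time rate matrix over residues plus the gap.

    Off-diagonal entries are rates; each diagonal entry is set so that its
    row sums to zero, as required for a rate matrix.
    """
    n = len(STATES)
    q = [[0.0] * n for _ in range(n)]
    for i, src in enumerate(STATES):
        for j, dst in enumerate(STATES):
            if i == j:
                continue
            if src != "-" and dst != "-":
                q[i][j] = sub_rate          # residue -> different residue
            elif dst == "-":
                q[i][j] = del_rate          # residue -> gap (deletion)
            else:
                q[i][j] = ins_rate / 4.0    # gap -> residue (insertion)
        q[i][i] = -sum(q[i])                # diagonal still 0.0 at this point
    return q

q = augmented_rate_matrix(sub_rate=1.0, del_rate=0.1, ins_rate=0.2)
for state, row in zip(STATES, q):
    print(state, row)
```

With insertion and deletion rates that differ, the matrix is generally not reversible, which matches the abstract's description of the model as non-reversible.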

Author Summary

We describe a computationally efficient method to use insertion and deletion events, in addition to substitutions, in phylogenetic inference. To date, many evolutionary models in probabilistic phylogenetic inference methods have only accounted for substitution events, not for insertions and deletions. As a result, not only do tree inference methods use less sequence information than they could, but also it has remained difficult to integrate phylogenetic modeling into sequence alignment methods (such as profiles and profile-hidden Markov models) that inherently require a model of insertion and deletion events. Therefore an important goal in the field has been to develop tractable evolutionary models of insertion/deletion events over time of sufficient accuracy to increase the resolution of phylogenetic inference methods and to increase the power of profile-based sequence homology searches. Our model offers a partial answer to this problem. We show that our model generally improves inference power in both simulated and real data and that it is easily implemented in the framework of standard inference packages with little effect on computational efficiency (we extended DNAML, in Felsenstein's popular PHYLIP package).



Materials and Methods

The C source code for the modified PHYLIP 3.66 package [14] that contains the program DNAMLε, the C source code for evolving sequences with the generative model (εRATE), the modified ROSE package (version 1.3) [76], as well as all the Perl scripts and datasets used to generate the results presented in this paper are provided as a tarball in Dataset S1. The program DNAMLε uses the EASEL sequence analysis library (SRE, unpublished) which is also provided.

Roland F. Schwarz, William Fletcher, Frank Förster, Benjamin Merget, Matthias Wolf, Jörg Schultz, and Florian Markowetz
PLoS One. 2010; 5(12): e15788. Published online 2010 December 31. doi: 10.1371/journal.pone.0015788
PMCID: PMC3013127

Bhakti Dwivedi and Sudhindra R Gadagkar
BMC Evol Biol. 2009; 9: 211. Published online 2009 August 23. doi: 10.1186/1471-2148-9-211
PMCID: PMC2746219


Title: A stochastic evolution model for residue Insertion-Deletion Independent from Substitution
Author(s): Lebre S, Michel CJ
Source: COMPUTATIONAL BIOLOGY AND CHEMISTRY   Volume: 34   Issue: 5-6   Pages: 259-267   Published: DEC 2010

Title: Genomes as documents of evolutionary history
Author(s): Boussau B, Daubin V
Source: TRENDS IN ECOLOGY & EVOLUTION   Volume: 25   Issue: 4   Pages: 224-232   Published: APR 2010


http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2746219/?tool=pmcentrez
Phylogenetic inference under varying proportions of indel-induced alignment gaps
Bhakti Dwivedi (1) and Sudhindra R Gadagkar (1,2; corresponding author)
(1) Department of Biology, University of Dayton, 300 College Park, Dayton, OH 46469-2320, USA
(2) Department of Natural Sciences, PO Box 1004, 1400 Brush Row Rd, Wilberforce, Ohio 45384, USA
Bhakti Dwivedi: dwivedbz@notes.udayton.edu; Sudhindra R Gadagkar: sgadagkar@centralstate.edu
Received May 11, 2009; Accepted August 23, 2009.
Background
The effect of alignment gaps on phylogenetic accuracy has been the subject of numerous studies. In this study, we investigated the relationship between the total number of gapped sites and phylogenetic accuracy, when the gaps were introduced (by means of computer simulation) to reflect indel (insertion/deletion) events during the evolution of DNA sequences. The resulting (true) alignments were subjected to commonly used gap treatment and phylogenetic inference methods.
Results
(1) In general, there was a strong – almost deterministic – relationship between the amount of gap in the data and the level of phylogenetic accuracy when the alignments were very "gappy", (2) gaps resulting from deletions (as opposed to insertions) contributed more to the inaccuracy of phylogenetic inference, (3) the probabilistic methods (Bayesian, PhyML & "MLε", a method implemented in DNAML in PHYLIP) performed better at most levels of gap percentage when compared to parsimony (MP) and distance (NJ) methods, with Bayesian analysis being clearly the best, (4) methods that treat gapped sites as missing data yielded less accurate trees when compared to those that attribute phylogenetic signal to the gapped sites (by coding them as binary character data – presence/absence, or as in the MLε method), and (5) in general, the accuracy of phylogenetic inference depended upon the amount of available data when the gaps resulted from mainly deletion events, and the amount of missing data when insertion events were equally likely to have caused the alignment gaps.
Conclusion
When gaps in an alignment are a consequence of indel events in the evolution of the sequences, the accuracy of phylogenetic analysis is likely to improve if: (1) alignment gaps are categorized as arising from insertion events or deletion events and then treated separately in the analysis, (2) the evolutionary signal provided by indels is harnessed in the phylogenetic analysis, and (3) methods that utilize the phylogenetic signal in indels are developed for distance methods too. When the true homology is known and the amount of gaps is 20 percent of the alignment length or less, the methods used in this study are likely to yield trees with 90–100 percent accuracy.
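
The presence/absence coding of gapped sites mentioned in the results can be sketched as below. This is a deliberately simple per-column version written for illustration: every alignment column that contains at least one gap becomes one binary character (1 = residue present, 0 = gap). The function name is hypothetical, and coding schemes that treat a multi-column indel as a single character also exist; this sketch does not attempt that.

```python
def code_gaps_binary(alignment, gap="-"):
    """Code gapped columns of an alignment as binary characters.

    alignment: list of equal-length aligned sequences (strings).
    Returns one 0/1 string per sequence, with one character for every
    column that contains at least one gap.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    gapped_cols = [col for col in range(length)
                   if any(seq[col] == gap for seq in alignment)]
    return ["".join("0" if seq[col] == gap else "1" for col in gapped_cols)
            for seq in alignment]

alignment = ["AC-GT", "A--GT", "ACTGT"]
binary = code_gaps_binary(alignment)
print(binary)  # ['10', '00', '11']
```

The resulting binary matrix could then be analysed alongside the nucleotide data, instead of treating the gapped sites as missing.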
 
PICS-Ord: unlimited coding of ambiguous regions by pairwise identity and cost scores ordination
Robert Lücking, Brendan P Hodkinson, Alexandros Stamatakis, and Reed A Cartwright
BMC Bioinformatics. 2011; 12: 10. Published online 2011 January 7. doi: 10.1186/1471-2105-12-10.
PMCID: PMC3024941
Phylogenetic assessment of alignments reveals neglected tree signal in gaps
Christophe Dessimoz and Manuel Gil
Genome Biol. 2010; 11(4): R37. Published online 2010 April 6. doi: 10.1186/gb-2010-11-4-r37.
PMCID: PMC2884540



Stud Health Technol Inform. 2007;129(Pt 2):1245-9.

Enhancing the quality of phylogenetic analysis using fuzzy hidden Markov model alignments.

Source

Lab of Medical Informatics, Faculty of Medicine, Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, Greece.

Abstract

Any effective phylogeny inference based on molecular data begins by performing efficient multiple sequence alignments. So far, the Hidden Markov Model (HMM) method for multiple sequence alignment has been proved competitive to the classical deterministic algorithms with respect to phylogenetic analysis; nevertheless, its stochastic nature does not help it cope with the existing dependence among the sequence elements. This paper deals with phylogenetic analysis of protein and gene data using multiple sequence alignments produced by fuzzy profile Hidden Markov Models. Fuzzy profile HMMs are a novel type of profile HMMs based on fuzzy sets and fuzzy integrals, which generalize the classical stochastic HMM by relaxing its independence assumptions. In this paper, alignments produced by the fuzzy HMM model are used in phylogenetic analysis of protein data, enhancing the quality of phylogenetic trees. The new methodology is implemented in HPV virus phylogenetic inference. The results of the analysis are compared against those obtained by the classical profile HMM model and depict the superiority of the fuzzy profile HMM in this field.


Bioinformatics. 2005 Sep 1;21 Suppl 2:ii166-72.

Discriminating between rate heterogeneity and interspecific recombination in DNA sequence alignments with phylogenetic factorial hidden Markov models.

Source

Biomathematics and Statistics Scotland, Edinburgh, UK. dirk@bioss.ac.uk

Abstract

MOTIVATION:

A recently proposed method for detecting recombination in DNA sequence alignments is based on the combination of hidden Markov models (HMMs) with phylogenetic trees. Although this method was found to detect breakpoints of recombinant regions more accurately than most existing techniques, it inherently fails to distinguish between recombination and rate variation. In the present paper, we propose to marry the phylogenetic tree to a factorial HMM (FHMM). The states of the first hidden chain represent tree topologies, whereas the states of the second independent hidden chain represent different global scaling factors of the branch lengths. Inference is done in terms of a hierarchical Bayesian model, where parameters and hidden states are sampled from the posterior distribution with Gibbs sampling.

RESULTS:

We have tested the proposed model on various synthetic and real-world DNA sequence alignments. The simulation results suggest that as opposed to the standard phylogenetic HMM, the phylogenetic FHMM clearly distinguishes between recombination and rate heterogeneity and thereby avoids the prediction of spurious recombinant regions.

AVAILABILITY:

The proposed method has been implemented in a MATLAB package that extends Kevin Murphy's HMM toolbox. Software and data used in our study are available from http://www.bioss.sari.ac.uk/~dirk/Supplements
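
Because the two hidden chains of the factorial HMM are described as independent, their joint transition matrix over (topology, rate scale) pairs is the Kronecker product of the two individual transition matrices. The Python sketch below shows only that construction; it is not the MATLAB package from the paper, and the matrices and names are invented for the example.

```python
def kron(a, b):
    """Kronecker product of two square matrices given as lists of lists.

    Row (i, k) and column (j, l) of the result hold a[i][j] * b[k][l],
    with the index of `a` varying slowest.
    """
    return [
        [a_ij * b_kl for a_ij in row_a for b_kl in row_b]
        for row_a in a
        for row_b in b
    ]

# Chain 1: two tree topologies. Chain 2: two branch-length scaling factors.
topology_trans = [[0.9, 0.1], [0.2, 0.8]]
rate_trans = [[0.7, 0.3], [0.4, 0.6]]

joint_trans = kron(topology_trans, rate_trans)
for row in joint_trans:
    print(row)
```

The joint chain has 2 x 2 = 4 states, so a change of topology (recombination) and a change of rate scale (rate heterogeneity) are represented by different state coordinates rather than being confounded in a single chain.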


Tuesday, December 1, 2009


Thursday, July 9, 2009

allele determination

http://crop.scijournals.org/cgi/content/full/46/5/2084
Crop Science PLANT GENETIC RESOURCES
Accuracy and Reliability of High-Throughput Microsatellite Genotyping for Cacao Clone Identification


http://www-naweb.iaea.org/nafa/aph/stories/dna-manual.pdf
A practical approach to microsatellite genotyping with special reference to livestock population genetics