Bioinformatics
& Software
PUBLICATIONS
‣Bioinformatics, 17: 573-574 (2001)
‣PNAS, 99: 10516-10521 (2002)
‣Bioinformatics, 19: 1505-1513 (2003)
‣Systematic Biology, 54: 363-372 (2005)
‣Bioinformatics, 22: 115-116 (2006)
‣Bioinformatics, 22: 708-715 (2006)
‣Evolutionary Bioinformatics Online, 2: 153-163 (2006)
‣BMC Evolutionary Biology, 7: 228 (2007)
‣Bioinformatics, 24: 151-157 (2008)
SOFTWARE
‣ProAlign:
Check out our probabilistic multiple-alignment
program for molecular sequence data
‣Other software:
Check the list of our other programs at the beginning of this page
MAIN COLLABORATORS
Alan Lemmon
Did a pre-doctoral stay in Michel Milinkovitch’s lab and is now Assistant Professor at the University of Florida
Toni Gabaldón
Is group leader at the CRG-Centre for Genomic Regulation, Barcelona, Spain
List of software (see background info below)
Versatile and easy to use -- For all types of users -- GPU and multicore support
MetaPIGA is a versatile and easy-to-use software that implements robust stochastic heuristics (including the Metapopulation Genetic Algorithm, metaGA) for large phylogeny inference under maximum likelihood. MetaPIGA allows analyses of binary and molecular data sets under multiple substitution models, Gamma rate heterogeneity, and data partitioning. The software is for all types of users as it can be run through an extensive and ergonomic graphical interface, or by using batch files and a console interface on your local machine or on distant servers. MetaPIGA is platform independent, runs on 32- and 64-bit systems, and easily takes advantage of multiprocessor and/or multicore computers.
Version 3 of MetaPIGA includes new functionalities such as:
✓Maximum likelihood models for codon evolution (Goldman-Yang 1994 and empirical models) with access to multiple genetic codes (universal, mitochondrial, chloroplast, etc.);
✓Likelihood computation on CUDA-compatible Nvidia graphics cards (reducing run time by a factor of 10 to 20 for protein/codon data);
✓Ancestral-state reconstruction using empirical Bayesian inference.
Some of the other functionalities implemented in MetaPIGA are:
•Simple data quality control (testing for the presence of identical sequences as well as of excessively ambiguous or excessively-divergent sequences);
•Automated trimming of poorly aligned regions using the trimAl algorithm;
•The Likelihood Ratio Test, the Akaike Information Criterion, and the Bayesian Information Criterion methods for easy selection of the substitution model that best fits your data;
•Detailed monitoring of run progress;
•Convergence statistics for automatically defining when an analysis is complete;
•Ancestral-state reconstruction of all nodes in the tree;
•Viewing and manipulation of result trees.
MetaPIGA v.3 and its manual can be downloaded from the new website at www.metapiga.org.
Don’t hesitate to contact us (Dorde.Grbic-at-unige.ch or Michel.Milinkovitch-at-unige.ch) for additional questions, assistance, or bug reports.
‣TIQ
A simple Java Web Start application that allows browsing phylogenetic trees of toxins (from bacterial Toxin-Antitoxin systems). Check below for much additional information.
‣ProAlign
A software implementing a multiple-alignment method that combines a Hidden Markov model (HMM), a progressive alignment algorithm, and a probabilistic character substitution model. Check below for much additional information.
An older (non-statistical) approach for the identification of alignment positions that are unstable to variations of alignment costs was implemented in the program SOAP, available HERE.
‣CombineTrees
A software implementing an intra-specific network inference method (called ‘UMP’ for ‘Union of Most Parsimonious trees’) that unites all MP trees into a single (possibly reticulated) graph. See our “Genealogical Network Inference” page for much additional information.
‣Oligofaktory
A set of tools for the design, on an arbitrary number of target sequences, of high-quality long oligonucleotides for micro-arrays, primer pairs for PCR, siRNAs, and more. Check below for additional information.
‣LANE private software
A series of programs for assisting ongoing research in our lab. Available only through the intranet.
Phylogeny inference
Optimality-criterion based phylogeny inference is a notoriously difficult endeavour because the number of solutions increases explosively (factorially) with the number of taxa. Indeed, the total number of possible unrooted, bifurcating tree topologies among T terminal taxa is
This corresponds to nearly 32 billion different trees for 14 taxa and 3x10^84 trees (i.e., more than the number of atoms in the known universe) for 55 taxa. Phylogeny inference is a NP-hard combinatorial optimization problem because (i) no known algorithm can solve it in polynomial time, and (ii) demonstrating the existence of such an algorithm would imply that all NP problems have a polynomial-time solution. As most mathematicians expect that no such algorithm exists, one is forced to admit that no future civilization will ever build a computer capable of solving the problem while guaranteeing that the optimal solution has been found. In other words, the problem will not be resolved by future increase in computing power. This makes the task of phylogeneticists difficult because: (i) major advances in molecular biology and biotechnology have caused a dramatic increase in the number of DNA sequences available, stimulating researchers to increase their taxon sampling when performing tree reconstruction, (ii) several studies (including some of ours) have suggested that the accuracy of phylogeny inference increases with increased taxon sampling, and (iii) the most accurate methods of phylogeny estimation (those based on maximum likelihood) are also the most computation-demanding. Given the tremendous number of new questions in evolutionary biology that could be investigated through the use of larger taxon samplings, most researchers are ready to give up the quest for the absolute optimal tree, opting instead for the ability to analyze large data sets in practical computing times, provided that these methods yield optimal or near-optimal solutions with high probability. In response to this trend, much of the current research in phyloinformatics (i.e., computational phylogenetics) concentrates on the development of more efficient heuristic approaches.
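The double-factorial count above is easy to verify directly; a minimal sketch (the function name is ours):

```python
def num_unrooted_trees(t):
    """Return (2t-5)!! = 1 * 3 * 5 * ... * (2t-5), the number of
    unrooted, bifurcating topologies for t >= 3 terminal taxa."""
    count = 1
    for k in range(3, 2 * t - 4, 2):  # odd factors 3, 5, ..., 2t-5
        count *= k
    return count

print(num_unrooted_trees(4))   # 3
print(num_unrooted_trees(14))  # 316234143225
print(f"{num_unrooted_trees(55):.2e}")  # on the order of 10^84
```

Adding a single taxon multiplies the count by (2t − 3), which is why exhaustive search is hopeless beyond a handful of taxa.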
Many existing heuristic methods are handicapped by the need to optimize model parameters (such as branch lengths) on each tree examined, a procedure that significantly slows computation. Because of the need to optimize parameters at every step, these methods remain computationally impractical for more than 25-50 taxa, depending on the complexity of the ML model. Two classes of solutions have been developed in response to the problems associated with optimizing parameters on large trees. The first class includes solutions that partition the large problem of tree reconstruction into many small subproblems whose solutions are then combined into a consensus global solution (for example, the quartet puzzling method implemented in Tree-Puzzle). The second class comprises stochastic heuristics that avoid optimization of numerous trees entirely. Instead, they incorporate methods that allow branch lengths and other model parameters to be optimized as the search proceeds, taking an inter-step optimization strategy. Stochastic simulated annealing (SSA), for example, is based on a simple branch-swapping algorithm, but incorporates a method of perturbing branch lengths at each iteration instead of requiring optimization of each potential solution. SSA avoids local optima by accepting changes that decrease the likelihood of the tree with a probability that shrinks as the reduction in likelihood grows. Another, very popular, approach is the Bayesian method based on the Markov chain Monte Carlo (MCMC) algorithm. MCMC-based methods (such as that implemented in the software MrBayes) also benefit from avoiding intra-step optimization, although they have a slightly different aim: sampling the distribution of tree space instead of only finding optimal trees.
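The SSA acceptance rule just described can be sketched as follows; the Metropolis-style exponential form and all names are our assumptions for illustration, not SSA's or MetaPIGA's actual implementation:

```python
import math
import random

def accept_move(cur_lnL, new_lnL, temperature):
    """Simulated-annealing acceptance rule (Metropolis criterion):
    always accept a tree that improves the log-likelihood; accept a
    worse tree with a probability that shrinks exponentially with the
    size of the log-likelihood drop (scaled by the temperature)."""
    if new_lnL >= cur_lnL:
        return True
    return random.random() < math.exp((new_lnL - cur_lnL) / temperature)
```

As the temperature is lowered during the run, the chain becomes increasingly reluctant to accept worse trees, which is how annealing escapes local optima early while converging late.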
Finally, Matsuda (1996), Lewis (1998), and Katoh et al. (2001) have recently applied the genetic algorithm (GA), a type of evolutionary computation method, to phylogeny inference. GAs implement a set of operators that mimic processes of biological evolution such as mutation, recombination, selection, and reproduction. After an initial step of generating a population, the individuals (specific solutions) within that population are:
(i) subjected to mutation and/or recombination, and
(ii) allowed to reproduce with a probability that is a function of their relative fitness value.
In the case of a phylogenetic inference problem, individuals are typically composed of topologies and model parameters (e.g., branch lengths, the transition/transversion ratio, rate heterogeneity parameters, etc.) that need to be optimized. A mutation is a stochastic alteration of one component of the individual (e.g., a topological rearrangement, a change in one branch length, or a random modification of a model parameter), and the fitness of an individual is a function of the score for that tree. Because selection retains the changes that improve the value of the optimality function, the mean score of the population of trees improves over time, i.e., across generations.
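The mutate-then-reproduce cycle described above can be sketched on a toy optimization problem. This is a generic GA skeleton under our own naming: a real phylogenetic GA would mutate topologies and branch lengths and use the tree's log-likelihood as fitness, not a one-dimensional toy function.

```python
import random

def evolve(fitness, pop, generations=100, mut_rate=0.5, mut_size=0.1):
    """Toy genetic algorithm: individuals are lists of real-valued
    'model parameters'; each generation applies (i) stochastic mutation
    and (ii) fitness-proportional reproduction."""
    for _ in range(generations):
        # (i) mutation: stochastically perturb one component of an individual
        for ind in pop:
            if random.random() < mut_rate:
                i = random.randrange(len(ind))
                ind[i] += random.gauss(0.0, mut_size)
        # (ii) reproduction with probability proportional to relative fitness
        weights = [fitness(ind) for ind in pop]
        pop = [list(ind) for ind in random.choices(pop, weights=weights, k=len(pop))]
    return max(pop, key=fitness)

random.seed(1)
# toy fitness peaked at parameter value 2.0
best = evolve(lambda ind: 1.0 / (1.0 + (ind[0] - 2.0) ** 2),
              [[random.uniform(-5.0, 5.0)] for _ in range(30)])
```

Because selection preferentially copies high-fitness individuals, the mean score of the population improves across generations, exactly as described in the text.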
The metaGA (metapopulation Genetic Algorithm)
Figure: Genetic algorithms are much faster than classical hill climbing methods. (a) Relative run times vs. number of taxa for the GA (one population), metaGA and the Stepwise Addition algorithm as well as for a single round of NNI, SPR, and TBR swapping (dotted lines), under the JC (blue) and HKY rate heterogeneity (red) ML models. Standard errors (10 runs) are indicated for the GA and metaGA only. (b) Ratio between StepAdd and average metaGA run times (vs. number of taxa) under the JC (Lower) and HKY rate heterogeneity (Upper) ML models.
Figure: CP increases the speed of phylogeny inference. The graph shows the score of the best tree (320 taxa) for each of four populations run in parallel (no CP). At the time indicated by the red arrow, one population (red) was allowed to use the consensus information from the three other populations (black). Scores increase faster when CP is enabled.
Figure: A set of multiple metaGA searches produces trees and clades with frequencies that closely approximate their posterior probabilities, making the results of such a metaGA comparable to those provided by Bayesian MCMCMC approaches (such as that implemented in MrBayes). The figure shows metaGA branch support values (10 runs with probability CP among 10 populations; each dot therefore indicates the number of trees, among the 100 best trees produced, exhibiting a given partition) vs. (a) bootstrap values (100 replicates; random starting tree, SPR branch swapping) and (b) MrBayes posterior probabilities of partitions. A 55-taxon rbcL data set was analyzed under the JC model.
Check our full publication for much additional information (and references):
✓Lemmon A. R. & M. C. Milinkovitch
The metapopulation genetic algorithm: an efficient solution for the problem of large phylogeny estimation
PNAS, 99: 10516-10521 (2002)
MetaPIGA version 2
✓Is much more robust and stable than version 1;
✓Incorporates DNA and protein substitution models (including GTR);
✓Allows data partitioning;
✓Allows batch searches;
✓Incorporates multiple functionalities and an extensive graphical interface.
MetaPIGA version 3
✓Maximum likelihood models for codon evolution (Goldman-Yang 1994 and empirical models) with access to multiple genetic codes (universal, mitochondrial, chloroplast, etc.);
✓Likelihood computation on CUDA-compatible Nvidia graphics cards (reducing run time by a factor of 10 to 20 for protein/codon data);
✓Ancestral-state reconstruction using empirical Bayesian inference.
An Ant Colony Optimization (ACO) algorithm for phylogenetic estimation
Figure: Principle of the ACO algorithm. The core iteration includes three main steps: (i) the pheromone update phase during which artificial ants walk on a graph with all possible connections among the n taxa and (n - 2) internal nodes, and lay a trail of volatile pheromone on the branches of the starting tree; (ii) the stochastic construction phase during which new trees are built using both the heuristic information of the pairwise distances and the stochastic process guided by the newly-updated pheromone trail matrix (ants follow a given edge with a probability which is a function of the amount of pheromone on that edge); and (iii) the 2-OPT local search phase that corresponds to a local search using taxon swapping. The curved arrow indicates the stochastic jump of an ant from one edge to another.
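Steps (i) and (ii) of the core iteration can be sketched as follows. The proportional-selection rule with exponents alpha and beta is the standard ACO form; the function names, parameter values, and data layout are illustrative assumptions, not the paper's actual implementation:

```python
import random

def choose_edge(edges, pheromone, heuristic, alpha=1.0, beta=2.0):
    """ACO stochastic construction step: an ant follows edge e with
    probability proportional to pheromone[e]**alpha * heuristic[e]**beta
    (heuristic[e] would encode pairwise-distance information)."""
    weights = [pheromone[e] ** alpha * heuristic[e] ** beta for e in edges]
    return random.choices(edges, weights=weights, k=1)[0]

def update_pheromone(pheromone, tour_edges, deposit=1.0, evaporation=0.1):
    """ACO pheromone-update step: all trails evaporate ('volatile'
    pheromone), then the edges of a good solution are reinforced."""
    for e in pheromone:
        pheromone[e] *= (1.0 - evaporation)
    for e in tour_edges:
        pheromone[e] += deposit
```

Evaporation prevents early solutions from dominating forever, while reinforcement concentrates the search on edges that repeatedly appear in good trees.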
Much additional information is available in the original publication:
✓Catanzaro D., Pesenti R & M. C. Milinkovitch
An Ant Colony Optimization Algorithm for Phylogeny Estimation under the Minimum Evolution Principle
BMC Evolutionary Biology 7:228 (2007)
Multiple-sequence alignment
Optimal multiple alignment of molecular sequences is to homology estimation what an exact search is to tree inference: an NP-hard problem. Hence, it is, and will always be, computationally impractical for more than a few sequences. Yet, while new heuristics have been continuously developed for phylogeny inference, the problem of multiple alignment has been much less thoroughly investigated. Most researchers today choose to ignore the issue and widely use profile-based variants of progressive alignment methods. These approaches however require the subjective choice of insertion/deletion and substitution costs whose variations, even minimal, can dramatically impact the resulting alignment. One partial solution to this problem is provided by our software SOAP 1.0 (written in Java and available HERE), which allows for the identification of alignment positions that are unstable to variations of these costs. We then developed a multiple-alignment method that combines a hidden Markov model (HMM), a progressive alignment algorithm, and a probabilistic character substitution model. This method, implemented in the program ProAlign 1.0 (available HERE), allows computing the posterior probability of each alignment column. Gardner et al. (Nucleic Acids Research, 33: 2433-2439 [2005]) suggested that ProAlign yields the most accurate results when aligning structural RNA sequences. The method implemented in ProAlign has since been extended in the software PRANK (developed in Nick Goldman’s group at EMBL-EBI). We do not intend to continue developing alignment methods in the near future.
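The sensitivity to insertion/deletion costs that SOAP screens for is easy to demonstrate with a minimal pairwise global-alignment sketch (a textbook Needleman-Wunsch, not ProAlign's HMM; the scoring values are arbitrary assumptions):

```python
def align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment of sequences a and b.
    Returns the two gapped strings. Re-running with a different `gap`
    penalty on the same sequences can shift entire columns, which is
    the alignment instability SOAP was designed to detect."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]  # DP table of best scores
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub, S[i - 1][j] + gap, S[i][j - 1] + gap)
    # traceback, preferring diagonal (substitution) moves
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b))
```

Comparing the columns produced under several gap penalties, and flagging those that change, is essentially the stability test SOAP applies to whole multiple alignments.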
Much additional information is available in the original publications:
✓Löytynoja A. & M. C. Milinkovitch
SOAP, a program for cleaning multiple alignments from unstable blocks
Bioinformatics, 17: 573-574 (2001)
✓Löytynoja A. & M. C. Milinkovitch
A hidden Markov model for progressive multiple alignment
Bioinformatics, 19: 1505–1513 (2003)
Check also Gardner et al. (2005): Nucleic Acids Research, Vol. 33, 2433-2439. They performed a benchmark of multiple sequence alignment programs on structural RNAs, and their analyses indicate that ProAlign generally outperforms other sequence-based algorithms under the broadest range of applications.
Nucleotide substitution models
The general-time-reversible (GTR) model is one of the most popular models of nucleotide substitution because it constitutes a good trade-off between mathematical tractability and biological realism. However, when it is applied for inferring evolutionary distances and/or instantaneous rate matrices, the GTR model seems more prone to inapplicability than more restrictive time-reversible models. Although it had been previously noted that this inapplicability is caused by the impossibility of computing the logarithm of a matrix characterised by negative eigenvalues, the issue had not been investigated further. Hence, we formally characterized the mathematical conditions that lead to the inapplicability of the GTR model, and discussed their biological interpretation. We investigated the relations between, on the one hand, the occurrence of negative eigenvalues and, on the other hand, both sequence length and sequence divergence. We then proposed a possible re-formulation of previous procedures in terms of a non-linear optimization problem. We analytically investigated the effect of our approach on the estimated evolutionary distances and transition probability matrix, and provided an analysis of the goodness of the solution we proposed. Finally, we simulated the evolution of DNA sequences along trees characterized by different combinations of tree length, (non-)homogeneity of the substitution rate matrix R, and sequence length. We then evaluated both the frequency of GTR model inapplicability for estimating distances and the accuracy of the inferred distances. Our results indicated that inapplicability of Waddell and Steel’s procedure is a real, albeit relatively rare, practical issue, and illustrated that the probability of this inapplicability is a function of substitution rates and sequence length. We also discussed the implications of our results for current implementations of maximum likelihood and Bayesian methods.
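The inapplicability discussed above arises from taking the logarithm of a matrix with non-positive eigenvalues. The same phenomenon is visible in the scalar case of the Jukes-Cantor distance, sketched below; this is an illustration of the saturation problem under a restricted model, not the paper's GTR procedure:

```python
import math

def jc_distance(p):
    """Jukes-Cantor distance from the observed proportion p of
    differing sites: d = -(3/4) * ln(1 - 4p/3). For p >= 0.75 the
    log argument is non-positive and the formula is inapplicable,
    the scalar analogue of the negative matrix eigenvalues that
    make the GTR distance procedure fail."""
    arg = 1.0 - 4.0 * p / 3.0
    if arg <= 0.0:
        return None  # distance undefined: the sequences are saturated
    return -0.75 * math.log(arg)

print(jc_distance(0.10))  # modest divergence: finite distance
print(jc_distance(0.80))  # None: formula inapplicable
```

Short sequences and high substitution rates push the observed difference toward (and occasionally past) the saturation threshold by sampling error alone, which is why inapplicability becomes a function of both rate and sequence length.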
Much additional information is available in the original publications:
✓Gatto L., Catanzaro D. & M. C. Milinkovitch
Assessing the Applicability of the GTR Nucleotide Substitution Model Through Simulations
Evolutionary Bioinformatics Online 2: 153-163 (2006)
✓Catanzaro D., Pesenti R. & M. C. Milinkovitch
A nonlinear optimization procedure to estimate distances and instantaneous substitution rate matrices under the GTR model
Bioinformatics, 22: 708-715 (2006)
Other analytical developments and software
✓Cassens I., Mardulyn P. & M. C. Milinkovitch
Evaluating Intraspecific “Network” Construction Methods Using Simulated Sequence Data: Do Existing Algorithms Outperform the Global Maximum Parsimony Approach?
Systematic Biology 54: 363-372 (2005)
✓Rosa S., Milinkovitch M.C., Van Waerebeek K., Berck J., Oporto J., Alfaro-Shigueto J., Van Bressem M.F., Goodall N. & I. Cassens
Population structure of nuclear and mitochondrial DNA variation among South American Burmeister’s porpoises (Phocoena spinipinnis)
Conservation Genetics 6: 431–443 (2005)
✓Mardulyn P., Cassens I. & M. C. Milinkovitch
A comparison of methods for constructing evolutionary networks from intraspecific DNA sequence
Pages 104-120 (Chapter 5) in ‘Population Genetics for Animal Conservation’ (Bertorelle G, Bruford M.W.,
Hauffe H.C., Rizzoli A., & Vernesi C., eds.),
Cambridge University Press 2009
✓Guglielmini J., Szpirer C., & M. C. Milinkovitch
Automated Discovery and Phylogenetic Analysis of New Toxin-Antitoxin Systems
BMC Microbiology, 8: 104 (2008)
Note that TA systems have been domesticated as biotechnological tools for facilitating DNA engineering and protein production without the use of antibiotics:
✓Szpirer C. Y. & M. C. Milinkovitch
Separate-Component-Stabilization (SCS) System for protein and DNA production without the use of antibiotics
Biotechniques 38: 775-781 (2005)
✓The company Delphi Genetics develops and commercialises technologies and tools for facilitating DNA engineering (see our “Applied Evolutionary Genetics” web page).
Oligonucleotide design
We developed the OLIGOFAKTORY, a set of tools for the design, on an arbitrary number of target sequences, of high-quality long oligonucleotides for micro-arrays, primer pairs for PCR, siRNAs, and more. A unified presentation of results provides overviews with distribution charts and relative location bar graphs, as well as detailed features for each oligonucleotide. Input and output files conform to a common XML interchange file format to allow automatic generation of input data, archiving, and post-processing of results. The design pipeline can use BLAST servers to evaluate the specificity of selected oligonucleotides. The software is available HERE. Much additional information is available in the original publication:
✓Schretter C. & M. C. Milinkovitch
OligoFaktory: a visual tool for interactive oligonucleotide design
Bioinformatics, 22: 115-116 (2006)