Bioinformatics

& Softwares

 
 















Back to the
“MOLECULAR PHYLOGENETICS” page


Back to the
“LANE”
introduction page




PUBLICATIONS

  1. Bioinformatics, 17: 573-574 (2001)

  2. PNAS, 99: 10516-10521 (2002)

  3. Bioinformatics, 19: 1505–1513 (2003)

  4. Systematic Biology 54: 363-372 (2005)

  5. Bioinformatics, 22: 115-116 (2006)

  6. Bioinformatics, 22: 708-715 (2006)

  7. Bioinformatics Online 2: 153-163 (2006)

  8. BMC Evolutionary Biology 7:228 (2007)

  9. Bioinformatics, 24 (2):151-157 (2008)


SOFTWARES

  1. MetaPIGA v3:
    Check HERE the latest version of our software (BMC Bioinformatics 2010, 11: 379) implementing the metapopulation genetic algorithm (metaGA, PNAS 2002):
    an efficient solution to the problem of large phylogeny estimation

  2. MANTiS:
    Check here our application system (Bioinformatics 2008) which attempts filling the gap between multi-species full genome comparisons and functional analysis
  3. ProALign:
    Check here our probabilistic multiple-alignment
    program for molecular sequence data

  4. Other softwares:
    Check the list of our other programs at the beginning of this page


MAIN COLLABORATORS


Alan Lemmon
Was doing a pre-doctoral stay in Michel Milinkovitch’s lab and is now Assistant Professor at the University of Florida

  1. web page


Toni Galbadon
is group leader at the CRG-Centre for Genomic Regulation, Barcelona, Spain

  1. web page



 

List of softwares (see background info below)

  1. MetaPIGA v3.1 (also with Codon models GPU computing, and more)

                             
                             

Versatile and easy to use   --   For all types of users   --   GPU and multicore support


  1. MetaPIGA is a versatile and easy-to-use software that implements robust stochastic heuristics (including the Metapopulation Genetic Algorithm, metaGA) for large phylogeny inference under maximum likelihood. MetaPIGA allows analyses of binary and molecular data sets under multiple substitution models, Gamma rate heterogeneity, and data partitioning.  The software is for all types of users as it can be run through an extensive and ergonomic graphical interface, or by using batch files and console interface on your local machine or on distant servers.  MetaPIGA is platform independent, runs on 32- and 64-bits systems, and easily takes advantage of multiprocessor and/or multicore computers.
    Version 3 of MetaPIGA includes new functionalities such as:

  2. Maximum likelihood models for codon evolution (Goldman-Yang 1994 and Empirical Models) with access to multiple genetic codes (universal, mitochondria, chloroplaste, etc);

  3. Likelihood computation on CUDA-compatible Nvidia graphic cards (reducing run time by a factor of 10 to 20 for Protein/Codon data);

  4. Ancestral-state reconstruction using empirical Bayesian inference.

  5. Some of the other functionalities implemented in MetaPIGA are:

  6. Simple data quality control (testing for the presence of identical sequences as well as of excessively ambiguous or excessively-divergent sequences);

  7. Automated trimming of poorly aligned regions using the trimAl algorithm;

  8. The Likelihood Ratio Test, the Akaike Information Criterion, and the Bayesian Information Criterion methods for easy selection of the substitution model that best fits your data;

  9. Detailed monitoring of run progress;

  10. Convergence statistics for automatically defining when an analysis is complete;

  11. Ancestral-state reconstruction of all nodes in the tree;

  12. Viewing and manipulation of result trees.

  13. The metaGA resolves the major problem inherent to classical genetic algorithms by maintaining high inter-population variation even under strong intra-population selection.

  14. MetaPIGA v.3 and its manual can be downloaded from the new website at www.metapiga.org.
    Don’t hesitate to contact us (Dorde.Grbic-at-unige.ch or Michel.Milinkovitch-at-unige.ch) for additional questions, assistance, or bug reports.



  15. MANTiS

    An application system for genome comparisons within an explicit phylogenetic framework and integrating  functional + expression data. Check below for much additional information.
  16. TIQ
    A simple Java Web Sart application that allows browsing phylogenetic trees of toxins (from bacterial Toxin-Antitoxin systems). Check below for much additional information.

  17. ProAlign
    A software implementing a multiple-alignment method that combines a Hidden Markov model (HMM), a progressive alignment algorithm, and a probabilistic character substitution model. Check below for much additional information.
    An older (non statistical) approach for the identification of align positions that are unstable to variations of alignment costs was implemented in the program SOAP, available HERE.

  18. CombineTrees
    A software implementing an intra-specific network inference method (called ‘UMP’ for ‘Union of Most Parsimonious trees’) that unites all MP trees into a single (possibly reticulated) graph. See our “Genealogical Network Inference” page for much additional information.

  19. Oligofaktory
    a set of tools for the design, on an arbitrary number of target sequences, of high-quality long oligonucleotide for micro-array, of primer pair for PCR, of siRNA and more. Check below for additional information.

  20. LANE private softwares
    A series of programs for assisting ongoing research in our lab. Available only through the intranet.




Phylogeny inference

Optimality-criterion based phylogeny inference is a notoriously difficult endeavour because the number of solutions increases explosively (factorially) with the number of taxa. Indeed, the total number of possible unrooted, bifurcating tree topologies among T terminal taxa is

i.e.,

This corresponds to nearly 32 billion different trees for 14 taxa and 3x10^84 trees (i.e., more than the number of atoms in the known universe) for 55 taxa.  Phylogeny inference is a NP-hard combinatorial optimization problem because (i) no known algorithm can solve it in polynomial time, and (ii) demonstrating the existence of such an algorithm would imply that all NP problems have a polynomial-time solution. As most mathematicians expect that no such algorithm exists, one is forced to admit that no future civilization will ever build a computer capable of solving the problem while guaranteeing that the optimal solution has been found. In other words, the problem will not be resolved by future increase in computing power. This makes the task of phylogeneticists difficult because: (i) major advances in molecular biology and biotechnology have caused a dramatic increase in the number of DNA sequences available, stimulating researchers to increase their taxon sampling when performing tree reconstruction, (ii) several studies (including some of ours) have suggested that the accuracy of phylogeny inference increases with increased taxon sampling, and (iii) the most accurate methods of phylogeny estimation (those based on maximum likelihood) are also the most computation-demanding. Given the tremendous number of new questions in evolutionary biology that could be investigated through the use of larger taxon samplings, most researchers are ready to give up the quest for the absolute optimal tree, opting instead for the ability to analyze large data sets in practical computing times, provided that these methods yield optimal or near-optimal solutions with high probability. In response to this trend, much of the current research in phyloinformatics (i.e., computational phylogenetics) concentrates on the development of more efficient heuristic approaches.

Many existing heuristic methods are handicapped with the need to optimize model parameters (such as branch lengths) on each tree examined, a procedure that significantly slows computation time. Because of the need to optimize parameters at every step, these methods remain computationally impractical for more than 25–50 taxa, depending on the complexity of the ML model. Two classes of solutions have been developed in response to the problems associated with optimizing parameters on large trees. The first of these classes includes solutions that partition the large problem of tree reconstruction into many small subproblems whose solutions are then combined into a consensus global solution (for example, the quartet puzzling method implemented in Tree-Puzzle). The second class is comprised of stochastic heuristics that avoid optimization of numerous trees entirely. Instead, they incorporate methods that allow branch lengths and other model parameters to be optimized as the search proceeds, taking an inter-step optimization strategy. Stochastic simulated annealing (SSA), for example, is based on a simple branch-swapping algorithm, but incorporates a method of perturbing branch lengths at each iteration instead of requiring optimization of each potential solution. SSA avoids local optima by accepting changes that decrease the likelihood of the tree with a probability inversely proportional to the reduction in likelihood. Another, very popular, approach is the Bayesian method based on the Markov chain Monte Carlo (MCMC) algorithm. MCMC-based methods (as that implemented in the software MrBayes) also benefit from avoiding intra-step optimization, although they have a slightly different aim: sampling the distribution of tree space instead of only finding optimal trees.

Finally, Matsuda (1996), Lewis (1998), and Katoh et al. (2001) have recently applied the genetic algorithm (GA), a type of evolutionary computation method, to phylogeny inference. GAs implement a set of operators that mimic processes of biological evolution such as mutation, recombination, selection, and reproduction. After an initial step of generating a population, the individuals (specific solutions) within that population are:

  1. (i) subjected to mutation andor recombination and

  2. (ii)allowed to reproduce with a probability that is a function of their relative fitness value.

In the case of a phylogenetic inference problem, individuals are typically composed of topologies and model parameters (e.g., branch lengths, the transition/transversion ratio, rate heterogeneity parameters, etc.) that need to be optimized. A mutation is a stochastic alteration of one component of the individual (e.g., a topological rearrangement, a change in one branch length, or a random modification of a model parameter), and the fitness of an individual is a function of the score for that tree. Because selection retains the changes that improve the value of the optimality function, the mean score of the population of trees improves over time, i.e., across generations.


The metaGA (metapopulation Genetic Algorithm)

We report here on a GA, nicknamed the metapopulation GA (metaGA), that vastly improves the speed and efficiency with which ML trees are found (such that nucleotide sequence data sets incorporating hundreds of taxa can be analyzed in practical computing times) and yields a probability index for each branch. The essence of the procedure relies on the coexistence of two or more populations interacting in a metapopulation setting. As the metaGA involves several parallel searches, a large amount of inter-population variation can be maintained, even when each population is subjected to strong selection pressures. However, the spirit of the metaGA is that the populations are not fully independent as they are forced to cooperate in the search for the optimal tree. Within each population, trees are subjected to evaluation, selection, and mutation events as they would be in a single-population GA, but all topological operators are guided through inter-population comparisons. The metaGA is based on the assumption that these comparisons allow the identification of partitions that are correct (and should be
somewhat constrained) and regions that still need to be resolved. Communication among the populations is defined and controlled through a principle we named consensus pruning (CP) (see Fig. 1 on the right).


Figure: The principle of CP. Before a tree is subjected to mutation, its topology is compared with those of the best tree(s) from other populations; the consensus branches (indicated in bold red) define the partitions that can (green arrows) and cannot (red arrows) be affected by topological mutations; i.e., any operation moving a taxon across a consensus branch is prohibited. The method used to define a consensus branch depends on the CP option
chosen by the user (see original paper for details). TXS, taxa swap; SPR, Tree Pruning Regrafting.


Figure: Genetic algorithms are much faster than classical hill climbing methods. (a) Relative run times vs. number of taxa for the GA (one population), metaGA and the Stepwise Addition algorithm as well as for a single round of NNI, SPR, and TBR swapping (dotted lines), under the JC (blue) and HKY  rate heterogeneity (red) ML models. Standard errors (10 runs) are indicated for the GA and metaGA only. (b) Ratio between StepAdd and average metaGA run times (vs. number of taxa) under the JC (Lower) and HKY rate heterogeneity (Upper) ML models.



Figure: CP increases the speed of phylogeny inference. The garph shows the score of the best tree (320 taxa) for each of four populations run in parallel (no CP). At the time indicated by the red arrow, one population (red) was allowed to use the consensus information from the three other populations (black). This process demonstrates that scores increase faster when CP is enabled.



Figure: The metaGA is not only fast but also accurate. (a) Run times (JC model, 160 taxa) for a one-population GA search (dotted circle) as well as for two-, three-, four-, six-, eight-, and 16-population metaGA searches under each of the CP options. Run time increases polynomially with the number of populations. (b) Run time (JC model, 160 taxa) vs. percent error for two-, three-, four-, six-, eight-, and 16-population metaGA searches under each of five CP options. Probability and strict consensus options find excellent topologies as soon as the metaGA is run with more than three populations. The probability option is therefore optimal as it is significantly faster than the strict consensus
option.


Figure: A set of multiple metaGA searches produces trees and clades with frequencies that closely approximate their posterior probabilities, making the results of such a metaGA comparable to those provided by Bayesian MCMCMC approaches (such as implemented in MrBayes). The figure shows MetaGA branch support values (10 runs with probability CP among 10 populations; each dot therefore indicates the number of trees, among the 100 best trees produced, exhibiting a given partition) vs. (a) bootstrap values (100 replicates; random starting tree, SPR branch swapping) and (b) MrBayes posterior probabilities of partitions. A 55-taxa rbcL data set was analyzed under the JC model.




Check our full publication for much additional information (and references) :

  1. Lemmon A. R. & M. C. Milinkovitch
    The metapopulation genetic algorithm: an efficient solution for the problem of large phylogeny estimation
    PNAS, 99: 10516-10521 (2002)

  2. BulletOpen Access

  3. BulletSupplementary Material

  4. BulletCoverage: Science Editor Choice

  5. BulletDownload the metaPIGA software



MetaPIGA version 2

  1. Is much more robust and stable than version 1;

  2. Incorporates the DNA and protein substitution models (including GTR) ;

  3. Allows partitioning data;

  4. Allows performing batch searches;

  5. Incorporates multiple functionalities and an extensive graphical interface.


MetaPIGA version 3

  1. Maximum likelihood models for codon evolution (Goldman-Yang 1994 and Empirical Models) with access to multiple genetic codes (universal, mitochondria, chloroplaste, etc);

  2. Likelihood computation on CUDA-compatible Nvidia graphic cards (reducing run time by a factor of 10 to 20 for Protein/Codon data);

  3. Ancestral-state reconstruction using empirical Bayesian inference.




An Ant Colony Optimization (ACO) algorithm for phylogenetic estimation

We recently introduce an Ant Colony Optimization (ACO) algorithm to estimate phylogenies under the minimum evolution principle. ACO is an optimization technique inspired from the foraging behavior of real ant colonies. This behavior is exploited in artificial ant colonies for the search of approximate solutions to discrete optimization problems. Although much improvement in performances can probably be obtained through (i) modification of the local search phase, (ii) tuning of the ACO parameters (number of ants, relative weights of the heuristic information and stochastic pheromone parameters, etc), and (iii) implementation of speed-up procedures and optimization of code, we already demonstrated that the ant colony metaphor can solve instances of the phylogeny inference problem with reasonable efficiency. This is the first application of an ACO algorithm to the phylogenetic estimation problem.


Figure: Principle of the ACO algorithm. The core iteration includes three main steps: (i) the pheromone update phase during which artificial ants walk on a graph with all possible connections among the n taxa and (n - 2) internal nodes, and lay a trail of volatile pheromone on the branches of the starting tree; (ii) the stochastic construction phase during which new trees are built using both the heuristic information of the pairwise distances and the stochastic process guided by the newly-updated pheromone trail matrix (ants follow a given edge with a probability which is a function of the amount of pheromone on that edge); and (iii) the 2-OPT local search phase that corresponds to a local search using taxon swapping. The curved arrow indicates the stochastic jump of an ant from one edge to another.


Much additional information is available in the original publication:

  1. Catanzaro D., Pesenti R & M. C. Milinkovitch
    An Ant Colony Optimization Algorithm for Phylogeny Estimation under the Minimum Evolution Principle
    BMC Evolutionary Biology 7:228 (2007)

  2. BulletOpen Access


Multiple-sequence alignment

Optimal multiple alignment of molecular sequences is to homology estimation what an exact search is to tree inference: a NP-hard problem. Hence, it is, and will always be, computationally impractical for more than a few sequences. Yet, if new heuristics have been continuously developed for phylogeny inference, the problem of multiple alignment has been much less substantially investigated. Most researchers today pretend to ignore the issue and widely use profile-based variants of the progressive alignment methods. These approaches however require the subjective choice of insertion/deletion and substitution costs whose variations, even minimal, can dramatically impact the resulting alignment. One partial solution to this problem is provided by our software SOAP 1.0 (written in Java and available HERE) that allows for the identification of align positions that are unstable to variations of these costs. We then developed a multiple-alignment method that combines a Hidden Markov model (HMM), a progressive alignment algorithm, and a probabilistic character substitution model. This method (implemented in the program ProAlign 1.0 (and available at HERE) allows to compute the posterior probability of each alignment column. Gardner et al. (Nucleic Acids Research, 33: 2433–2439 [2005]) suggested that ProAlign yields the most accurate results when aligning structural RNA sequences. The method implemented in ProAlign has been implemented and extended in the software PRANK (developed in Nick Goldman’s group at EMBL-EBI). We do not intend to continue developing alignment methods in the near future.


Much additional information is available in the original publications:

  1. Löytynoja A. & M. C. Milinkovitch
    SOAP, a program for cleaning multiple alignments from unstable blocks
    Bioinformatics, 17: 573-574 (2001)

  2. BulletOpen Access

  3. Löytynoja A. & M. C. Milinkovitch
    A hidden Markov model for progressive multiple alignment
    Bioinformatics, 19: 1505–1513 (2003)

  4. Bullete-mail

  5. Check also Gardner et al. (2005): Nucleic Acids Research, Vol. 33, 2433-2439. They performed a benchmark of multiple sequence alignment programs upon structural RNAs, and their analyses indicate that ProAlign generally outperforms other sequence-based algorithms, under the broadest range of applications

  6. BulletGardner_etal_NAR_2005.pdf



Nucleotide substitution models

The general-time-reversible (GTR) model is one of the most popular models of nucleotide substitution because it constitutes a good trade-off between mathematical tractability and biological reality. However, when it is applied for inferring evolutionary distances and/or instantaneous rate matrices, the GTR model seems more prone to inapplicability than more restrictive time-reversible models. Although it has been previously noted that the causes for intractability are caused by the impossibility of computing the logarithm of a matrix characterised by negative eigenvalues, the issue had not been investigated further. Hence, we formally characterized the mathematical conditions, and discuss their biological interpretation, which lead to the inapplicability of the GTR model. We investigated the relations between, on one hand, the occurrence of negative eigenvalues and, on the other hand, both sequence length and sequence divergence. We then proposed a possible re-formulation of previous procedures in terms of a non-linear optimization problem. We analytically investigate the effect of our approach on the estimated evolutionary distances and transition probability matrix. We also provided an analysis on the goodness of the solution we propose. Finally, we simulated the evolution of  DNA sequences along trees characterized by different combinations of tree length, (non-)homogeneity of the substitution rate matrix R, and sequence  length. We then evaluated both the frequency of the GTR model inapplicability for estimating distances and the accuracy of inferred alignments. Our  results indicated that, inapplicability of the Waddel and Steel’s procedure is a real, albeit relatively rare, practical issue, and illustrated that the probability of this inapplicability is a function of substitution rates and sequence length.  We also discussed the implications of our results on the current implementations of maximum likelihood and Bayesian methods.


Much additional information is available in the original publications:

  1. Gatto L., Catanzaro D. & M. C. Milinkovitch
    Assessing the Applicability of the GTR Nucleotide Substitution Model Through Simulations
    Evolutionary Bioinformatics Online 2: 153-163 (2006)

  2. Bullete-mail

  3. Catanzaro D., Pesenti R. & M. C. Milinkovitch
    A nonlinear optimization procedure to estimate distances and instantaneous substitution rate matrices under the GTR model
    Bioinformatics, 22: 708-715 (2006)

  4. Bullete-mail



Other analytical developments and softwares


(1) Genealogical Network Inference. During our work on  the conservation genetics of dusky dolphins & Burmeister’s porpoises, we analysed sequence data with several widely used network estimation and rooting methods. The resulting intraspecific gene genealogies and rooting inferences exhibited substantial differences, underlying the limitations of some algorithms. Given that scientific hypotheses and management decisions depend strongly on inferred tree or network topologies, there is a clear need for a systematic comparative analysis of available methods. Furthermore, as available software packages implementing the global maximum parsimony (MP) approach only give the possibility to merge resulting topologies into less-resolved consensus trees, MP has often been neglected as an alternative approach to purely algorithmic (i.e., methods defined solely on the basis of an algorithm) “network” construction methods.  We therefore (i) evaluated the advantages and weaknesses of some of the most commonly used network methods, (ii) discussed how to choose an appropriate method for analyzing population sequence data, and (iii) present a new method (named “UMP”) for uniting all equally most parsimonious trees into a single (possibly reticulated) graph. Using simulated sequence data, we compared our method with three purely algorithmic and widely used graph construction approaches. Much additional information is available in our “Genealogical Network Inference” web page as well as in the following publications:
  1. Cassens I., Mardulyn P. & M. C. Milinkovitch
    Evaluating Intraspecific “Network” Construction Methods Using Simulated Sequence Data: Do Existing Algorithms Outperform the Global Maximum Parsimony Approach?
    Systematic Biology 54: 363-372 (2005)

  2. BulletOpen Access

  3. Rosa S., Milinkovitch M.C., Van Waerebeek K., Berck J., Oporto J., Alfaro-Shigueto J., Van Bressem M.F., Goodall N. & I. Cassens
    Population structure of nuclear and mitochondrial DNA variation among South American Burmeister’s porpoises (Phocoena spinipinnis)
    Conservation Genetics 6: 431–443 (2005)

  4. Bullete-mail

  5. Mardulyn P., Cassens I. & M. C. Milinkovitch
    A comparison of methods for constructing evolutionary networks from intraspecific DNA sequence
    Pages 104-120 (Chapter 5) in ‘Population Genetics for Animal Conservation’ (Bertorelle G, Bruford M.W.,
    Hauffe H.C., Rizzoli A., & Vernesi C., eds.),
    Cambridge University Press 2009

  6. Bullete-mail


(2) Browsing phylogenetic trees of bacterial toxins (from TA systems). Toxin-Antitoxin systems are plasmid spectacular maintenance mechanisms that make them purely "selfish DNA" candidates. A TA system is made of a gene coding for a cytotoxic stable protein preceded by a gene coding for an unstable antitoxin. The toxin being more stable than the antitoxin, absence of the operon causes a reduction of the amount of the latter relative to the amount of the former. Thus, a cell exhibiting a TA system on a plasmid is 'condemned' either not to loose it or to die. In the context of our
phylogenetic analyses of TA systems, we developed an algorithm that searches public databases on the basis of predefined similarity and TA-specific structural constraints. This approach, using a single starting query sequence for each of the TA families identified over 1,500 putative TA systems. These groups of sequences were analyzed phylogenetically for a better classification and understanding of TA systems evolution. The resulting phylogenetic trees are available for browsing and searching through a small java program (TIQ) available HERE. Much additional information is available in our full publication:
  1. Guglielmini J., Szpirer C., & M. C. Milinkovitch
    Automated Discovery and Phylogenetic Analysis of New Toxin-Antitoxin Systems
    BMC Microbiology, 8: 104 (2008)

  2. BulletOpen Access

Note that TA systems have been domesticated as biotechnological tools for facilitating DNA engineering and protein production without the use of antibiotics:

  1. Szpirer C. Y. & M. C. Milinkovitch
    Separate-Component-Stabilization (SCS) System for protein and DNA production without the use of antibiotics
    Biotechniques 38: 775-781 (2005)

  2. BulletOpen Access

  3. The company Delphi Genetics develops and commercialise technologies and tools for facilitating DNA engineering (see our “Applied Evolutionary Genetics” web page).

    

(3) Oligonucleotide design. We developed the OLIGOFAKTORY, a set of tools for the design, on an arbitrary number of target sequences, of high-quality long oligonucleotide for micro-array, of primer pair for PCR, of siRNA and more. A unified presentation of results provides overviews with distribution charts and relative location bar graphs, as well as detailed features for each oligonucleotide. Input and output files conform to a common XML interchange file format to allow both automatic generation of input data, archiving, and postprocessing of results. The design pipeline can use BLAST servers to evaluate specificity of selected oligonucleotides. The software is available HERE. Much additional information is available in the original publications:

  1. Schretter C. & M. C. Milinkovitch
    OligoFaktory: a visual tool for interactive oligonucleotide design
    Bioinformatics, 22: 115-116 (2006)

  2. BulletOpen Access


Go to the intranet

Back to top