
The European Conference on Evolutionary Computation, Machine Learning and Data Mining in Computational Biology

EvoBIO 2013 Accepted Papers

EvoBIO is a multidisciplinary conference that brings together researchers in Bioinformatics and Computational Biology who apply advanced techniques from Evolutionary Computation, Machine Learning, and Data Mining to important problems in biology, from the molecular and genomic dimension to the individual and population level. The primary focus of the conference is to present the latest advances of these approaches for Bioinformatics and Computational Biology, and to provide a forum for the discussion of new research directions.

The 11th edition of EvoBIO will take place in the wonderful location of Vienna, Austria. The conference will be held in conjunction with EuroGP (16th European Conference on Genetic Programming), EvoCOP (13th European Conference on Evolutionary Computation in Combinatorial Optimisation), EvoMUSART (11th European Conference on Evolutionary and Biologically Inspired Music, Sound, Art and Design), and EvoApplications 2013, the specialist conference on a range of evolutionary computation topics and applications.

For more information, visit our web page at http://www.evostar.org, follow us on Twitter @EVOBio2013, join our evobio2013 group on LinkedIn and our EvoBIO group on Facebook, or email evobio.conference(at)gmail.com.

Areas of Interest and Contributions

The conference invites contributions on the application of techniques typical of the Evolutionary Computation, Machine Learning, and Data Mining domains to address problems in Bioinformatics and Computational Biology.

Topics include but are not limited to:

Publication Details

Each accepted paper will be presented orally or as a poster at the conference and will be printed in the proceedings published by Springer Verlag in the LNCS series. The accepted papers of the previous editions of EvoBIO were also published in Springer Verlag LNCS. The first 4 editions (2003–2006) were published in LNCS 2611, 3005, 3449, and 3907, together with the remaining EvoWorkshops. In 2007 EvoBIO became an EVO* conference, and its proceedings have since been published in Springer Verlag LNCS 4447, 4973, 5483, 6023, 6622, and 7246, respectively.

Post-conference Journal Publication

BioData Mining Journal

The BioData Mining journal (published by BioMed Central) is sponsoring the EvoBIO best paper candidates and award, and will publish the best paper. Authors of papers nominated for the best paper award will also be invited to expand their articles for post-conference publication in BioData Mining, with a 25% discount on the publication fees. Moreover, the authors of the paper that wins the Best Paper Award will receive a full waiver of the publication fees.

Special joint session with EuroGP

Following the success of last year's joint session, we especially encourage submissions applying genetic programming to problems in computational biology and new biology-inspired extensions of the genetic programming framework for a special joint session of EvoBIO with EuroGP. These papers can be submitted either to EvoBIO or EuroGP.

Submission Details

Submissions must be original and not published elsewhere. Each submission will be peer reviewed by at least three members of the program committee. The authors of accepted papers will have to improve their paper on the basis of the reviewers’ comments and will be asked to send a camera-ready version of their manuscript. At least one author of each accepted work must register for the conference, attend, and present the work.

The reviewing process will be double-blind; please omit information about the authors in the submitted paper. Submit your manuscript in Springer LNCS format at http://myreview.csregistry.org/evobio13/. EvoBIO is interested in papers in three major areas:

  1. Full research articles (maximum 12 pages) describing new methodologies, approaches, and/or applications (oral or poster presentation).
  2. Short reports (maximum 8 pages, poster presentation) describing new methodologies, approaches, and/or applications, and System Demonstrations (maximum 8 pages) outlining the nature of the system and describing why the demonstration is likely to be of interest for the conference. Demonstrations of interest include systems under development or in active use in research or practice domains. Authors of selected demo submissions may be asked to give an oral presentation in the conference sessions.
  3. Abstracts (maximum 4 pages) discussing work previously published in a journal: it is therefore essential that a reference to the previous article is clearly cited in the abstract (oral or poster presentation).

Submission deadline: 1 November 2012 (extended to 11 November 2012)

Programme Committee





Feature Selection and Classification of High Dimensional Mass Spectrometry Data: A Genetic Programming Approach
Soha Ahmed, Mengjie Zhang, Lifeng Peng
Biomarker discovery using mass spectrometry (MS) data is very useful in disease detection and drug discovery. The process of biomarker discovery in MS data must start with feature selection as the number of features in MS data is extremely large (e.g. thousands) while the number of samples is comparatively small. In this study, we propose the use of genetic programming (GP) for automatic feature selection and classification of MS data. This GP based approach works by using the features selected by two feature selection metrics, namely information gain (IG) and relief-f (REFS-F) in the terminal set. The feature selection performance of the proposed approach is examined and compared with IG and REFS-F alone on five MS data sets with different numbers of features and instances. Naive Bayes (NB), support vector machines (SVMs) and J48 decision trees (J48) are used in the experiments to evaluate the classification accuracy of the selected features. Meanwhile, GP is also used as a classification method in the experiments and its performance is compared with that of NB, SVMs and J48. The results show that GP as a feature selection method can select a smaller number of features with better classification performance than IG and REFS-F using NB, SVMs and J48. In addition, GP as a classification method also outperforms NB and J48 and achieves comparable or slightly better performance than SVMs on these data sets.


Hybrid Multiobjective Artificial Bee Colony with Differential Evolution Applied to Motif Finding (EvoBIO Best Paper Candidate)
David L. González-Álvarez, Miguel A. Vega-Rodríguez, Juan A. Gómez-Pulido, Juan M. Sánchez-Pérez
The Multiobjective Artificial Bee Colony with Differential Evolution (MO-ABC/DE) is a new hybrid multiobjective evolutionary algorithm proposed for solving optimization problems. One important optimization problem in Bioinformatics is the Motif Discovery Problem (MDP), applied to the specific task of discovering DNA patterns (motifs) with biological significance, such as DNA-protein binding sites, replication origins or transcriptional DNA sequences. In this work, we apply the MO-ABC/DE algorithm to the MDP, using as a benchmark genomic data from four organisms: Drosophila melanogaster, Homo sapiens, Mus musculus, and Saccharomyces cerevisiae. To demonstrate the good performance of our algorithm, we have compared its results with those obtained by four multiobjective evolutionary algorithms, and its predictions with those made by thirteen well-known biological tools. As we will see, the proposed algorithm achieves good results from both the computer science and the biology points of view.


Multiple Threshold Spatially Uniform ReliefF for the Genetic Analysis of Complex Human Diseases
Delaney Granizo-Mackenzie, Jason H. Moore
Detecting genetic interactions without running an exhaustive search is a difficult problem. We present a new heuristic, multiSURF*, which can detect these interactions with high accuracy and in time linear in the number of genes. Our algorithm is an improvement over the SURF* algorithm, which detects genetic signals by comparing individuals close to, and far from, one another and noticing whether differences correlate with different disease statuses. Our improvement consistently outperforms SURF* while providing a large runtime decrease by examining only individuals very near and very far from one another. Additionally we perform an analysis on real data and show that our method provides new information. We conclude that multiSURF* is a better alternative to SURF* in both power and runtime.


Supervising Random Forest Using Attribute Interaction Networks
Qinxin Pan, Ting Hu, James D. Malley, Angeline S. Andrew, Margaret R. Karagas, Jason H. Moore
Genome-wide association studies (GWAS) have become a powerful and affordable tool to study the genetic variations associated with common human diseases. However, only a few of the loci found are associated with a moderate or large increase in disease risk, and therefore using GWAS findings to study the underlying biological mechanisms remains a challenge. One possible cause of the "missing heritability" is gene-gene interaction, or epistasis. Several methods have been developed to detect it, and among them Random Forest (RF) is a popular one. RF has been successfully applied in many studies. However, it is also known to rely on marginal main effects. Meanwhile, networks have become a popular approach for characterizing the space of pairwise interactions systematically, which can be informative for classification problems. In this study, we compared the findings of the Mutual Information Network (MIN) to those of RF and observed that the variables identified by the two methods overlap, with some differences. To integrate the advantages of MIN into RF, we proposed a hybrid algorithm, MIN-guided RF (MINGRF), which overlays the neighborhood structure of MIN onto the growth of trees. After comparing MINGRF to the standard RF on a bladder cancer dataset, we conclude that MINGRF produces trees with better accuracy at a smaller computational cost.



ACO-based Bayesian Network Ensembles for the Hierarchical Classification of Ageing-Related Proteins
Khalid Salama, Alex Freitas
The task of predicting protein functions using computational techniques is a major research area in the field of bioinformatics. Casting the task into a classification problem makes it challenging, since the classes (functions) to be predicted are hierarchically related, and a protein can have more than one function. One approach is to produce a set of local classifiers; each is responsible for discriminating between a subset of the classes in a certain level of the hierarchy. In this paper we tackle the hierarchical classification problem in a local fashion, by learning an ensemble of Bayesian network classifiers for each class in the hierarchy and combining their outputs with four alternative methods: a) selecting the best classifier, b) majority voting, c) weighted voting, and d) constructing a meta-classifier. The ensemble is built using ABC-Miner, our recently introduced Ant-based Bayesian Classification algorithm. We use different types of protein representations to learn different classification models. We empirically evaluate our proposed methods on an ageing-related protein dataset created for this research.
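
As a rough illustration of two of the combination schemes named in the abstract above, majority voting and weighted voting, the sketch below combines the class labels predicted by several classifiers. It is not the authors' ABC-Miner implementation: the function names, the example labels, and the idea of using one confidence weight per classifier (for instance, a validation accuracy) are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    # most_common(1) yields the (label, count) pair with the highest count.
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    weights[i] is a hypothetical confidence score for classifier i,
    e.g. its accuracy on a validation set (an assumption of this sketch).
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical outputs of three classifiers for one protein and one class.
votes = ["yes", "no", "no"]
print(majority_vote(votes))                   # two of three classifiers say "no"
print(weighted_vote(votes, [0.9, 0.3, 0.3]))  # "yes" carries 0.9 against 0.6
```

The two schemes can disagree, as in this example, which is one reason an empirical comparison of combination methods is informative.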


Inferring Human Phenotype Networks from Genome-Wide Genetic Associations (EvoBIO Best Paper Candidate)
Christian Darabos, Kinjal Desai, Richard Cowper-Sallari, Mario Giacobini, Britney E. Graham, Mathieu Lupien, Jason H. Moore
Networks are commonly used to represent and analyze large and complex systems of interacting elements. We build a human phenotype network (HPN) of over 600 physical attributes, diseases, and behavioral traits, based on more than 6,000 genetic variants (SNPs) from Genome-Wide Association Studies data. Using phenotype-to-SNP associations and HapMap project data, we link traits based on the common patterns of human genetic variations, expanding previous studies from a gene-centric approach to that of shared risk-variants. The resulting network has a heavily right-skewed degree distribution, placing it in the scale-free region of the network topologies spectrum. Additional network metrics hint that the HPN shares properties with social networks. Using a standard community detection algorithm, we construct phenotype modules of similar traits without applying expert biological knowledge. These modules can be assimilated to the disease classes. However, we are able to classify phenotypes according to shared biology, and not arbitrary disease classes. We present a collection of documented clinical connections supported by the network. Furthermore, we highlight phenotype modules and links that may underlie yet undiscovered genetic interactions. Despite its simplicity and current limitations, the HPN shows tremendous potential to become a useful tool both in unveiling the common biology of diseases and in the elaboration of diagnoses and treatments.


Dimensionality reduction via Isomap with lock-step and elastic measures for time series gene expression classification
Carlotta Orsenigo, Carlo Vercellis
Isometric feature mapping (Isomap) has shown high potential for nonlinear dimensionality reduction in a wide range of application domains. Isomap finds low-dimensional data projections by preserving global geometrical properties, which are expressed in terms of the Euclidean distances among points. In this paper we investigate the use of a recent variant of Isomap, called double-bounded tree-connected Isomap (dbt-Isomap), for dimensionality reduction in the context of time series gene expression classification. In order to deal with the projection of temporal sequences, dbt-Isomap is combined with different lock-step and elastic measures which have been extensively proposed to evaluate time series similarity. These are represented by three Lp-norms, dynamic time warping and the distance based on the longest common subsequence model. Computational experiments concerning the classification of two time series gene expression data sets showed the usefulness of dbt-Isomap for dimensionality reduction. Moreover, they highlighted the effectiveness of the L1-norm, which appeared as the best alternative to the Euclidean metric for time series gene expression embedding.
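
To make the distinction between lock-step and elastic measures concrete, here is a minimal sketch of an Lp-norm distance (lock-step: the i-th point of one series is compared only with the i-th point of the other) and of dynamic time warping (elastic: points may be aligned across shifted positions). The abstract does not say which three Lp-norms were used, so the general `p` parameter, the function names, and the absolute-difference local cost in DTW are assumptions of this sketch, not the authors' code.

```python
import math


def lp_distance(a, b, p=2.0):
    """Lock-step Lp distance between two series of equal length."""
    if len(a) != len(b):
        raise ValueError("lock-step measures need series of equal length")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(p):
        # L-infinity norm: the largest pointwise difference.
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def dtw_distance(a, b):
    """Dynamic time warping distance with absolute-difference local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a
    # with the first j points of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances alone
                                    cost[i][j - 1],      # b advances alone
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Hypothetical expression profiles: the second repeats one time point.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # elastic alignment absorbs the repeat
print(lp_distance([1, 2, 3], [2, 2, 5], p=1))
```

Note that the lock-step function rejects series of different lengths, whereas DTW handles them directly; this is the practical trade-off between the two families.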


Knowledge-constrained K-medoids Clustering of Regulatory Rare Alleles for Burden Tests
R. Michael Sivley, Alexandra E. Fish, William S. Bush
Rarely occurring genetic variants are hypothesized to influence human diseases, but statistically associating these rare variants with disease is challenging due to a lack of statistical power in most feasibly sized datasets. Several statistical tests have been developed to either collapse multiple rare variants from a genomic region into a single variable (presence/absence) or to tally the number of rare alleles within a region, relating the burden of rare alleles to disease risk. Both these approaches, however, rely on user-specification of a genomic region to generate these collapsed or burden variables, usually an entire gene. Recent studies indicate that most risk variants for common diseases are found within regulatory regions, not genes. To capture the effect of rare alleles within non-genic regulatory regions for burden tests, we contrast a simple sliding window approach with a knowledge-guided k-medoids clustering method to group rare variants into statistically powerful, biologically meaningful windows. We apply these methods to detect genomic regions that alter expression of nearby genes.


Time-point Specific Weighting Improves Coexpression Networks from Time-course Experiments (EvoBIO Best Paper Candidate)
Jie Tan, Gavin Grant, Michael Whitfield, Casey Greene
Integrative systems biology approaches build, evaluate, and combine data from thousands of diverse experiments. These strategies rely on methods that effectively identify and summarize gene-gene relationships within individual experiments. For gene-expression datasets, the Pearson correlation is often applied to build coexpression networks because it is both easily interpretable and quick to calculate. Here we develop and evaluate weighted Pearson correlation approaches that better summarize gene expression data into coexpression networks for synchronized cell cycle time-course experiments. These methods use experimental measurements of cell cycle synchrony to estimate appropriate weights through either sliding window or linear regression approaches. We show that these weights improve our ability to build coexpression networks capable of identifying phase-specific functional relationships between genes. We evaluate our method on diverse experiments and find that both weighted strategies outperform the traditional method. This weighted correlation approach is implemented in the Sleipnir library, an open source library used for integrative systems biology. Integrative approaches using properly weighted time-course experiments will provide a more detailed understanding of the processes studied in such experiments.
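
The weighted Pearson correlation mentioned in the abstract above can be sketched as follows. This is a generic textbook formulation, not the Sleipnir implementation: the function name and the example values are hypothetical, and the weights are supplied by hand here, whereas the paper estimates them from measurements of cell cycle synchrony.

```python
import math


def weighted_pearson(x, y, w):
    """Pearson correlation of x and y with non-negative per-sample weights w."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    total = sum(w)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted means of both series.
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted covariance and variances around those means.
    cov = sum(wi * (xi - mean_x) * (yi - mean_y)
              for wi, xi, yi in zip(w, x, y)) / total
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / total
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression of two genes over four time points; the last
# point stands in for a late, poorly synchronized measurement.
gene_a = [1.0, 2.0, 3.0, 10.0]
gene_b = [1.0, 2.0, 3.0, -5.0]

print(weighted_pearson(gene_a, gene_b, [1, 1, 1, 1]))  # equal weights: ordinary Pearson
print(weighted_pearson(gene_a, gene_b, [1, 1, 1, 0]))  # last time point ignored
```

With equal weights the function reduces to the ordinary Pearson correlation, so down-weighting a time point is the only thing that changes the result between the two calls.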


Structured populations and the maintenance of sex
Peter A. Whigham, Grant Dick, Alden Wright, Hamish G. Spencer
The maintenance of sexual populations has been an ongoing issue for evolutionary biologists, largely due to the two-fold cost of sexual versus asexual reproduction. Many explanations have been proposed for the benefits of sex, including the role of recombination in maintaining diversity and the elimination of detrimental mutations, the advantage of sex in rapidly changing environments, and the role of spatial structure, finite population size and drift. Many computational models have been developed to explore theories relating to sexual populations; this paper examines the role of spatial structure in supporting sexual populations, based on work originally published in 2006. We highlight flaws in the original model and develop a simpler, more plausible model that demonstrates the role of mutation, local competition and dispersal in maintaining sexual populations.



Mining for Variability in the Coagulation Pathway: A Systems Biology Approach
Davide Castaldi, Daniele Maccagnola, Daniela Mari, Francesco Archetti
In this paper, the authors perform a variability analysis of a Stochastic Petri Net (SPN) model of the Tissue Factor induced coagulation cascade, one of the most complex biochemical networks. This pathway has been widely analyzed in the literature, mostly with ordinary differential equations, outlining the general behaviour but without pointing out the intrinsic variability of the system. The SPN formalism can introduce uncertainty to capture this variability and, through computer simulation, makes it possible to generate analyzable time series over a broad range of conditions to characterize the trend of the main system molecules. We provide a useful tool for the development and management of several observational studies, potentially customizable for each patient. The SPN has been simulated using the Tau-Leaping Stochastic Simulation Algorithm and, in order to simulate a large number of models and test different scenarios, the simulations are run using High Performance Computing. We analyze different settings of the model representing the cases of healthy and different unhealthy subjects, comparing and testing their variability in order to gain valuable biological insights.


Cell-based Metrics Improve the Detection of Gene-Gene Interactions using Multifactor Dimensionality Reduction
Jonathan M. Fisher, Peter Andrews, Jeff Kiralis, Nicholas A. Sinnott-Armstrong, Jason H. Moore
Multifactor Dimensionality Reduction (MDR) is a widely-used data-mining method for detecting and interpreting epistatic effects that do not display significant main effects. MDR produces a reduced-dimensionality representation of a dataset which classifies multi-locus genotypes into either high- or low-risk groups. The weighted fraction of cases and controls correctly labelled by this classification, the balanced accuracy, is typically used as a metric to select the best or most-fit model. We propose two new metrics for MDR to use in evaluating models, Variance and Fisher, and compare those metrics to two previously-used MDR metrics, Balanced Accuracy and Normalized Mutual Information. We find that the proposed metrics consistently outperform the existing metrics across a variety of scenarios.
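
For readers unfamiliar with the baseline metric, the sketch below computes balanced accuracy for an MDR-style high/low-risk labelling of genotype cells. It covers only the existing Balanced Accuracy metric, not the proposed Variance and Fisher metrics, whose definitions the abstract does not give. The function name, the example counts, and the labelling rule (a cell is high-risk when its case:control ratio exceeds the overall ratio, with ties counted as low-risk) are assumptions of this sketch.

```python
def mdr_balanced_accuracy(cells):
    """Balanced accuracy of an MDR-style high/low-risk labelling.

    cells maps a multi-locus genotype to a (cases, controls) count pair.
    """
    total_cases = sum(cases for cases, _ in cells.values())
    total_controls = sum(controls for _, controls in cells.values())
    if total_cases == 0 or total_controls == 0:
        raise ValueError("need at least one case and one control")

    true_pos = 0  # cases that fall in cells labelled high-risk
    true_neg = 0  # controls that fall in cells labelled low-risk
    for cases, controls in cells.values():
        # Assumed rule: high-risk if cases/controls > total_cases/total_controls,
        # written as a cross-multiplication to avoid dividing by zero.
        if cases * total_controls > controls * total_cases:
            true_pos += cases
        else:
            true_neg += controls

    sensitivity = true_pos / total_cases
    specificity = true_neg / total_controls
    return (sensitivity + specificity) / 2.0


# Hypothetical two-locus genotype counts: (cases, controls) per cell.
cells = {
    ("AA", "BB"): (30, 10),
    ("AA", "Bb"): (5, 20),
    ("Aa", "BB"): (15, 15),
    ("aa", "bb"): (10, 15),
}
print(round(mdr_balanced_accuracy(cells), 3))  # → 0.667
```

Averaging sensitivity and specificity keeps the score meaningful when cases and controls are unbalanced, which plain accuracy does not.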


Impact of Different Recombination Methods in a Mutation-Specific MOEA for a Biochemical Application
Susanne Rosenthal, Nail El-Sourani, Markus Borschbach
Peptides play a key role in the development of both drug candidates and diagnostic interventions. The design of peptides is cost-intensive and difficult in general for several well-known reasons. Multi-objective evolutionary algorithms (MOEAs) provide adequate in silico methods for finding optimal peptide sequences that optimize several molecular properties. A mutation-specific fast non-dominated sorting GA (termed MSNSGA-II) was especially designed for this purpose. In this work, an empirical study is presented of the performance of MSNSGA-II when optionally extended by three different recombination operators. The main idea is to gain insight into the significance of recombination for the performance of MSNSGA-II in general, and to improve the performance with these intuitive recombination methods for biochemical optimization. The benchmark test for this study is a three-dimensional optimization problem, using fitness functions provided by the BioJava library.


Optimal Use of Biological Expert Knowledge from Literature Mining in Ant Colony Optimization for Analysis of Epistasis in Human Disease
Arvis Sulovari, Jeff Kiralis, Jason H. Moore
The fast measurement of millions of sequence variations across the genome is possible with current technology. As a result, a difficult challenge arises in bioinformatics: the identification of combinations of interacting DNA sequence variations predictive of common disease [1]. The Multifactor Dimensionality Reduction (MDR) method is capable of analysing such interactions, but an exhaustive MDR search would require exponential time. Thus, we use Ant Colony Optimization (ACO) as a stochastic wrapper. It has been shown by Greene et al. that this approach, if expert knowledge is incorporated, is effective for analysing large amounts of genetic variation [2]. In the ACO method integrated in the MDR package, a linear and an exponential probability distribution function can be used to weigh the expert knowledge. We generate our biological expert knowledge from a network of gene-gene interactions produced by a literature mining platform, Pathway Studio. We show that the linear distribution function is the most appropriate to weigh our scores when expert knowledge from literature mining is used. We find that ACO parameters significantly affect the power of the method and we suggest values for these parameters that can be used to optimize MDR in Genome Wide Association Studies that use biological expert knowledge.


Emergence of motifs in model gene regulatory networks
Marcin Zagórski
Gene regulatory networks arise in all living cells, allowing the control of gene expression patterns. The study of their circuitry has revealed that certain subgraphs of interactions or motifs appear at anomalously high frequencies. We investigate here whether the overrepresentation of these motifs can be explained by the functional capabilities of these networks. Given a framework for describing regulatory interactions and dynamics, we consider in the space of all regulatory networks those that have a prescribed function. Markov Chain Monte Carlo sampling is then used to determine how these functional networks lead to specific motif statistics in the interaction structure. We conclude that different classes of network motifs are found depending on the functional constraint (multi-stability or oscillatory behaviour) imposed on the system evolution. The discussed computational framework can also be used for predicting regulatory interactions, if only the experimental gene expression pattern is provided.


An Evolutionary Approach to Wetlands Design
Marco Gaudesi, Andrea Marion, Tommaso Musner, Giovanni Squillero, Alberto Tonda
Wetlands are artificial basins that exploit the capabilities of some species of plants to purify water from pollutants. The design process is currently long and laborious: vegetated areas are inserted within the basin by trial and error, since there is no automatic system able to maximize the filtering efficiency. Only after several attempts are experts able to determine the most convenient configuration and choose a layout. This paper proposes the use of an evolutionary algorithm to automate both the placement and the sizing of vegetated areas within a basin. The process begins from a random population of solutions and, evaluating their efficiency with a state-of-the-art fluid-dynamics simulation framework, the evolutionary algorithm is able to automatically find optimized solutions whose performance is comparable with that achieved by human experts.


Improving the Performance of CGPANN for Breast Cancer Diagnosis using Crossover and Radial Basis Functions
Timmy Manning, Paul Walsh
Recently published evaluations of the topology and weight evolving artificial neural network algorithm Cartesian genetic programming evolved artificial neural networks (CGPANN) have suggested it as a potentially powerful tool for bioinformatics problems. In this paper we provide an overview of the CGPANN algorithm and a brief case study of its application to the Wisconsin breast cancer diagnosis problem. Following from this, we introduce and evaluate the use of RBF kernels and crossover in CGPANN as a means of increasing performance and consistency.


A Multiobjective Proposal Based on the Firefly Algorithm for Inferring Phylogenies
Sergio Santander-Jiménez, Miguel A. Vega-Rodríguez
Recently, swarm intelligence algorithms have been applied successfully to a wide variety of optimization problems in Computational Biology. Phylogenetic inference represents one of the key research topics in this area. Throughout the years, controversy among biologists has arisen when dealing with this well-known problem, as different optimality criteria can yield discordant genealogical relationships. Current research efforts aim to apply multiobjective optimization techniques in order to infer phylogenies that represent a consensus between different principles. In this work, we apply a multiobjective swarm intelligence approach inspired by the behaviour of fireflies to tackle the phylogenetic inference problem according to two criteria: maximum parsimony and maximum likelihood. Experiments on four real nucleotide data sets show that this novel proposal can achieve promising results in comparison with other state-of-the-art approaches in phylogenetics.


Hybrid Genetic Algorithms for Stress Recognition in Reading
Nandita Sharma, Tom Gedeon
Stress is a major problem facing our world today and affects everyday lives, providing motivation to develop an objective understanding of stress during typical activities. Physiological and physical response signals showing symptoms of stress can be used to provide hundreds of features. This raises the problem of selecting appropriate features for stress recognition from a set that may include irrelevant, redundant or corrupted features. In addition, there is the problem of selecting an appropriate computational classification model with optimal parameters to capture general stress patterns. The aim of this paper is to determine whether stress can be detected from individual-independent computational classification models with a genetic algorithm (GA) optimization scheme from sensor sourced stress response signals induced by reading text. The GA was used to select stress features, select a type of classifier and optimize the classifier's parameters for stress recognition. The classification models used were artificial neural networks (ANNs) and support vector machines (SVMs). Stress recognition rates obtained from an ANN and an SVM without a GA were 68% and 67% respectively. With a GA hybrid, the stress recognition rate improved to 89%. The improvement shows that a GA has the capacity to select salient stress features and define an optimal classification model with optimized parameter settings for stress recognition.