Hessian Complexity Measure for Genetic Programming-based Imputation Predictor Selection in Symbolic Regression with Incomplete Data
Baligh Al-Helali, Qi Chen, Bing Xue and Mengjie Zhang
Missing values bring several challenges when learning from real-world data sets. Imputation is a widely adopted approach to estimating missing values. However, it has not been adequately investigated in symbolic regression. When imputing the missing values in an incomplete feature, the other features that are used in the prediction process are called imputation predictors. In this work, a method for imputation predictor selection using regularized genetic programming (GP) models is presented for symbolic regression tasks on incomplete data. A complexity measure based on the Hessian matrix of the phenotype of the evolving models is proposed. It is employed as a regularizer in the fitness function of GP for model selection and the imputation predictors are selected from the selected models. In addition to the baseline which uses all the available predictors, the proposed selection method is compared with two GP-based feature selection variations: the standard GP feature selector and GP with feature selection pressure. The trends in the results reveal that in most cases, using the predictors selected by regularized GP models could achieve a considerable reduction in the imputation error and improve the symbolic regression performance as well.
Seeding Grammars in Grammatical Evolution to Improve Search Based Software Testing
Muhammad Sheraz Anjum and Conor Ryan
Software-based optimization techniques have been increasingly used to automate code coverage analysis since the nineties. Although several studies suggest that interdependencies can exist between condition constructs in branching conditions of real life programs e.g. (i<=100) or (i==j), etc., to date, only the Ariadne system, a Grammatical Evolution (GE)-based Search Based Software Testing (SBST) technique, exploits interdependencies between variables to efficiently automate code coverage analysis. Ariadne employs a simple attribute grammar to exploit these dependencies, which enables it to very efficiently evolve highly complex test cases, and has been compared favourably to other well-known techniques in the literature. However, Ariadne does not benefit from the interdependencies involving constants e.g. (i<=100), which are equally important constructs of condition predicates. Furthermore, constant creation in GE can be difficult, particularly with high precision. We propose to seed the grammar with constants extracted from the source code of the program under test in order to enhance and extend Ariadne’s capability to exploit richer types of dependencies (involving all combinations of both variables and constant values). We compared our results with the original system of Ariadne against a large set of benchmark problems which include 10 numeric programs in addition to the ones originally used for Ariadne. Our results demonstrate that the seeding strategy not only dramatically improves the generality of the system, as it improved the code coverage (effectiveness) by impressive margins, but it also reduces the search budgets (efficiency) often up to an order of magnitude.
Incremental Evolution and Development of Deep Artificial Neural Networks
Filipe Assunção, Nuno Lourenço, Bernardete Ribeiro and Penousal Machado
NeuroEvolution (NE) methods are known for applying Evolutionary Computation to the optimisation of Artificial Neural Networks (ANNs). Despite aiding non-expert users to design and train ANNs, the vast majority of NE approaches disregard the knowledge that is gathered when solving other tasks, i.e., evolution starts from scratch for each problem, ultimately delaying the evolutionary process. To overcome this drawback, we extend Fast Deep Evolutionary Network Structured Representation (Fast-DENSER) to incremental development. We hypothesise that by re-using the knowledge gained from previous tasks we can attain superior results and speedup evolution. The results show that the average performance of the models generated by incremental development is statistically superior to non-incremental. In case the number of evaluations performed by incremental development is smaller than the performed by non-incremental development the attained results are similar in performance, which indicates that incremental development speeds up evolution. Lastly, the models generated using incremental development are more robust, and thus perform better, without further evolution, on unseen problems.
Comparing Genetic Programming Approaches for Non-Functional Genetic Improvement
Aymeric Blot and Justyna Petke
Genetic improvement (GI) uses automated search to find improved versions of existing software. While most GI work use genetic programming (GP) as the underlying search process, focus is usually given to the target software only. As a result, specifics of GP algorithms for GI are not well understood and rarely compared to one another. In this work, we propose a robust experimental protocol to compare different GI search processes and investigate several variants of GP- and random-based approaches. Through repeated experiments, we report a comparative analysis of these approaches, using one of the previously used GI scenarios: improvement of runtime of the MiniSAT satisfiability solver. We conclude that the test suites used have the most significant impact on the GI results. Both random and GP-based approaches are able to find improved software, even though the percentage of viable software variants is significantly smaller in the random case (14.5% vs. 80.1%). We also report that GI produces MiniSAT variants up to twice as fast as the original on sets of previously unseen instances from the same application domain.
Optimising Optimisers with Push GP
This work uses Push GP to automatically design both local and population-based optimisers for continuous-valued problems. The optimisers are trained on a single function optimisation landscape, using random transformations to discourage overfitting. They are then tested for generality on larger versions of the same problem, and on other continuous-valued problems. In most cases, the optimisers generalise well to the larger problems. Surprisingly, some of them also generalise very well to previously unseen problems, outperforming existing general purpose optimisers such as CMA-ES. Analysis of the behaviour of the evolved optimisers indicates a range of interesting optimisation strategies that are not found within conventional optimisers, suggesting that this approach could be useful for discovering novel and effective forms of optimisation in an automated manner.
An Evolutionary View on Reversible Shift-invariant Transformations
Luca Mariot, Stjepan Picek, Domagoj Jakobovic and Alberto Leporati
We consider the problem of evolving a particular kind of shift-invariant transformation — namely, Reversible Cellular Automata (RCA) defined by conserved landscape rules — using GA and GP. To this end, we employ three different optimization strategies: a single-objective approach carried out with GA and GP where only the reversibility constraint of marker CA is considered, a multi-objective approach based on GP where both reversibility and the Hamming weight are taken into account, and a lexicographic approach where GP first optimizes only the reversibility property until a conserved landscape rule is obtained, and then maximizes the Hamming weight while retaining reversibility. The results are discussed in the context of three different research questions stemming from exhaustive search experiments on conserved landscape CA, which concern 1) the difficulty of the associated optimization problem for GA and GP, 2) the utility of conserved landscape CA in the domain of cryptography and reversible computing, and 3) the relationship between the reversibility property and the Hamming weight.
SGP-DT: Semantic Genetic Programming Based on Dynamic Targets
Stefano Ruberto, Valerio Terragni and Jason Moore
Semantic GP is a promising approach that introduces semantic awareness during genetic evolution. This paper presents a new approach called Semantic Genetic Programming based on Dynamic Target (SGP-DT) that divides the search problem into multiple GP runs. The evolution in each run is guided by a new (dynamic) target based on the residual errors. To obtain the final solution, SGP-DT combines the solutions of each run using linear scaling. SGP-DT proposes a new approach to exchange functionalities between individuals without relying on the classic formulations of crossovers. The synergy between linear scaling and mutation yields to final solutions with remarkable performances both in approximation errors and computational cost. We valuate SGP-DT on eight well known data sets and compare with e-lexicase, a state-of-the-art evolutionary technique. SGP-DT achieves small RMSE values, on average 23.19 % smaller than the one of e-lexicase.
Effect of Parent Selection Methods on Modularity
Anil Saini and Lee Spector
The effects of various genetic operators and parent selection algorithms on the performance of a genetic programming system on different problems have been well studied. In this paper, we analyze how different selection algorithms influence modularity in the population of evolving programs. In particular, we observe how the number of individuals with some form of modular structure changes over generations for various selection algorithms.
Time Control or Size Control? Reducing Complexity and Improving Accuracy of Genetic Programming Models
Aliyu Sani Sambo, R. Muhammad Atif Azad, Yevgeniya Kovalchuk, Vivek Padmanaabhan Indramohan and Hanifa Shah
Complexity of evolving models in genetic programming (GP) can impact both the quality of the models and the evolutionary search. While previous studies have proposed several notions of GP model complexity, the size of a GP model is by far the most researched measure of model complexity. However, previous studies have also shown that controlling the size does not automatically improve the accuracy of GP models, especially the accuracy on out of sample (test) data. Furthermore, size does not represent the functional composition of a model, which is often related to its accuracy on test data. In this study, we explore the evaluation time of GP models as a measure of their complexity; we define the evaluation time as the time taken to evaluate a model over some data. We demonstrate that the evaluation time reflects both a model’s size and its composition; also, we show how to measure the evaluation time reliably. To validate our proposal, we leverage four well-known methods to size-control but to control evaluation times instead of the tree sizes; we thus compare size-control with time-control. The results show that time-control with a nuanced notion of complexity produces more accurate models on 17 out of 20 problem scenarios. Even when the models have slightly greater times and sizes, time-control counterbalances via superior accuracy on both training and test data. The paper also argues that time-control can differentiate functional complexity even better in an identically-sized population. To facilitate this, the paper proposes Fixed Length Initialisation (FLI) that creates an identically-sized but functionally-diverse population. The results show that while FLI particularly suits time-control, it also generally improves the performance of size-control. Overall, the paper poses evaluation-time as a viable alternative to tree sizes to measure complexity in GP.
Challenges of Program Synthesis with Grammatical Evolution
Dominik Sobania and Franz Rothlauf
Program synthesis is an emerging research topic in the field of EC with the potential to improve real-world software development. Grammar-guided approaches like GE are suitable for program synthesis as they can express common programming languages with their required properties. This work uses common software metrics (lines of code, McCabe metric, size and depth of the abstract syntax tree) for an analysis of GE’s search behavior and the resulting problem structure. We find that GE is not able to solve program synthesis problems, where correct solutions have higher values of the McCabe metric (which means they require conditions or loops). Since small mutations of high-quality solu- tions strongly decrease a solutions fitness and make a high percentage of the solutions non-executable, the resulting problem constitutes a needle- in-a-haystack problem. To us, one of the major challenges of future GP research is to come up with better and more adequate fitness functions and problem specifications to turn the current needle-in-a-haystack prob- lems into problems that can be solved by guided search.
Detection of Frailty using Genetic Programming : The Case of Older People in Piedmont, Italy
Adane Tarekegn, Fulvio Ricceri, Giuseppe Costa, Elisa Ferracin and Mario Giacobini
Frailty appears to be the most problematic expression of elderly people. Frail older adults have a high risk of mortality, hospitalization, disability and other adverse outcomes, resulting in burden to individuals, their families, health care services and society. Early detection and screening would help to deliver preventive interventions and reduce the burden of frailty. For this purpose, several studies have been conducted to detect frailty that demonstrates its association with mortality and other health outcomes. Most of these studies have focused on the possible risk factors associated with frailty in the elderly population; however, eﬀorts to identify and predict groups of elderly people who are at increased risk of frailty is still challenging in clinical settings. In this paper, Genetic Programming (GP) is exploited to detect and deﬁne frailty based on the whole elderly population of the Piedmont, Italy, using administrative databases of clinical characteristics and socio-economic factors. Speciﬁcally, GP is designed to predict frailty according to the expected risk of mortality, urgent hospitalization, disability, fracture, and access to the emergency department. The performance of GP model is evaluated using sensitivity, speciﬁcity, and accuracy metrics by splitting the data into a training set and test set. We ﬁnd that GP shows competitive performance in predicting frailty compared to the traditional machine learning models. The study demonstrates that the proposed model might be used to screen future frail older adults using clinical, psychological and socio-economic variables, which are commonly collected in community healthcare institutions.
Is k Nearest Neighbours Regression Better than GP?
Leonardo Vanneschi, Mauro Castelli, Luca Manzoni, Sara Silva and Leonardo Trujillo
The original objective of this work was to define a new geometric semantic mutation for genetic programming, that was able to generate more manageable models than the traditional one. This paper tells the story of how, starting from this objective, we ended up with rather different findings. The new mutation replaces the random trees used by the traditional one with a vector of random numbers. To evaluate the individuals on unseen data, it uses the output on the k closest training observations, analogously to the k nearest neighbours. The results presented in the first part of the paper demonstrate that the new mutation outperforms the traditional one. In the second part, we show that the new mutation simply approximates the k nearest neighbours, returning comparable, but not better, results. This finding raises several questions, that arrive up to doubting the opportunity of using genetic programming itself. However, comparisons with state-of-the art machine learning techniques (random forests) and an analysis of relevant literature, allow us to provide a more nuanced perspective. The paper terminates leaving space for future research in genetic programming.
Investigating the Use of Geometric Semantic Operators in Vectorial Genetic Programming
Irene Azzali, Leonardo Vanneschi and Mario Giacobini
Vectorial Genetic Programming (VE_GP) is a new GP approach for panel data forecasting. Besides permitting the use of vectors as terminal symbols to represent time series and including aggregate functions to extract time series features, it introduces the possibility of evolving the window of aggregation. The local aggregation of data allows the identification of meaningful patterns overcoming the drawback of considering always the previous history of a series of data. In this work, we investigate the use of geometric semantic operators (GSOs) in VE_GP, comparing its performance with traditional GP with GSOs. Experiments are conducted on two real panel data forecasting problems, one allowing the aggregation on moving windows, one not. Results show that classical VE_GP is the best approach in both cases in terms of predictive accuracy, suggesting that GSOs are not able to evolve efficiently individuals when time series are involved. We discuss the possible reasons of this behaviour, to understand how we could design valuable GSOs for time series in the future.
Automatically Evolving Lookup Tables for Function Approximation
Oliver Krauss and William B. Langdon
Many functions, such as square root, are approximated and sped up with lookup tables containing pre-calculated values. We introduce an approach using genetic algorithms to evolve such lookup tables for any smooth function. It provides double precision and calculates most values to the closest bit, and outperforms reference implementations in most cases with competitive run-time performance.
Benchmarking manifold learning methods on a large collection of datasets
Patryk Orzechowski, Franciszek Magiera and Jason Moore
Manifold learning, a non-linear approach of dimensionality reduction, assumes that the dimensionality of multiple datasets is artificially high and a reduced number of dimensions is sufficient to maintain the information about the data. In this paper, a large scale comparison of manifold learning techniques is performed for the task of classification. We show the current standing of genetic programming (GP) for the task of classification by comparing the classification results of two GP-based manifold leaning methods: GP-Mal and ManiGP – an experimental manifold learning technique proposed in this paper. We show that GP-based methods can more effectively learn a manifold across a set of 155 different problems and deliver more separable embeddings than many established methods. We also ask questions about the fairness of benchmarking.
Ensemble Genetic Programming
Nuno M. Rodrigues, João E. Batista and Sara Silva
Ensemble learning is a powerful paradigm that has been used in the top state-of-the-art machine learning methods like Random Forests and XGBoost. Inspired by the success of such methods, we have developed a new Genetic Programming method called Ensemble GP. The evolutionary cycle of Ensemble GP follows the same steps as other Genetic Programming systems, but with differences in the population structure, fitness evaluation and genetic operators. We have tested this method on eight binary classification problems, achieving results significantly better than standard GP, with much smaller models. Although other methods like M3GP and XGBoost were the best overall, Ensemble GP was able to achieve exceptionally good generalization results on a particularly hard problem where none of the other methods was able to succeed.
Guided Subtree Selection for Genetic Operators in Genetic Programming for Dynamic Flexible Job Shop Scheduling
Fangfang Zhang, Yi Mei, Su Nguyen and Mengjie Zhang
The dynamic flexible job shop scheduling (DFJSS) problem has been widely concerned in both academia and industry. Both machine assignment and operation sequencing decisions need to be made simultaneously as an operation can be processed by a set of machines in DFJSS. Scheduling heuristic has become an effective way to solve the DFJSS problem due to its efficiency and simplicity. Genetic programming (GP) has been successfully applied to evolve scheduling heuristics for job shop scheduling automatically. Crossover and mutation are two basic genetic operators to generate offsprings in GP. However, the subtrees of the selected parents are randomly chosen in traditional GP, which may not be sufficiently effective, especially in a huge search space. In this paper, we propose new strategies to guide the subtree selection rather than pick them randomly. To be specific, the occurrences of features are used to measure the importance of each subtree of the selected parents. The probability that a subtree is selected will be set according to its importance and the type of genetic operators. The results show that the proposed GP algorithm with the guided subtree selection for crossover can converge faster and achieve significantly better performance than its counterpart in half of the scenarios while no worse in all other scenarios without increasing the computational time.
Classification of Autism Genes using Network Science and Linear Genetic Programming
Yu Zhang, Yuanzhu Chen and Ting Hu
Understanding the genetic background of complex diseases and disorders plays an essential role in the promising precision medicine. Deciphering what genes are associated with a specific disease/disorder helps better diagnose and treat it, and may even prevent it if predicted accurately and acted on effectively at early stages. The evaluation of candidate disease-associated genes, however, requires time-consuming and expensive experiments given the large number of possibilities. Due to such challenges, computational methods have seen increasing applications in predicting gene-disease associations. Given the intertwined relationships of molecules in human cells, genes and their products can be considered to form a complex molecular interaction network. Such a network can be used to find candidate genes that share similar network properties with known disease-associated genes. In this research, we investigate autism spectrum disorders and propose a linear genetic programming algorithm for autism-gene prediction using a human molecular interaction network and known autism-genes for training. We select an initial set of network properties as features and our LGP algorithm is able to find the most relevant features while evolving accurate predictive models. Our research demonstrates the powerful and flexible learning abilities of GP on tackling a significant biomedical problem, and is expected to inspire further exploration of wide GP applications.