LearningBased Performance Specialization of Configurable Systems
A large scale configurable system typically offers thousands of options or parameters to let the engineers customize it for specific needs. Among the resulting many billions possible configurations, relating option and parameter values to desired performance is then a daunting task relying on a deep know how of the internals of the configurable system. We propose a staged configuration process to narrow the space of possible configurations to a good approximation of those satisfying the wanted high level customer requirements. Based on an oracle (e.g. a runtime test) that tells us whether a given configuration meets the nonfunctional requirements (e.g. speed or memory footprint), we leverage machine learning to retrofit the acquired knowledge into a variability model of the system that can be used to automatically specialize the configurable system.
SPLC’16: In the repository https://github.com/learningconstraints/SPLC16, you will find all the required artefacts to replicate the experiments included in our paper submitted to the 20th International Systems and Software Product Line Conference. The goal was mostly to avoid faulty configurations and correct a video generator developed in the industry. The oracle was simply a yes/no procedure.
In RR’17, we have considered performance qualities. The oracle is parametrized with an objective value that users of our specialization method can control. Hence our proposal can be seen as a generalization of the idea originally proposed in SPLC’16. Moreover our empirical results cover more systems and classification metrics while taking training sizes and objective values into account https://github.com/learningconstraints/ICSE17.
Objective of this page: Our RR’17 results show that, for many different kinds of objectives and performance qualities, the approach has interesting accuracy, precision and recall after a learning stage based on a relative small number of random samples. This page describes how we gather and process input data, how we compute classification metrics. Furthermore, all plots, heatmaps, and figures can be visualized. We aim here to supplement our original material (it is impossible to include all relevant figures we have generated!) and also support an independent verification and replication of our study. We also aim to provide a practical reading grid of classification metrics for users of our method.
A research report is also freely available online https://hal.archivesouvertes.fr/hal01467299: ` +Paul Temple, Mathieu Acher, JeanMarc A Jézéquel, Léo A NoelBaron, and José A Galindo. “LearningBased Performance Specialization of Configurable Systems” (2017), Research report, University of Rennes 1 (IRISA/Inria). ` ## Empirical Study ## Empirical Study
Empirical Study
Subject systems
We performed our experiments on a publiclyavailable dataset used in previous works. The dataset includes sets of configurations of realworld configurable systems that have been executed and measured over different performance metrics (more details hereafter). Configurable systems come from various domains and are implemented in different languages (C, C++ and Java). The number of options varies from as little as ten, up to about sixty; options can take Boolean and numerical values. The two last columns show, first, the total number of different possible configurations existing for the system and, last, the number of configurations that have been generated and measured in previous works (“All” meaning all different configurations have been measured). It should be noted that most benchmarks measure a single performance value, except for x264 where benchmarks evaluate 7 performance metrics.
Experimental setup
For each subject system, we have randomly selected a certain number of configurations from the ones available in the public dataset: our ML method operates over a training sample to learn a separating function. All remaining configurations form the test sample, i.e., it is the whole configuration set minus the training set. They are used to verify whether the classification of our ML is correct. For instance, the public dataset contains 69,000 different configurations for x264. If we used 500 of them to learn constraints, then we evaluate the classification quality of ML on the remaining 68,500 configurations.
For each configurable system, we make vary the sample size one by one, from 1 to the total number of configurations. We make vary the objective value between the lower and upper bounds of the performance measured in each dataset. We notice that threshold values have direct influence on the percentage of acceptable configurations. For instance, for x264, an execution time (speed) threshold less than 100s leads to only 21.4% of acceptable configurations. We thus vary the objective value from 0% to 100% of acceptable configuration with steps of 5%.
Classification measurements
We consider that configurations are acceptable (resp. nonacceptable) w.r.t a performance objective.
Our learningbased technique can introduce errors in the following cases:
 a configuration is valid (and thus classified as acceptable) despite being nonacceptable â€“ the system is underconstrained and unsafe;
 a configuration is not valid (and thus classified as nonacceptable) despite being acceptable (the system is overconstrained and lacks flexibility).
We expect to observe such cases:
 a configuration is valid and correctly classified as acceptable;
 a configuration is nonvalid and correctly classified as nonacceptable.
In our case, positive class (also called class1) corresponds to nonacceptable configurations that we want to discard. The negative class (also called class0) corresponds to acceptable configurations that we want to keep. There are four possible situations represented in a confusion matrix (see Figure):
 True Positives (TP) and True Negatives (TN) are on the main diagonal and represent the set of correctly classified configurations;
 False Positive (FP) and False Negatives (FN), on the contrary, represent classification errors
For obtaining TP, TN, FP, and FN, we confront the classifications made by ML in the training set with the oracle decisions available in the testing set. Based on a confusion matrix, we can compute several classification metrics.
We rely on standard metrics used for binary classification problems. We used 5 metrics since they give a different and complementary perspective on the “quality” of our learningbased method.
Classification metrics
 Accuracy measures the portion of configurations accurately classified as acceptable and nonacceptable. It gives a global classification “performance” of the learning phase.
However accuracy alone can have limited interest since it hinders important phenomena.
In the example (see Figure), you can have a high accuracy despite the clear limits of the learning phase of identifying nonacceptable configurations (precision is 29% and recall is 40%).
So why is there a high accuracy and a low precision and recall?
Because the class of “acceptable configurations” is much more represented (800+100=900 out of 1000). It is too easy to have a high accuracy with basically no specialization.
On the one hand, accuracy gives a global classification measurement of our method.
On the other hand, accuracy is unable to show the weaknesses of a naÃ¯ve classification strategy (e.g., “keeping everything as acceptable”)
Intuitively, in our specific problem, we also need to measure the precision and recall of our classification. It can give better insights on our ability to (1) discard nonacceptable configurations and (2) to keep acceptable configurations.

Precision: measure our ability to correctly classify nonacceptable configurations. In our example, precision is 40 / (100 + 40) = 40 / 140 = 29%. The learning wrongly classifies a large portion of nonacceptable configurations.

Recall (also called true positive rate): measure our ability to comprehensively identify nonacceptable configurations. In our example, recall is 40 / (40 + 60) = 40%. Here, the learning misses a large portion of nonacceptable configurations.
A low recall may be problematic, since numerous nonacceptable configurations have not been classified as such: Users may still choose unsafe configurations. A low precision may be problematic as well since numerous nonacceptable configurations are actually acceptable: Users may lose some flexibility.
Precision and recall measure the ability of our system to classify nonacceptable configurations. However it says nothing on how safe and flexible is our resulting specialized configurable system:
 How many valid configurations of our resulting specialized configurable system are truly acceptable? A low proportion and users will perform numerous configuration errors since the system is not enough safe.
 How many configurations have been unnecessarily removed whereas they should still be reachable? A low proportion and users will have little configuration choices at their disposal.
We thus need two additional metrics. Specificity (also called true negative rate) quantifies how flexible is the specialized system. In the example, (800 / 800 + 100) = 89% of configurations will be reachable in the new specialized system.
NPV (also called negative predictive value) quantifies how safe is the specialized system. In In the example, (800 / 800 + 60) = 93% of configurations will be safe in the new specialized system.
How does a learningbased method compete with a nonlearning based method?
The example shows that, despite a low precision (29%) and recall (40%), the resulting specialized configurable system is quite safe (93%) and flexible (89%). Our learning phase misses some opportunities and introduces some errors, but the overall process may still be of interest since we have a safer system than originally and the flexibility is arguably reasonable.
Without a specialization, all configurations are classified as acceptable, and the metrics are as follows in our specific example:
 accuracy is 90%
 precision is ~0
 recall is 0
 specificity is always 100% (the system is highly flexible)
 NPV is 90% (the system is quite safe), but it is less than 93%
It shows an improvement of 3% for the safety of the system. Obviously, it is just an example. The potential of our technique mostly resides in cases the performance objective leads to numerous nonacceptable configurations: We can learn and identify them, leading to safer configurable systems.
In general, a nonlearning based method has always a precision is equal to 0, recall is equal to 0, specificity is equal to 100%. Hence we cannot beat a nonlearning in terms of flexibility, since it keeps everything as acceptable.
However the safety of the resulting system may be very low. It is equal to: NPV_nonlearning = TN / (TN + TP) = number of acceptable configurations / number of configurations in total. We can thus compute a new metric, called negNPV, that is equal to NPV  NPV_nonlearning. The higher negNPV, the safer is our specialized system comparatively to a nonspecialized system. Our empirical results confirm that we are safer in the vast majority of cases, even for a small training set.
Similarly, we can compute negAccuracy (difference of accuracy between a nonspecialized method and our learningbased method).
Reading grid of classification metrics
To sum up:
 there are three classification metrics that measure our ability to classify (accuracy, precision, recall).
 accuracy gives a global classification “performance” of the learning phase
 accuracy alone has limitations since it does not quantify the classification of nonacceptable configurations; precision and recall should be used in complement
 high accuracy does not imply high precision or high recall: it is a theoretical statement (see formulae) and empirical results confirm there is no correlation between accuracy, precision and recall
 NPV and specificity characterize how safe and flexible is the resulting specialized system (precision and recall cannot)
 a high (resp. low) specifity means a high (resp. low) flexibility of the resulting system
 a high (resp. low) NPV means a high (resp. low) safety of the resulting system
 a low precision or low recall does not imply a low NPV or specificity. For instance, the classification metric can be low and, yet, leads to a highly safe and flexible system
A “specializer” (i.e., user of our specialization method) can employ:
 precision and recall for measuring the quality of the classification and perhaps identifying some “missing” opportunities. Overall it can drive the learning strategy: for example, with more (nonacceptable) configurations in the training set, precision and recall can perhaps be improved.
 NPV and specificity for measuring the safety and flexibility of the resulting system. In practice, there is (sometimes) a tradeoff to find between flexibility and safety, depending on the specificities of the configurable system and its usage. Let us take two examples. On the one hand, critical configurable systems have to be highly safe: Leaving a large portion of nonacceptable (and thus unsafe) configurations is not suited. On the other hand, there are configurable systems for which flexibility is the key since it attracts lots of customers. The extent to which safety and flexibility can be degraded is the decision of a “specializer”: NPV and specificity help in such a task.
An important point is that some performance objectives (w.r.t. execution time, energy consumption, etc.) can lead to a high (resp. low) percentage of nonacceptable configurations. It can complicate the learning phase and thus both influence accuracy, precision, and recall. The definition of performance objectives can also influence NPV and specificity. For example, it is hard to be completely safe when numerous nonacceptable configurations have to be removed. You can miss a small portion of them and, yet, it is a lot of unsafe configurations that remain within the specialized system. Another factor is the size of the training set. It may influence both accuracy, precision, recall, NPV, and specificity.
With our empirical results, we aim to further understand the effectiveness of our learningbased method w.r.t any performance objective and training set. For each configurable system, performance objective (defining the percentage of acceptable configurations in the population), and size of the training set, we have computed accuracy, precision, recall, NPV, specificity, negNPV and negAccuracy.
Reproducibility
For the sake of verifiability and reproducibility of the experiments we provide:
 All heatmaps, plots, and figures. We can only present a few figures in a typical academic paper (because of the page number limit) and we aim to supplement it.
 A git repository containing all data and the scripts used to: i) execute the procedures over different configurable systems; ii) generate all the different graphs, heatmaps, etc.
 Details on how we have collected and prepared data
Details of the content
The repository is available here: https://github.com/learningconstraints/ICSE17
Heatmaps
We have generated numerous heatmaps for each configurable system.
Xaxis: size of the training set Yaxis: percentage of acceptable configurations (as influenced by the objective treshold value) Classification metric: the darker, the higher is the classification metric (usually from 0 to 1) It includes accuracy, precision, recall, NPV, specificity, negNPV and negAccuracy. It is a total of 7 heatmaps per system, with variations of training sets and performance objectives.
The tables (CSV files) out of which we have produced heatmaps are also available.
Graphs/tables
In addition we have fixed the number of acceptable configurations (in the population and as defined by an objective value) to 20%, 50%, and 80% and we have ploted the evolution of classification metric (accuracy, precision, recall, etc.) with regards to the size of the training set.
 50% means that there are as nonacceptable configurations as there are acceptable ones.
 20% means that 80% of configurations are nonacceptable: It is a challenge to discard all of them.
 80% means that only 20% are nonacceptable: It can seen as challenging to learn from few nonacceptable configurations
Such graphs help to visualize the effect of the size of the training set. The general trend is that the more configurations we have in the training set, more effective our technique is. https://github.com/learningconstraints/ICSE17/tree/master/2.metric_view/results
Code
 Learning phase
 Computation of classification scores (see https://github.com/learningconstraints/ICSE17/tree/master/2.metric_view and https://github.com/learningconstraints/ICSE17/tree/master/3.synthetic_view)
 Depicting heatmaps or plots (https://github.com/learningconstraints/ICSE17/tree/master/1.heatmaps)
Input data
Performance measurements of subject systems. It is available here: https://github.com/learningconstraints/ICSE17/tree/master/datasets