Machine Learning in Software Testing

Early 2019 Edition

Introduction

Software testing has not really changed all that much in the past decades. Machine learning, on the other hand, is a rapidly evolving technology that is being adopted all over the place. So what can it bring to software testing?

Back in 2018 (so about a year ago) I did a review of machine learning (ML) and deep learning (DL) applications in Network Analysis and Software Testing. After spending some more time learning and trying ML/DL in practice, this is an update on the ML for testing part, reflecting on my own learnings and some developments over this past year. Another interesting part would be testing ML systems. I will get to that in another post.

In last year's review, I focused on several topics. A recent academic study (Durelli2019) in this area also lists a number of topics. This includes topics such as "learning test oracles", which basically translates to learning a model of a system's behaviour based on some observations or other data about that behaviour. Last year I included this under the name of specification mining. In practice, I have found such learned behavioral models to be of limited use, and have not seen general uptake anywhere in practice. In this review I focus on fewer topics that I find more convincing for practical use.

I illustrate these techniques with this nice pic I made:

Topics

In this pic, the "Magic ML Oracle" is just an ML model, or a system using such a model. It learns from a set of inputs during the training phase. In the figure above this could be some bug reports linked to components (file handling, user interface, network, …). In the prediction phase it runs as a classifier, predicting something such as which component a reported issue should be assigned to, how fault-prone an analysis is (e.g., how to focus testing), or how tests and specs are linked (in case of missing links).

The topics I cover mainly relate to using machine learning to analyze various test related artefacts as in the figure above. One example of this is the bug report classifier I built previously. Since most of these ML techniques are quite general, just applied to software testing, ideas from broader ML applications could be useful here as well.

Specifically, software testing is not necessarily that different from other software engineering activities. For example, Microsoft performed an extensive study (Kim2017) on their data scientists and their work in software engineering teams. This work includes bug and performance analysis and prioritization, as well as customer feedback analysis, and various other quality (assurance) related topics.

As an example of a concrete ML application to broader SW engineering, (Gu2018) maps natural language queries to source code to enable code search. To train a DL model for this, one recurrent neural network (RNN) based model is built for the code description (from source comments), and another one for the source code. The output of each of these two is a numerical feature vector. Cosine similarity is a measure used to compare how far apart two such vectors are, and here it is used as the training loss function. This is a nice trick to train a model to map source code constructs to natural language "constructs", enabling mapping short queries to new code in similar ways. It is also nicely described on The Morning Paper blog. I see no reason why such queries and mappings would not work for test related searches and code/documents as well. In fact, many of the techniques I list in the following sections use quite similar approaches.
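To make the cosine similarity part concrete, here is a minimal sketch in plain Python. The vectors are made-up stand-ins for the outputs of the two RNN branches, not values from (Gu2018), and the function is the textbook definition rather than anything specific to that work.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical feature vectors: one for a query, two for code snippets.
query_vec = [0.2, 0.7, 0.1]
code_vec_close = [0.4, 1.4, 0.2]   # same direction as the query, different length
code_vec_far = [0.9, -0.1, 0.3]

print(cosine_similarity(query_vec, code_vec_close))
print(cosine_similarity(query_vec, code_vec_far))
```

Note that only the direction of the vectors matters, not their length, which is why this measure is popular for comparing embeddings.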

Like I said, I am focusing on a smaller set of more practical topics than last year in this still-overly-long post. The overall idea of how to apply these types of techniques in testing in general should not be too different though. This time, I categorize the topics into test prioritization, bug report localization, defect prediction, and traceability analysis. One section to go over each, one to rule over them all.

Test Prioritization

As (software) organizations and projects grow over time, their codebase tends to grow with them. This would lead to also having more tests to cover that codebase (hopefully…). Running all possible test cases all the time in such a scenario is not always possible or cost-efficient, as it starts to take more and more time and resources. The approach to selecting a subset of the tests to execute has different names: test prioritization, test suite optimization, test minimization, …

The goal is to cover as much of the fault-prone areas as possible with fewer tests, as in this completely made-up image of mine to illustrate the topic:

Prioritized coverage

Consider the coverage % in the above to reflect, for example, covering changes since the last time the tests were run. The aim is to verify, as quickly and efficiently as possible, that the changes did not break anything.

An industrial test prioritization system used at Google (in 2017) is described in (Memon2017). This does not directly discuss using machine learning (although it mentions it as a future plan for the data). However, I find it interesting for general data analysis of testing related data as a basis for test prioritization. It also works to provide a basis for a set of features for ML algorithms, as understanding and tuning your data is really the basis for all ML applications.

The goal in this (Memon2017) case is two-fold: better utilizing test resources (focus on potentially failing tests) and providing feedback to the developers about their commits. The aim is not 100% accurate predictions but rather focusing automated test execution and providing the developer with feedback such as "this commit is 95% more likely to cause breakage due to the code being touched by 5 developers in the past 10 days, and being written in Java". The developers can use this feedback to seek further assurance in additional reviews, more testing, static analysis, and so on.

Some of the interesting features/findings described in (Memon2017):

  • Only about 1% of the tests ever failed during their lifetime.
  • Thus about 99% speedup would be possible if the right tests could be identified.
  • Use of dependency distance as a feature: What other components depend on the changed component, and through how many other components.
  • Test targets further away from the change are much less likely to fail. So dependency distance seems like a useful prediction (feature) metric. They used a threshold of 10 for their codebase, which might vary by project but the idea likely holds.
  • Files/targets modified more often are more likely to cause breakages.
  • File type affects likelihood of breakage.
  • User/tool id affects likelihood of breakage.
  • Number of developers having worked on a single file in a short time affects likelihood of breakage. More developers means higher likelihood to break.
  • The number of test targets affected by a change varies greatly, maybe requiring different treatment.
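The dependency distance feature from the list above can be sketched as a breadth-first search over a dependency graph. This is my own illustration, not code from (Memon2017): the graph, the target names, and the small threshold are all hypothetical.

```python
from collections import deque

def dependency_distances(reverse_deps, changed):
    """Shortest number of dependency hops from the changed component
    to everything that (transitively) depends on it."""
    distances = {changed: 0}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for dependent in reverse_deps.get(node, []):
            if dependent not in distances:
                distances[dependent] = distances[node] + 1
                queue.append(dependent)
    return distances

def select_test_targets(reverse_deps, changed, test_targets, max_distance):
    """Keep only the test targets within max_distance hops of the change."""
    dist = dependency_distances(reverse_deps, changed)
    return sorted(t for t in test_targets if t in dist and dist[t] <= max_distance)

# Hypothetical graph: key -> components that directly depend on it.
reverse_deps = {
    "io_lib": ["parser", "io_test"],
    "parser": ["app", "parser_test"],
    "app": ["app_test"],
}
test_targets = {"io_test", "parser_test", "app_test"}

print(select_test_targets(reverse_deps, "io_lib", test_targets, max_distance=2))
```

With a threshold of 2 this selects io_test and parser_test and skips app_test, which sits three hops away from the change. In an ML setting the distance would more likely be one input feature among others than a hard cut-off.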

A similar set of features is presented in (Bhagwan2018):

  • Developer experience: Developer time in the organization and project
  • Code ownership: More developers changing files/components cause more bugs
  • Code hotspots: Specific parts of code that cause issues when changed
  • Commit complexity: Number of changes, changed files, review comments in a single commit. More equals more bugs.

A test prioritization approach taken at Salesforce.com is described in (Busjaeger2016). They use five types of features:

  • code coverage
  • text path similarity
  • text content similarity
  • test failure history
  • test age

In (Busjaeger2016), the similarity scores are based on TF-IDF scores and their cosine similarity calculation. TF-IDF simply weights the frequency of words in a document against the frequency of the same word in all other documents, to identify the most specific terms for document types. The features are fed into a support-vector model to rank tests to execute first. In their (Busjaeger2016) experience, about 3% of all tests are required to reach about 75% coverage. From the predicted tests, about every 5th is found to be causing a failure.
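As a rough illustration of the TF-IDF plus cosine similarity idea, here is a small pure-Python sketch. It uses one common TF-IDF variant (term frequency times log(N / document frequency)), which is my assumption and not necessarily the exact weighting used in the paper. The texts are invented.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical texts: a change description and two test descriptions.
docs = [
    "fix login session timeout handling",
    "test login session expiry",
    "test invoice pdf export",
]
change_vec, test1_vec, test2_vec = tf_idf_vectors(docs)
print(cosine(change_vec, test1_vec), cosine(change_vec, test2_vec))
```

The first test shares the terms "login" and "session" with the change and gets a positive score, while the second shares nothing and scores zero. Such a score would then be one feature fed to the ranking model.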

I find the above to provide good examples of data analysis, and a basis for defining ML features.

In the (Durelli2019) review, several studies are listed under test prioritization, but these mostly do not strike me as very realistic examples of ML applications. However, one interesting approach is (Spieker2017), which investigates using reinforcement learning for test prioritization. It uses only three features: execution time (duration), last execution (whatever that means..), and failure history. This seems a rather simple set of features to build a complex model on, and it seems likely to me that a simple model would also work here. The results in (Spieker2017) are presented as good but not investigated in depth, so it is hard to say from just that. However, I did find the approach to present some interesting ideas in relation to this:

  • Continuous integration systems constantly execute the test suites, so you will have a lot of constantly updating data available about test suites, executions, and results
  • Continuously updating the model over time based on the last N test runs
  • Using a higher exploration rate over the full suite to bootstrap the model, lowering it over time once the model has learned, but not setting it to zero
  • Using "test case" as model state, and assigning it a priority as an action
  • Listing real "open-data" industry-based datasets to evaluate prioritization ML models on

I would be interested to see how well a simple model, such as Naive Bayes, weighting the previous pass/fails and some pattern over their probability would work. But from the paper it is hard to tell. In any case, the points above would be interesting to explore further.
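To give an idea of what I mean by a simple model, here is a sketch of a history-based baseline. It is neither Naive Bayes nor the reinforcement learning approach of (Spieker2017), just a recency-weighted failure count that I made up for illustration. The decay value and the test histories are hypothetical.

```python
def failure_score(history, decay=0.5):
    """Recency-weighted failure count.

    history: list of booleans, oldest run first, True meaning the test failed.
    The most recent run has weight 1, the one before it weight `decay`, and so on.
    """
    score = 0.0
    weight = 1.0
    for failed in reversed(history):
        if failed:
            score += weight
        weight *= decay
    return score

def prioritize(histories, decay=0.5):
    """Return test names ordered from highest to lowest failure score."""
    return sorted(histories,
                  key=lambda name: failure_score(histories[name], decay),
                  reverse=True)

# Hypothetical results of the last three runs for four tests.
histories = {
    "test_a": [False, False, False],
    "test_b": [True, False, False],   # failed long ago
    "test_c": [False, False, True],   # failed in the latest run
    "test_d": [True, True, False],
}
print(prioritize(histories))
```

Here the most recent failure (test_c) ranks first and the never-failing test_a last. Comparing a learned model against this kind of baseline would show how much the learning actually adds.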

A Thought (maybe Two)

I assume ML has been applied to test prioritization, just not so much documented. For example, I expect Google would have taken their studies further and used ML for this as discussed in their report (Memon2017). Test prioritization seems like a suitably complex problem, with lots of easily accessible data, and with a clear payoff in sight, to apply ML. The more tests you have, the more you need to execute, the more data you get, the more potential benefit.

In this, as in many advanced research topics, I guess the "killer app" might come by integrating all this into a test system / product as a black-box. This would enable everyone to make use of it without having to learn all the "ML in test" details outside their core business. The same, I guess, applies to the other topics I cover in the following sections.

Bug Report Localization

Bug report localization (in this case anyway) involves taking a bug report and finding the component or other part of the software that the report is most likely to concern. Various approaches aim to automate this process by using machine learning algorithms. My earlier bug report classifier is one example of building one.

I made a pretty picture to illustrate this:

Localization Oracle

Typically a bug report is expressed in natural language (at least partially, with code snippets embedded). These are fed to the machine learning classifier (magic oracle in the pic above), which assigns it to 1-N potential components. Component granularity and other details may vary but this is the general idea.

For this, code structural elements used as features include (Tufano2018):

  • sequences of abstract syntax tree (AST) nodes
  • sequences of control-flow graph (CFG) nodes
  • bytecode representations. This seems interesting in mapping the code to fewer shared elements (opcodes)

Other features include (Lam2017):

  • camel-case splitting source code (n-grams would seem a natural fit too)
  • time since a file was previously changed when fixing a bug
  • how many bugs overall have been fixed in a file
  • similarity between a bug report and previous bug reports (and what were they assigned to)

Besides using such specific code structures as inputs, specific pre-processing steps are also taken. These include (Tufano2018, Li2018):

  • replacing constant values with their types,
  • splitting camel-case,
  • removing low-level detailed abstract syntax tree (AST) nodes,
  • filtering out methods less than 10 lines long,
  • regular expressions to remove code format characters, and to identify code snippets embedded into the bug report.
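Two of the pre-processing steps above, camel-case splitting and replacing constants with their types, can be sketched with regular expressions. The patterns and the placeholder tokens here are my own simplifications, not the ones used in (Tufano2018) or (Li2018).

```python
import re

# Word boundaries inside an identifier: "fooBar" -> foo|Bar, "HTTPResponse" -> HTTP|Response
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")

def split_camel_case(identifier):
    """Split a camel-case identifier into lower-case word tokens."""
    return _CAMEL_BOUNDARY.sub(" ", identifier).lower().split()

def replace_literals(code):
    """Replace string and numeric literals with placeholder type tokens.

    Simplified on purpose: double-quoted strings without escapes, and plain
    integer or decimal numbers only.
    """
    code = re.sub(r'"[^"\n]*"', "<str>", code)
    code = re.sub(r"\b\d+(\.\d+)?\b", "<num>", code)
    return code

print(split_camel_case("parseHTTPResponse"))
print(replace_literals('retry(3, "timeout")'))
```

The point of both steps is to shrink the vocabulary, so that readFile and fileReader share tokens and two calls that differ only in a constant look the same to the model.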

An industrial case study on bug localization from Ericsson is presented in (Johnson2016). Topic models built with Latent Dirichlet Allocation (LDA) are learned from the set of bug reports. These are used to assign topic weights to bug reports based on the bug report text. The assigned weights are compared to the learned topic distribution for components, and the higher the match of distributions in the report vs learned component model, the higher the probability to assign the bug report to that component.

Vector Space Model (VSM) was used as a baseline comparison in many cases I found. This is based on TF-IDF scores (vectors) calculated for a document. Similarity between a bug report and source code files in VSM is calculated as a cosine similarity between their TF-IDF vectors. Revised Vector Space Model (rVSM) (Zhou2012) is a refinement of VSM that weights larger documents more, reasoning that bugs are more often found in larger source files. (Zhou2012) also adds weighting from similarity with previous bug reports.

Building on rVSM, (Lam2017) uses an auto-encoder neural network on TF-IDF weighted document terms to map different terms with similar meaning together for more accurate bug localization. Similarly, the "DeepSum" work (Li2018) uses an auto-encoder to summarize bug reports, and to compare their TF-IDF distance with cosine similarity. To me this use of auto-encoders seems like trying to re-invent word-embeddings for no obvious gain, but I am probably missing something. After auto-encoding, (Lam2017) combines a set of features using a deep neural network (a multi-layer perceptron (MLP), it seems) for final probability evaluation. In any case, word-embedding style mapping of words together in a smaller dimension is found in these works as in others.

A Thought or Two

I am a bit surprised not to see much work in applying RNN type networks such as LSTM and GRU to these topics, since they are a great fit for processing textual documents. In my experience they are also quite powerful compared to traditional machine learning methods.

I think this type of bug report localization has practical relevance mainly for big companies with large product teams and customer bases, and complex processes (support levels, etc). This is in domains like telcos, from which comes the only clear industry report I listed here, the one from Ericsson (Johnson2016). Something I have found limiting these types of applications in practice is also the need for cross-domain vision to combine these topics and expertise. People often seem quite narrowly focused on specific areas. Black-box integration with common tools might help, again.

Defect Prediction

Software defect prediction refers to predicting which parts of the software are most likely to contain faults. Sometimes this is also referred to as fault proneness analysis. The aim is to provide additional information to help focus testing efforts. This is actually very similar to the bug report localization I discussed above, but with the goal of predicting where currently unknown bugs might appear (vs localizing existing issue reports).

An overall review of this area is presented in (Malhotra2015), showing an extensive use of traditional ML approaches (random forest, decision trees, support vector machines, etc.) and traditional source code metrics (lines of code, cyclomatic complexity, etc.) as features. These studies show reasonably good accuracies, ranging from 75% to 93%.

However, another broad review on these approaches and their effectiveness is presented in (Zhou2018). It shows how simply using larger module size to predict higher fault proneness would give equal or better accuracy in many cases. This matches my experience from many contexts: keeping it simple is often very effective. But on the other hand, finding that simplicity can be the real challenge, and you can learn a lot by trying different approaches.

More recently, deep learning based approaches have also been applied to this domain. Deep Belief Nets (DBN) are applied in (Wang2018) to generate features from source code AST, and combined with more traditional source code metrics. The presentation on DBNs in (Wang2018) is a bit unclear to me, but it seems very similar to an MLP. The output of this layer is then termed (as far as I understand) a "semantic feature vector". I looked a bit into the difference of DBN vs MLP, and found some practical discussion in a Keras issue. Make what you will of it. Do let me know if you figure it out better than I did (what is the difference in using an MLP style fully connected dense layer here vs DBN).

An earlier version of the (Wang2018) work is refined and further explored using convolutional neural networks (CNNs) in (Li2017). In this case, a word2vec word-embedding layer is used as the first layer, and trained on the source and AST vocabulary. This is fed into a 1-dimensional CNN, which is one of the popular deep learning network types for text processing. After passing through this part of the network, the output feature vector is merged with a set of the more traditional source metrics (lines of code, etc). The merged vector is passed through the final network layers, ending in a single-node output layer that gives the probability prediction.

Illustration of this network:

Metrics based model

To address class imbalance (more "clean" than "buggy" files), (Li2017) uses duplication of the minority class instances. They also compare to traditional metrics as well as the DBN from (Wang2018) and DBN+, which combines the traditional features with the DBN "semantic" features. As usual for research papers, they report getting better results with the CNN+ version. Everyone seems to do that, yet perfection seems never to be achieved, or even nearly. Strange.
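Duplicating minority class instances is usually called random oversampling. Here is a minimal sketch of it, under my own assumptions: the file names and labels are invented, and I do not know the exact duplication scheme used in (Li2017).

```python
import random

def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen instances of the smaller classes until
    every class has as many instances as the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((sample, label) for sample in items + extra)
    rng.shuffle(balanced)
    return balanced

# Hypothetical data: four clean files and one buggy file.
files = ["a.java", "b.java", "c.java", "d.java", "e.java"]
labels = ["clean", "clean", "clean", "clean", "buggy"]

balanced = oversample_minority(files, labels)
print(balanced)
```

One thing to keep in mind is to apply this only to the training split. Duplicating before splitting would leak copies of the same file into the test set and inflate the results.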

A Thought

The evolution in defect prediction seems to be from traditional classifiers with traditional "hand-crafted" (source metrics) features to deep-learning generated and AST-based features in combination with traditional metrics. Again, I did not see any RNN based deep-learning classifier applications, although I would expect they should be quite well suited for this type of analysis. Perhaps next time.

Traceability Analysis

Despite everyone being all Agile now, heavier processes such as requirements traceability can still be needed. Especially for complex enough systems, and ones with heavy regulatory- or standards-based compliance requirements. Medical, telco, automotive, … In the real world, such traces may not always be documented, and sometimes it is of interest to find them.

A line of work exploring the use of deep learning for automating the generation of traceability links between software artefacts is in (Guo2017, Rath2018). These are from the same major software engineering conference (ICSE) in two consecutive years, with some author overlap. So there is some traceability in this work too, heh-heh (such joke, much non-academic). The first one (Guo2017) aims to link requirements to design and test artefacts in the train control domain. The second one (Rath2018) aims to link code submissions to issues/bug reports on Github.

Requirements documents

Using recurrent neural networks (RNN) to link requirements documents to other documents is investigated in (Guo2017). I covered this work to some extent already last year, but let's see if I can add something with what I have learned since.

Use cases for this as mentioned in (Guo2017):

  • Finding new, missing (undocumented) links between artefacts.
  • Train on a set of existing data for existing projects, apply to find links within a new project. This seems like a form of transfer learning, and is not explored in the paper. It focuses on the first bullet.

I find the approach used in (Guo2017) interesting, linking together two recurrent neural network (RNN) layers from parallel input branches for natural language processing (NLP):

Requirements linking NN

There are two identical input branches (top of figure above). One for the requirements documents, and one for the target document for which the link is assessed. Let’s pretend the target is a test document to stay relevant. A pair of documents is fed to different input branches of the network, and the network outputs a probability of these two documents being linked.

In ML you typically try different model configurations and hyperparameters to find what works best. In (Guo2017) they tried different types of layers and parameters. The figure above shows what they found best for their task. See (Guo2017) for the experiment details for other parameter values. Here, a bi-directional gated recurrent unit (bi-GRU) layer is used to process each document into a feature vector.

When the requirements document and the target document have been transformed by this to a vector representation, they are fed into a pointwise multiplication layer (mul) and to a vector subtraction layer (sub). In Keras this would seem to correspond to the Multiply and Subtract merge layers. These merge layers are intended to describe the vector difference direction (mul) and distance (sub) across dimensions. A dense layer with sigmoid activation is used to integrate these two merges, and the final output is given by a 2-neuron softmax layer (linked/not linked probability).
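To show what the two merge operations compute, here is a tiny plain-Python sketch. The two vectors stand in for the bi-GRU outputs and are made up. I use a plain elementwise difference for "sub"; taking the absolute value would be another plausible reading of "distance", so treat that choice as my assumption.

```python
def merge_features(req_vec, target_vec):
    """Combine two document vectors the way the mul and sub layers would,
    then concatenate the results as input for the following dense layer."""
    mul = [r * t for r, t in zip(req_vec, target_vec)]   # pointwise multiplication
    sub = [r - t for r, t in zip(req_vec, target_vec)]   # pointwise subtraction
    return mul + sub

# Hypothetical 3-dimensional document vectors.
requirement_vec = [0.5, -1.0, 2.0]
test_doc_vec = [0.5, 1.0, 1.0]

print(merge_features(requirement_vec, test_doc_vec))
```

A dimension where both vectors agree in sign gives a positive product, one where they disagree gives a negative product, and the difference shows how far apart they are on that dimension.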

For word-embeddings they try both a domain specific (train-control systems in this case) embedding with 50 dimensions, and a 300-dimensional one combining the domain-specific data with a Wikipedia text dump. They found the domain specific one works better, speculating it to be due to domain-specific use of words.

Since this prediction can produce many different possibilities in a ranked order, simple accuracy of the top choice is not "accurate" itself as an evaluation metric. For evaluating the results, (Guo2017) uses mean average precision (MAP) as a metric. The MAP achieved in (Guo2017) is reported up to 83.4%. The numbers seem relatively good, although I would need to play with the results to really understand everything in relation to this metric.

An interesting point from (Guo2017) is a way to address class imbalances. The set of requirements and other documents that have valid links is a small fraction of the overall set. So the imbalance between the true and false labels is big. They address this by selecting an equal set of true and false labels for an epoch, and switching the set of false label items at the start of each epoch. So all the training data is processed, while a balance is held in each epoch. Nice.
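For reference, this is how mean average precision is typically computed for ranked results. The sketch follows the standard information retrieval definition, and the rankings are invented examples, not data from (Guo2017).

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """queries: list of (ranked candidate list, set of truly linked items)."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

# Hypothetical rankings of test documents for two requirements.
queries = [
    (["t1", "t2", "t3"], {"t1", "t3"}),  # AP = (1/1 + 2/3) / 2
    (["t4", "t5"], {"t5"}),              # AP = 1/2
]
print(mean_average_precision(queries))
```

The metric rewards putting all correct links near the top of the ranking, not just getting the first suggestion right.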
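The per-epoch balancing could look something like the following sketch. This is my reading of the description, not code from the paper. It assumes there are at least as many negative (unlinked) pairs as positive (linked) ones, and the item names are placeholders.

```python
import random

def balanced_epochs(positives, negatives, n_epochs, seed=0):
    """Yield one training set per epoch: all positives plus an equally sized,
    fresh slice of negatives. Assumes len(negatives) >= len(positives)."""
    rng = random.Random(seed)
    pool = []
    for _ in range(n_epochs):
        if len(pool) < len(positives):
            # Negatives (nearly) used up: reshuffle and start over.
            pool = list(negatives)
            rng.shuffle(pool)
        chosen = [pool.pop() for _ in range(len(positives))]
        epoch = [(p, 1) for p in positives] + [(n, 0) for n in chosen]
        rng.shuffle(epoch)
        yield epoch

# Hypothetical document pairs: 2 linked, 6 not linked.
positives = ["p1", "p2"]
negatives = ["n1", "n2", "n3", "n4", "n5", "n6"]

epochs = list(balanced_epochs(positives, negatives, n_epochs=3))
for epoch in epochs:
    print(epoch)
```

With these sizes, three epochs are enough to go through every negative pair once, while each single epoch stays balanced.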

Github Issues

Traceability for linking code commits to bug tracker issues and improvement tickets ("bugs" and "improvement" in project Jira) is presented in (Rath2018). The studied projects are 6 open-source projects written in Java. Unlike the previous study on requirements linking, this study does not use deep-learning based approaches but rather manual feature engineering and more traditional ML classifiers (decision trees, Naive Bayes, random forest).

This is about mapping issue reports to commits that fix those issues:

Github Issue Linking

Besides more traditional features, (Rath2018) also makes use of time related aspects as extra filtering features. A training set is built by finding commit messages that reference affected issue IDs. The features used include:

  • Timestamp of commit. Has to be later than creation timestamp for potential issue it could be linked to. Has to be inside given timeframe since issue was marked resolved.
  • Closest commit before analyzed commit, and its linked issues.
  • Closest commit after analyzed commit, and its linked issues.
  • Committer id
  • Reporter id
  • Textual similarity of commit source/message and issue text. TF-IDF weighted word- and ngram-counts.
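The timestamp feature from the list above works as a filter on which issue and commit pairs are even considered. A sketch of how I read it is below. The 7-day window and the choice to measure it in both directions around the resolution time are my assumptions, since the exact timeframe is not given here.

```python
from datetime import datetime, timedelta

def is_link_candidate(commit_time, issue_created, issue_resolved,
                      window=timedelta(days=7)):
    """A commit can only fix an issue that already existed, and it should be
    close in time to when the issue was marked resolved."""
    if commit_time < issue_created:
        return False
    return abs(commit_time - issue_resolved) <= window

# Hypothetical issue and commits.
created = datetime(2018, 3, 1, 9, 0)
resolved = datetime(2018, 3, 10, 15, 0)

print(is_link_candidate(datetime(2018, 3, 9, 12, 0), created, resolved))
print(is_link_candidate(datetime(2018, 2, 20, 12, 0), created, resolved))
print(is_link_candidate(datetime(2018, 5, 1, 12, 0), created, resolved))
```

Only the first commit passes: the second predates the issue and the third is too far from the resolution. Filtering like this cuts down the number of pairs the classifier has to score.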

The study in (Rath2018) looks at two different types of analysis for the accuracy of the ML classifier trained. In the first case they try to "reconstruct known links", and in the second case "construct unknown links". They further consider two different scenarios: recommending links for commits, and fully automated link generation. For assistance, their goal is to have the correct link tag in the top 3 suggestions. The automated tagging scenario requires the first predicted tag to be correct.

Not surprisingly, the top 3 approach has better results as it gives the classifier more freedom and leeway. Their results are reported with up to 95%+ recall but with a precision of around 30%. This seems to be in line with what I saw when I tried to build my issue categorization classifier. The first result may not always be correct but many good ones are in the top (and with too many possibilities, even the "correct" one might be a bit ambiguous).

The second use case of constructing previously unknown links sounds to me like it should be very similar in implementation to the first one, but it appears not. The main difference comes from there being large numbers of commits that do not actually address a specific Jira issue or ticket. The canonical example given is a refactoring commit. The obvious (in hindsight) result seems to state you are more likely to find a link if one is known to exist (case 1) vs finding one if it might not exist at all (case 2) :).

A Thought or Two

The point of the requirements linking approach finding the domain-specific word-embeddings better is interesting. In my previous LSTM bug predictor, I found domain specific training helps in a similar way, although in that case combining with the pre-trained word-embeddings worked nicely as well. Of course, I used a large pre-trained GloVe embedding for that and did not train on Wikipedia myself. And I used GloVe vs Word2Vec, but I would not expect a big difference.

However, the domain-specific embeddings performance sounds similar to ELMo, BERT, and other recent developments in context-sensitive embeddings. By training only on the domain-specific corpus, you likely get more context-sensitive embeddings for the domain. Maybe the train-control domain in (Guo2017) has more specific vocabulary, or some other attributes that make the smaller domain-specific embedding alone better? Or maybe the type of embedding and its training data makes the difference? No idea. Here’s hoping ELMo style contextual embeddings are made easy to add to Keras models soon, so I can more broadly experiment with those as well. In my obvious summary, I guess it is always better to try different options for different data and models.

Parting Notes

I tried to cover some different aspects of ML applications in software testing. The ones I covered seem to have quite a lot in common. In some sense, they are all mapping documents together. The set of features is also quite common: "traditional" source code metrics along with NLP features. Many specific metrics have also been developed as I listed above, such as modification and modifier (commit author) counts. Deep learning approaches are used to some extent, but they still seem to be making their way in this domain.

Besides what I covered, there are of course other approaches to apply ML to SW testing. I covered some last year, and (Durelli2019) covers much more from an academic perspective. But I found the ones I covered here to be a rather representative set of the ones I consider closest to practical today. If you have further ideas, happy to hear.

In general, I have not seen much of ML applied in meaningful ways to software testing. One approach I have used in the past is to use ML as a tool for learning about a test network and its services (Kanstren2017). I am not sure if that really qualifies as an ML application to software testing, since it investigated properties of the test network itself and its services, not the process of testing. Perhaps the generalization of that is in "using machine learning with testing technologies". This would be different from applying ML to testing itself, as well as different from testing ML applications. Have to think about that.

Next I guess I will see if/when I have some time to look at the testing ML applications part. With all the hype on self-driving cars and everything else, that should be interesting.

See, I made this nice picture (with too small text) of the three facets of ML and SW Testing I listed above:

Test vs ML facets

References

R. Bhagwan et al., "Orca: Differential Bug Localization in Large-Scale Services", 13th USENIX Symposium on Operating Systems Design and Implementation, 2018.

B. Busjaeger, T. Xie, "Learning for Test Prioritization: An Industrial Case Study", 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE), 2016.

N. DiGiuseppe, J.A. Jones, "Concept-Based Failure Clustering", ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE), 2012.

V. H. S. Durelli et al., "Machine Learning Applied to Software Testing: A Systematic Mapping Study", IEEE Transactions on Reliability, 2019.

X. Gu, H. Zhang, S. Kim, "Deep code search", 40th International Conference on Software Engineering (ICSE), 2018.

J. Guo, J. Cheng, J. Cleland-Huang, "Semantically Enhanced Software Traceability Using Deep Learning Techniques", 39th IEEE/ACM International Conference on Software Engineering (ICSE), 2017.

L. Johnson, et al., "Automatic Localization of Bugs to Faulty Components in Large Scale Software Systems Using Bayesian Classification", IEEE International Conference on Software Quality, Reliability and Security (QRS), 2017.

T. Kanstren, "Experiences in Testing and Analysing Data Intensive Systems", IEEE International Conference on Software Quality, Reliability and Security (QRS, industry track), 2017.

M. Kim, et al., "Data Scientists in Software Teams: State of the Art and Challenges", IEEE Transactions on Software Engineering, vol. 44, no. 11, 2018.

A. N. Lam, A. T. Nguyen, H. A. Nguyen, T. N. Nguyen, "Bug Localization with Combination of Deep Learning and Information Retrieval", IEEE International Conference on Program Comprehension, 2017.

J. Li, P. He, J. Zhu, M. R. Lyu, "Software Defect Prediction via Convolutional Neural Network", IEEE International Conference on Software Quality, Reliability and Security (QRS), 2017.

X. Li et al., "Unsupervised deep bug report summarization", 26th Conference on Program Comprehension (ICPC), 2018.

R. Malhotra, "A systematic review of machine learning techniques for software fault prediction", Applied Soft Computing, vol. 27, 2015.

A. Memon et al., "Taming Google-scale continuous testing", 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), 2017.

M. Rath, J. Rendall, J.L.C Guo, J. Cleland-Huang, P. Mäder, "Traceability in the wild: Automatically Augmenting Incomplete Trace Links", 40th IEEE/ACM International Conference on Software Engineering (ICSE), 2018.

M. Tufano et al., "Deep learning similarities from different representations of source code", 15th International Conference on Mining Software Repositories (MSR), 2018.

S. Wang, T. Liu, J. Nam, L. Tan, "Deep Semantic Feature Learning for Software Defect Prediction", IEEE Transactions on Software Engineering, 2018.

J. Zhou, H. Zhang, D. Lo, "Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports", International Conference on Software Engineering (ICSE), 2012.

Y. Zhou et al., "How Far We Have Progressed in the Journey? An Examination of Cross-Project Defect Prediction", ACM Transactions on Software Engineering and Methodology, vol. 27, no. 1, 2018.
