Machine Learning in Software Testing

Early 2019 Edition


Software testing has not really changed all that much in the past decades. Machine learning on the other hand is a very rapidly evolving technology being adopted all over the place. So what can it bring to software testing?

Back in 2018 (so about a year ago from now) I did a review of machine learning (ML) (and deep learning (DL)) applications in Network Analysis and Software Testing. After spending some more time learning and trying ML/DL in practice, this is an update on the ML for testing part, reflecting on my own learnings and some developments over this past year. Another interesting part would be testing ML systems. I will get to that in another post.

In last year's review, I focused on several topics. A recent academic study (Durelli2019) in this area also lists a number of topics. This includes topics such as "learning test oracles", which basically translates to learning a model of a system's behaviour based on some observations or other data about the software behaviour. Last year I included this under the name of specification mining. In practice, I have found such learned behavioral models to be of limited use, and have not seen general uptake anywhere in practice. In this review I focus on fewer topics that I find more convincing for practical use.

I illustrate these techniques with this nice pic I made:


In this pic, the "Magic ML Oracle" is just an ML model, or a system using such a model. It learns from a set of inputs during the training phase. In the figure above this could be some bug reports linked to components (file handling, user interface, network, …). In the prediction phase it runs as a classifier, predicting something such as which component a reported issue should be assigned to, how fault-prone a part of the software is (e.g., to help focus testing), or how tests and specs are linked (in case of missing links).

The topics I cover mainly relate to using machine learning to analyze various test related artefacts as in the figure above. One example of this is the bug report classifier I built previously. Since most of these ML techniques are quite general, just applied to software testing, ideas from broader ML applications could be useful here as well.

Specifically, software testing is not necessarily that different from other software engineering activities. For example, Microsoft performed an extensive study (Kim2017) on their data scientists and their work in software engineering teams. This work includes bug and performance analysis and prioritization, as well as customer feedback analysis, and various other quality (assurance) related topics.

As an example of concrete ML application to broader SW engineering, (Gu2018) maps natural language queries to source code to enable code search. To train a DL model for this, one recurrent neural network (RNN) based model is built for the code description (from source comments), and another one for the source code. The output of these two is a numerical feature vector. Cosine similarity is a measure used to compare how far apart two such vectors are, and here it is used as the training loss function. This is a nice trick to train a model to map source code constructs to natural language "constructs", enabling mapping short queries to new code in similar ways. It is also nicely described in the morning paper. I see no reason why such queries and mappings would not work for test related searches and code/documents as well. In fact, many of the techniques I list in following sections use quite similar approaches.

Like I said, I am focusing on a smaller set of more practical topics than last year in this still-overly-long post. The overall idea of how to apply these types of techniques in testing in general should not be too different though. This time, I categorize the topics into test prioritization, bug report localization, defect prediction, and traceability analysis. One section to go over each, one to rule over them all.

Test Prioritization

As (software) organizations and projects grow over time, their codebase tends to grow with them. This would lead to also having more tests to cover that codebase (hopefully…). Running all possible test cases all the time in such a scenario is not always possible or cost-efficient, as it starts to take more and more time and resources. The approach to selecting a subset of the tests to execute has different names: test prioritization, test suite optimization, test minimization, …

The goal is to cover as much of the fault-prone areas as possible with fewer tests, such as in this completely made up image of mine to illustrate the topic:

Prioritized coverage

Consider the coverage % in the above to reflect, for example, covering changes since the last time the tests were run. The aim is to check that the changes did not break anything, as fast and efficiently as possible.

An industrial test prioritization system used at Google (in 2017) is described in (Memon2017). It does not directly discuss using machine learning (although it mentions it as a future plan for the data). However, I find it interesting for general data-analysis of testing related data as a basis for test prioritization. It also works to provide a basis for a set of features for ML algorithms, as understanding and tuning your data is really the basis for all ML applications.

The goal in this (Memon2017) case is two-fold: better utilizing test resources (focusing on potentially failing tests) and providing feedback to the developers about their commits. The aim is not 100% accurate predictions but rather focusing automated test execution and providing the developer with feedback such as "this commit is 95% more likely to cause breakage due to the code being touched by 5 developers in the past 10 days, and being written in Java". The developers can use this feedback to seek further assurance in additional reviews, more testing, static analysis, and so on.

Some of the interesting features/findings described in (Memon2017):

  • Only about 1% of the tests ever failed during their lifetime.
  • Thus about 99% speedup would be possible if right tests could be identified.
  • Use of dependency distance as a feature: What other component depends on the changed component, and through how many other components
  • Test targets further away from the change are much less likely to fail. So dependency distance seems like a useful prediction (feature) metric. They used a threshold of 10 for their codebase, which might vary by project but the idea likely holds.
  • Files/targets modified more often are more likely to cause breakages.
  • File type affects likelihood of breakage.
  • User/tool id affects likelihood of breakage.
  • Number of developers having worked on single file in a short time affects likelihood of breakage. More developers means higher likelihood to break.
  • The number of test targets affected by a change varies greatly, maybe requiring different treatment.
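To make the dependency distance idea a bit more concrete, here is a minimal sketch of how it could be computed with a breadth-first search over a build graph. The graph, the target names, the function name, and the threshold of 2 are all made up for illustration. This is not how (Memon2017) implements it, just my reading of the feature.

```python
from collections import deque


def dependency_distances(reverse_deps, changed):
    """Breadth-first search from the changed target over reverse dependencies.

    reverse_deps maps a target to the targets that directly depend on it.
    Returns {target: number of dependency hops from the changed target}.
    """
    distances = {changed: 0}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in reverse_deps.get(current, []):
            if dependent not in distances:
                distances[dependent] = distances[current] + 1
                queue.append(dependent)
    return distances


# Hypothetical build graph: "core" is used by "net" and "ui", and so on.
reverse_deps = {
    "core": ["net", "ui"],
    "net": ["net_test", "app"],
    "ui": ["ui_test"],
    "app": ["app_test"],
}

distances = dependency_distances(reverse_deps, "core")
max_distance = 2  # illustrative threshold only; the paper mentions 10
tests_to_run = sorted(
    t for t, d in distances.items() if t.endswith("_test") and d <= max_distance
)
print(tests_to_run)
```

The distance could be used as a hard cut-off like here, or just fed to a classifier as one feature among the others in the list.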

A similar set of features is presented in (Bhagwan2018):

  • Developer experience: Developer time in the organization and project
  • Code ownership: More developers changing files/components cause more bugs
  • Code hotspots: Specific parts of code that cause issues when changed
  • Commit complexity: Number of changes, changed files, review comments in a single commit. More equals more bugs.

An industrial test prioritization approach is described in (Busjaeger2016). They use five types of features:

  • code coverage
  • text path similarity
  • text content similarity
  • test failure history
  • test age

In (Busjaeger2016), the similarity scores are based on TF-IDF scores and their cosine similarity. TF-IDF simply weights the frequency of a word in a document against the frequency of the same word in all other documents, to identify the most specific terms for each document. The features are fed into a support-vector model to rank which tests to execute first. In their (Busjaeger2016) experience, about 3% of overall tests is required to reach about 75% coverage. From the predicted tests, about every 5th is found to be causing a failure.
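As a rough sketch of what a TF-IDF plus cosine similarity score looks like, here is a small standard-library-only version. The texts, the whitespace tokenization, and the exact TF-IDF variant (term frequency times log of inverse document frequency) are my assumptions for illustration. The paper likely differs in the details, and in practice you would probably use a library for this.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using a basic TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    doc_count = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(doc_count / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up example: one change description and two test descriptions.
change = "fix socket timeout in network client"
tests = ["network client socket test", "user interface layout test"]

change_vec, *test_vecs = tfidf_vectors([change] + tests)
scores = [cosine_similarity(change_vec, vec) for vec in test_vecs]
print(scores)
```

The first test shares terms with the change and gets a positive score, while the second shares none and scores zero. Scores like these would then be one input feature for the ranking model.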

I find the above provide good examples of data analysis, and basis for defining ML features.

In the (Durelli2019) review, several studies are listed under test prioritization, but these mostly do not strike me as very realistic examples of ML applications. However, one interesting approach is (Spieker2017), which investigates using reinforcement learning for test prioritization. It uses only three features: execution time (duration), last execution (whatever that means..), and failure history. These seem a rather simple set of features to build a complex model, and it seems likely to me that a simple model would also work here. The results in (Spieker2017) are presented as good but not investigated in depth so hard to say from just that. However, I did find the approach to present some interesting ideas in relation to this:

  • Continuous integration systems constantly execute the test suites so you will have a lot of constantly updating data about test suites, execution, results available
  • Continuously updating the model over time based on the last N test runs
  • Using a higher exploration rate over full suite to bootstrap the model, lowering over time when it has learned but not setting to zero
  • Using "test case" as model state, and assigning it a priority as an action
  • Listing real "open-data" industry-based datasets to evaluate prioritization ML models on

I would be interested to see how well a simple model, such as Naive Bayes, weighting the previous pass/fails and some pattern over their probability would work. But from the paper it is hard to tell. In any case, the points above would be interesting to explore further.
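To illustrate the kind of simple baseline I have in mind, here is a sketch that ranks tests by a recency-weighted failure rate. To be clear, this is not Naive Bayes and not the reinforcement learning approach of (Spieker2017). It is just a hypothetical stand-in, and the decay factor and test names are made up.

```python
def failure_score(history, decay=0.8):
    """Recency-weighted failure rate for one test.

    history is a list of booleans, oldest first, where True means the test
    failed in that run. More recent runs get a higher weight.
    """
    score = 0.0
    total_weight = 0.0
    weight = 1.0
    for failed in reversed(history):  # newest run first
        score += weight * (1.0 if failed else 0.0)
        total_weight += weight
        weight *= decay
    return score / total_weight if total_weight else 0.0


# Made-up results of the last five CI runs per test (True = failed).
histories = {
    "test_login": [False, False, False, False, False],
    "test_upload": [True, False, False, False, False],
    "test_search": [False, False, False, True, True],
}

ranked = sorted(
    histories, key=lambda name: failure_score(histories[name]), reverse=True
)
print(ranked)
```

Something this simple would at least give a reference point to compare the fancier models against.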

A Thought (maybe Two)

I assume ML has been applied to test prioritization, just not so much documented. For example, I expect Google would have taken their studies further and used ML for this as discussed in their report (Memon2017). Test prioritization seems like a suitably complex problem, with lots of easily accessible data, and with a clear payoff in sight, to apply ML. The more tests you have, the more you need to execute, the more data you get, the more potential benefit.

In this as in many advanced research topics, I guess the "killer app" might come by integrating all this into a test system / product as a black-box. This would enable everyone to make use of it without requiring to learn all the "ML in test" details outside their core business. Same I guess applies to the other topics I cover in the following sections.

Bug Report Localization

Bug report localization (in this case anyway) involves taking a bug report and finding the component or other part of the software that the report is most likely to concern. Various approaches aim to automate this process by using machine learning algorithms. The bug report classifier I built previously is one example of this.

I made a pretty picture to illustrate this:

Localization Oracle

Typically a bug report is expressed in natural language (at least partially, with code snippets embedded). These are fed to the machine learning classifier (magic oracle in the pic above), which assigns it to 1-N potential components. Component granularity and other details may vary but this is the general idea.

For this, code structural elements used as features include (Tufano2018):

  • sequences of abstract syntax tree (AST) nodes
  • sequences of control-flow graph (CFG) nodes
  • bytecode representations. This seems interesting in mapping the code to fewer shared elements (opcodes)

Other features include (Lam2017):

  • camel-case splitting source code (n-grams would seem a natural fit too)
  • time since a file was previously changed when fixing a bug
  • how many bugs overall have been fixed in a file
  • similarity between a bug report and previous bug reports (and what were they assigned to)

Besides using such specific code structures as inputs, also specific pre-processing steps are taken. These include (Tufano2018, Li2018):

  • replacing constant values with their types,
  • splitting camel-case,
  • removing low-level detailed abstract syntax tree (AST) nodes,
  • filtering out methods less than 10 lines long,
  • regular expressions to remove code format characters, and to identify code snippets embedded into the bug report.
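Two of the pre-processing steps above, camel-case splitting and removing code format characters, can be sketched with a couple of regular expressions. The function names and the exact patterns are my own choices for illustration, not taken from (Tufano2018) or (Li2018).

```python
import re


def split_camel_case(identifier):
    """Split an identifier such as 'parseHttpRequest' into lower-case words."""
    # Insert a space at each lower-case/digit to upper-case boundary.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", identifier)
    return spaced.lower().split()


def clean_code_text(text):
    """Replace code formatting characters with spaces, then split identifiers."""
    no_symbols = re.sub(r"[^A-Za-z0-9]+", " ", text)
    words = []
    for token in no_symbols.split():
        words.extend(split_camel_case(token))
    return words


print(split_camel_case("parseHttpRequest"))
print(clean_code_text("if (fileHandle == null) { openFile(path); }"))
```

The resulting word lists can then go through the same text processing (stopwords, TF-IDF, embeddings) as the natural language parts of the bug report.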

An industrial case study on bug localization from Ericsson is presented in (Johnson2016). Topic models built with Latent Dirichlet Allocation (LDA) are learned from the set of bug reports. These are used to assign topic weights to bug reports based on the bug report text. The assigned weights are compared to the learned topic distribution for components, and the higher the match of distributions in the report vs learned component model, the higher the probability to assign the bug report to that component.

Vector Space Model (VSM) was used as a baseline comparison in many cases I found. This is based on TF-IDF scores (vectors) calculated for a document. Similarity between a bug report and source code files in VSM is calculated as a cosine similarity between their TF-IDF vectors. Revised Vector Space Model (rVSM) (Zhou2012) is a refinement of VSM that weights larger documents more, reasoning that bugs are more often found in larger source files. (Zhou2012) also adds weighting from similarity with previous bug reports.

Building on rVSM, (Lam2017) uses an auto-encoder neural network on TF-IDF weighted document terms to map different terms with similar meaning together for more accurate bug localization. Similarly, the "DeepSum" work (Li2018) uses an auto-encoder to summarize bug reports, and to compare their TF-IDF distance with cosine similarity. To me this use of auto-encoders seems like trying to re-invent word-embeddings for no obvious gain, but probably I miss something. After auto-encoding, (Lam2017) combines a set of features using a deep neural network (multi-layer perceptron (MLP) it seems) for final probability evaluation. In any case, word-embedding style mapping of words together in a smaller dimension is found in these works as in others.

A Thought or Two

I am a bit surprised not to see much work in applying RNN type networks such as LSTM and GRU into these topics, since they are a great fit for processing textual documents. In my experience they are also quite powerful compared to traditional machine learning methods.

I think this type of bug report localization has practical relevance mainly for big companies with large product teams and customer bases, and complex processes (support levels, etc). This is in domains like telcos, from which the only clear industry report I listed here is the one from Ericsson (Johnson2016). Something I have found limiting these types of applications in practice is also the need for cross-domain vision to combine these topics and expertise. People often seem quite narrowly focused on specific areas. Black-box integration with common tools might help, again.

Defect Prediction

Software defect prediction refers to predicting which parts of the software are most likely to contain faults. Sometimes this is also referred to as fault proneness analysis. The aim is to provide additional information to help focus testing efforts. This is actually very similar to the bug report localization I discussed above, but with the goal of predicting where currently unknown bugs might appear (vs localizing existing issue reports).

An overall review of this area is presented in (Malhotra2015), showing an extensive use of traditional ML approaches (random forest, decision trees, support vector machines, etc.) and traditional source code metrics (lines of code, cyclomatic complexity, etc.) as features. These studies show reasonably good accuracies, ranging from 75% to 93%.

However, another broad review on these approaches and their effectiveness is presented in (Zhou2018). It shows how simply using larger module size to predict higher fault proneness would give equal or better accuracy in many cases. This matches my experience from many contexts: keeping it simple is often very effective. But on the other hand, finding that simplicity can be the real challenge, and you can learn a lot by trying different approaches.

More recently, deep learning based approaches have also been applied to this domain. Deep Belief Nets (DBN) are applied in (Wang2018) to generate features from source code AST, and combined with more traditional source code metrics. The presentation on DBNs in (Wang2018) is a bit unclear to me, but it seems very similar to an MLP. The output of this layer is then termed (as far as I understand) as "semantic feature vector". I looked a bit into the difference of DBN vs MLP, and found some practical discussion at a Keras issue. Make what you will of it. Do let me know if you figure it out better than I did (what is the difference in using an MLP style fully connected dense layer here vs DBN).

An earlier version of the (Wang2018) work is refined and further explored using convolutional neural networks (CNNs) in (Li2017). In this case, a word2vec word-embedding layer is used as the first layer, and trained on the source and AST vocabulary. This is fed into a 1-dimensional CNN, which is one of the popular deep learning network types for text processing. After passing through this part of the network, the output feature vector is merged with a set of the more traditional source metrics (lines of code, etc). The merged vector is passed through the final network layers, ending in a single-node output layer that gives the probability prediction.

Illustration of this network:

Metrics based model

To address class imbalance (more "clean" than "buggy" files), (Li2017) uses duplication of the minority class instances. They also compare to traditional metrics as well as the DBN from (Wang2018) and DBN+ which combines the traditional features with the DBN "semantic" features. As usual for research papers, they report getting better results with the CNN+ version. Everyone seems to do that, yet perfection seems never to be achieved, or even nearly. Strange.
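Duplicating minority class instances is simple enough to sketch in a few lines. The version below assumes random duplication until the classes are the same size, which is my guess at a reasonable setup and not necessarily the exact procedure in (Li2017). The file names and labels are made up.

```python
import random


def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until classes balance.

    Returns new (samples, labels) lists; the inputs are left untouched.
    """
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    majority_size = max(len(group) for group in by_label.values())
    balanced_samples, balanced_labels = [], []
    for label, group in by_label.items():
        # Draw extra copies only for classes smaller than the largest one.
        extra = [rng.choice(group) for _ in range(majority_size - len(group))]
        for sample in group + extra:
            balanced_samples.append(sample)
            balanced_labels.append(label)
    return balanced_samples, balanced_labels


# Made-up file names: four "clean" files and one "buggy" file.
files = ["a.java", "b.java", "c.java", "d.java", "e.java"]
labels = ["clean", "clean", "clean", "clean", "buggy"]

new_files, new_labels = oversample_minority(files, labels)
print(new_labels.count("clean"), new_labels.count("buggy"))
```

This should only be applied to the training split. Duplicating before splitting would leak copies of the same file into the test set.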

A Thought

The evolution in defect prediction seems to be from traditional classifiers with traditional "hand-crafted" (source metrics) features to deep-learning generated and AST-based features in combination with traditional metrics. Again, I did not see any RNN based deep-learning classifier applications, although I would expect they should be quite well suited for this type of analysis. Perhaps next time.

Traceability Analysis

Despite everyone being all Agile now, heavier processes such as requirements traceability can still be needed. Especially for complex enough systems, and ones with heavy regulatory- or standards-based compliance requirements. Medical, telco, automotive, … In the real world, such traces may not always be documented, and sometimes it is of interest to find them.

A line of work exploring the use of deep learning for automating the generation of traceability links between software artefacts is in (Guo2017, Rath2018). These are from the same major software engineering conference (ICSE) over two following years, with some author overlap. So there is some traceability in this work too, heh-heh (such joke, much non-academic). The first one (Guo2017) aims to link requirements to design and test artefacts in the train control domain. The second one aims to link code submissions to issues/bug reports on Github.

Requirements documents

Using recurrent neural networks (RNN) to link requirements documents to other documents is investigated in (Guo2017). I covered this work to some extent already last year, but let's see if I can add something with what I learned since.

Use cases for this as mentioned in (Guo2017):

  • Finding new, missing (undocumented) links between artefacts.
  • Train on a set of existing data for existing projects, apply to find links within a new project. This seems like a form of transfer learning, and is not explored in the paper. It focuses on the first bullet.

I find the approach used in (Guo2017) interesting, linking together two recurrent neural network (RNN) layers from parallel input branches for natural language processing (NLP):

Requirements linking NN

There are two identical input branches (top of figure above). One for the requirements documents, and one for the target document for which the link is assessed. Let’s pretend the target is a test document to stay relevant. A pair of documents is fed to different input branches of the network, and the network outputs a probability of these two documents being linked.

In ML you typically try different model configurations and hyperparameters to find what works best. In (Guo2017) they tried different types of layers and parameters. The figure above shows what they found best for their task. See (Guo2017) for the experiment details for other parameter values. Here, a bi-directional gated recurrent unit (bi-GRU) layer is used to process each document into a feature vector.

When the requirements document and the target document have been transformed by this to a vector representation, they are fed into a pointwise multiplication layer (mul) and to a vector subtraction layer (sub). In Keras this would seem to be a Merge layer with type "mul" or "sub". These merge layers are intended to describe the vector difference direction (mul) and distance (sub) across dimensions. A dense layer with sigmoid activation is used to integrate these two merges, and the final output is given by a 2-neuron softmax layer (linked/not linked probability).
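Stripped of the neural network machinery, the two merge operations are just element-wise arithmetic on the two document vectors. Here is a tiny plain-Python sketch with made-up 3-dimensional vectors. Concatenating the two results before the dense layer is my assumption of how they would be combined.

```python
def merge_features(req_vec, target_vec):
    """Combine two document vectors with element-wise multiply and subtract."""
    mul = [a * b for a, b in zip(req_vec, target_vec)]
    sub = [a - b for a, b in zip(req_vec, target_vec)]
    # Assumed: both merges are concatenated as input to the next dense layer.
    return mul + sub


# Made-up outputs of the two bi-GRU branches for one document pair.
requirement_vector = [0.5, -1.0, 2.0]
test_doc_vector = [0.5, 1.0, 1.0]

merged = merge_features(requirement_vector, test_doc_vector)
print(merged)
```

In a real model the vectors would of course be much larger, and the merge would be done by the framework's own layers so that gradients flow through both branches.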

For word-embeddings they try both a domain specific (train-control systems in this case) embedding with 50-dimensions, and a 300-dimensional one combining the domain-specific data with a Wikipedia text dump. They found the domain specific one works better, speculating it to be due to domain-specific use of words.

Since this prediction can produce many different possibilities in a ranked order, simple accuracy of the top choice is not "accurate" itself as an evaluation metric. For evaluating the results, (Guo2017) uses mean average precision (MAP) as a metric. The MAP achieved in (Guo2017) is reported up to 83.4%. The numbers seem relatively good, although I would need to play with the results to really understand everything in relation to this metric.
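For reference, this is how I understand mean average precision (MAP) to be calculated for ranked results. It is a minimal sketch of the standard definition, with made-up document ids, and (Guo2017) may use a slightly different variant.

```python
def average_precision(ranked, relevant):
    """Average precision for one query: ranked candidates vs. true links."""
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(queries):
    """Mean of the average precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Two made-up requirements, each with a ranked list of candidate documents.
queries = [
    (["t1", "t2", "t3", "t4"], {"t1", "t3"}),  # AP = (1/1 + 2/3) / 2
    (["t5", "t6", "t7"], {"t6"}),              # AP = (1/2) / 1
]
map_score = mean_average_precision(queries)
print(round(map_score, 3))
```

So the metric rewards ranking all the true links near the top, not just getting the first suggestion right.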

An interesting point from (Guo2017) is a way to address class imbalances. The set of requirements and other documents that have valid links they have is a small fraction of the overall set. So the imbalance between the true and false labels is big. They address this by selecting an equal set of true and false labels for an epoch, and switching the set of false label items at the start of each epoch. So all the training data is processed, while a balance is held in each epoch. Nice.
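The per-epoch balancing could look something like the sketch below. How the false-label items are actually picked for each epoch is not something I can state for sure, so the rotating window over a shuffled list is my assumption, and the document pairs are made up.

```python
import random


def balanced_epochs(positives, negatives, epochs, seed=0):
    """Yield one balanced training set per epoch.

    Every epoch uses all positive pairs plus an equally sized slice of the
    shuffled negative pairs. The slice moves forward each epoch, so the
    negatives are cycled through over time.
    """
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    size = len(positives)
    start = 0
    for _ in range(epochs):
        chosen = [shuffled[(start + i) % len(shuffled)] for i in range(size)]
        start = (start + size) % len(shuffled)
        yield [(pair, 1) for pair in positives] + [(pair, 0) for pair in chosen]


# Made-up (requirement, test document) pairs: 2 linked, 6 not linked.
positives = [("req1", "test1"), ("req2", "test2")]
negatives = [("req1", "test2"), ("req2", "test1"), ("req1", "test3"),
             ("req2", "test3"), ("req3", "test1"), ("req3", "test2")]

epoch_sets = list(balanced_epochs(positives, negatives, epochs=3))
for epoch_set in epoch_sets:
    print(epoch_set)
```

With 2 positives and 6 negatives, three epochs are enough to see every negative pair once while each epoch stays at a 50/50 split.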

Github Issues

Traceability for linking code commits to bug tracker issues and improvement tickets ("bugs" and "improvement" in project Jira) is presented in (Rath2018). The studied projects are 6 open-source projects written in Java. Unlike the previous study on requirements linking, this study does not use deep-learning based approaches but rather manual feature engineering and more traditional ML classifiers (decision trees, naive bayes, random forest).

This is about mapping issue reports to commits that fix those issues:

Github Issue Linking

Besides more traditional features, (Rath2018) also makes use of time related aspects as extra filtering features. A training set is built by finding commit messages that reference affected issue IDs. The features used include:

  • Timestamp of commit. Has to be later than creation timestamp for potential issue it could be linked to. Has to be inside given timeframe since issue was marked resolved.
  • Closest commit before analyzed commit, and its linked issues.
  • Closest commit after analyzed commit, and its linked issues.
  • Committer id
  • Reporter id
  • Textual similarity of commit source/message and issue text. TF-IDF weighted word- and ngram-counts.
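The timestamp feature in the list above works as a filter on which issues a commit could be linked to at all. A minimal sketch of such a filter follows. The 7-day window and the function name are hypothetical, and I interpret "inside given timeframe" as a symmetric window around the resolution time, which may not match (Rath2018) exactly.

```python
from datetime import datetime, timedelta


def is_candidate_link(commit_time, issue_created, issue_resolved,
                      window=timedelta(days=7)):
    """Time-based filter for a possible commit-to-issue link.

    The commit must come after the issue was created and fall within
    `window` of the time the issue was marked resolved.
    """
    if commit_time < issue_created:
        return False
    return abs(commit_time - issue_resolved) <= window


# Made-up issue lifetime and three commits to check against it.
created = datetime(2019, 1, 1)
resolved = datetime(2019, 1, 20)

print(is_candidate_link(datetime(2019, 1, 18), created, resolved))
print(is_candidate_link(datetime(2018, 12, 30), created, resolved))
print(is_candidate_link(datetime(2019, 3, 1), created, resolved))
```

Only the first commit passes: the second predates the issue, and the third is too long after the resolution.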

The study in (Rath2018) looks at two different types of analysis for the accuracy of the ML classifier trained. In the first case they try to "reconstruct known links", and in the second case "construct unknown links". They further consider two different scenarios: recommending links for commits, and fully automated link generation. For assistance, their goal is to have the correct link tag in the top 3 suggestions. The automated tagging scenario requires the first predicted tag to be correct.

Not surprisingly, the top 3 approach has better results as it gives the classifier more freedom and leeway. Their results are reported with up to 95%+ recall but with a precision of around 30%. This seems to be in line with what I saw when I tried to build my issue categorization classifier. The first result may not always be correct but many good ones are in the top (and with too many possibilities, even the "correct" one might be a bit ambiguous).

The second use case of constructing previously unknown links sounds to me like it should be very similar in implementation to the first one, but it appears not. The main difference comes from there being large numbers of commits that do not actually address a specific Jira issue or ticket. The canonical example given is a refactoring commit. The obvious (in hindsight) result seems to state you are more likely to find a link if one is known to exist (case 1) vs finding one if it might not exist at all (case 2) :).

A Thought or Two

The point of the requirements linking approach finding the domain-specific word-embeddings better is interesting. In my previous LSTM bug predictor, I found domain specific training helps in a similar way, although in that case combining with the pre-trained word-embeddings worked nicely as well. Of course, I used a large pre-trained GloVe embedding for that and did not train on Wikipedia myself. And used GloVe vs Word2Vec but I would not expect a big difference.

However, the domain-specific embeddings performance sounds similar to ELMo, BERT, and other recent developments in context-sensitive embeddings. By training only on the domain-specific corpus, you likely get more context-sensitive embeddings for the domain. Maybe the train-control domain in (Guo2017) has more specific vocabulary, or some other attributes that make the smaller domain-specific embedding alone better? Or maybe the type of embedding and its training data makes the difference? No idea. Here’s hoping ELMo style contextual embeddings are made easy to add to Keras models soon, so I can more broadly experiment with those as well. In my obvious summary, I guess it is always better to try different options for different data and models.

Parting Notes

I tried to cover some different aspects of ML applications in software testing. The ones I covered seem to have quite a lot in common. In some sense, they are all mapping documents together. The set of features are also quite common, "traditional" source code metrics along with NLP features. Many specific metrics have also been developed as I listed above, such as modification and modifier (commit author) counts. Deep learning approaches are used to some extent, but it still seems to be making its way in this domain.

Besides what I covered, there are of course other approaches to apply ML to SW testing. I covered some last year, and (Durelli2019) covers much more from an academic perspective. But I found the ones I covered here to be a rather representative set of the ones I consider closest to practical today. If you have further ideas, happy to hear.

In general, I have not seen much of ML applied in meaningful ways to software testing. One approach I have used in the past is to use ML as a tool for learning about a test network and its services (Kanstren2017). I am not sure if that really qualifies as an ML application to software testing, since it investigated properties of the test network itself and its services, not the process of testing. Perhaps the generalization of that is in "using machine learning with testing technologies". This would be different from applying ML to testing itself, as well as different from testing ML applications. Have to think about that.

Next I guess I will see if/when I have some time to look at the testing ML applications part. With all the hype on self-driving cars and everything else, that should be interesting.

See, I made this nice but too small text picture of the three facets of ML and SW Testing I listed above:

Test vs ML facets


R. Bhagwan et al., "Orca: Differential Bug Localization in Large-Scale Services", 13th USENIX Symposium on Operating Systems Design and Implementation, 2018.

B. Busjaeger, T. Xie, "Learning for Test Prioritization: An Industrial Case Study", 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE), 2016.

N. DiGiuseppe, J.A. Jones, "Concept-Based Failure Clustering", ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE), 2012.

V. H. S. Durelli et al., "Machine Learning Applied to Software Testing: A Systematic Mapping Study", IEEE Transactions on Reliability, 2019.

X. Gu, H. Zhang, S. Kim, "Deep code search", 40th International Conference on Software Engineering (ICSE), 2018.

J. Guo, J. Cheng, J. Cleland-Huang, "Semantically Enhanced Software Traceability Using Deep Learning Techniques", 39th IEEE/ACM International Conference on Software Engineering (ICSE), 2017.

L. Johnson et al., "Automatic Localization of Bugs to Faulty Components in Large Scale Software Systems Using Bayesian Classification", IEEE International Conference on Software Quality, Reliability and Security (QRS), 2016.

T. Kanstren, "Experiences in Testing and Analysing Data Intensive Systems", IEEE International Conference on Software Quality, Reliability and Security (QRS, industry track), 2017.

M. Kim, et al., "Data Scientists in Software Teams: State of the Art and Challenges", IEEE Transactions on Software Engineering, vol. 44, no. 11, 2018.

A. N. Lam, A. T. Nguyen, H. A. Nguyen, T. N. Nguyen, "Bug Localization with Combination of Deep Learning and Information Retrieval", IEEE International Conference on Program Comprehension, 2017.

J. Li, P. He, J. Zhu, M. R. Lyu, "Software Defect Prediction via Convolutional Neural Network", IEEE International Conference on Software Quality, Reliability and Security (QRS), 2017.

X. Li et al., "Unsupervised deep bug report summarization", 26th Conference on Program Comprehension (ICPC), 2018.

R. Malhotra, "A systematic review of machine learning techniques for software fault prediction", Applied Soft Computing, vol. 27, 2015.

A. Memon et al., "Taming Google-scale continuous testing", 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), 2017.

M. Rath, J. Rendall, J.L.C. Guo, J. Cleland-Huang, P. Mäder, "Traceability in the wild: Automatically Augmenting Incomplete Trace Links", 40th IEEE/ACM International Conference on Software Engineering (ICSE), 2018.

M. Tufano et al., "Deep learning similarities from different representations of source code", 15th International Conference on Mining Software Repositories (MSR), 2018.

S. Wang, T. Liu, J. Nam, L. Tan, "Deep Semantic Feature Learning for Software Defect Prediction", IEEE Transactions on Software Engineering, 2018.

J. Zhou, H. Zhang, D. Lo, "Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports", International Conference on Software Engineering (ICSE), 2012.

Y. Zhou et al., "How Far We Have Progressed in the Journey? An Examination of Cross-Project Defect Prediction", ACM Transactions on Software Engineering and Methodology, vol. 27, no. 1, 2018.


Predicting issue categories on Github

Practical examples of applying machine learning seem to be a bit difficult to find. So I tried to create one for a presentation I was doing on testing and data analytics. I reviewed works in the area and chose one to illustrate. This one tries to predict a target category to assign to an issue report. I used ARM mBed OS as a test target since it has issues available on Github and some people who work with it were attending the presentation.

This demo “service” I created works by first training a predictive model on a set of previous issue reports. I downloaded the reports from the issue repository. The amount of data available there was so small that I just downloaded the issues manually using the Github API, which lets me download the data for 100 issues at once. Automating the download should be pretty easy if needed. With little data and a large number of categories to predict, this is not the best setup for results, but it serves as an example to illustrate the concept.

And no, there is no deep learning involved here, so not quite that trendy. I don’t think it is all that necessary for this purpose or this size of data. It could of course work better; if you do try it, post the code so we can play as well.

The Github issues API allows me to download the issues in batches. For example, it is possible to request page 12 of closed issues with 100 issues per page. The API seems to cap the page size at 100 even when given bigger values. Or I just didn’t quite use it right, whichever. The API docs describe the parameters quite clearly. I downloaded open and closed issues separately, even if I did not use the separation in any meaningful way in the end.
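As a rough illustration of the paging, here is a small sketch that only builds the request URL for one page of issues. The function name `issues_page_url` and the `some-owner`/`some-repo` values are placeholders I made up for the example; the `state`, `page` and `per_page` query parameters are the ones from the Github REST API docs, where the documented maximum for `per_page` is 100.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.github.com"


def issues_page_url(owner, repo, state="closed", page=1, per_page=100):
    """Build the URL for one page of a repository's issues (hypothetical helper)."""
    query = urlencode({"state": state, "page": page, "per_page": per_page})
    return f"{API_ROOT}/repos/{owner}/{repo}/issues?{query}"


# placeholder owner/repo, not the repository used in this post
url = issues_page_url("some-owner", "some-repo", state="closed", page=12)
print(url)
```

Fetching the URL with `requests.get` and saving the response body as a JSON file would then give one file per page, which is the layout the rest of the code assumes.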

The code here is all in Python. The final classifier/prediction services code is available on my Github repository.

First build a set of stopwords to do some cleaning on the issue descriptions:

from string import punctuation
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
stop_words = stop_words.union(set(punctuation))
stop_words.update(["''", "--", "**"])

The above code uses the common NLTK stopwords, a set of punctuation symbols, and a few commonly occurring symbol combinations I found in the data. Since later on I clean it up with another regular expression, probably just the NLTK stopwords would suffice here as well.

To preprocess the issue descriptions before applying machine learning algorithms:

import re
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def preprocess_report(body_o):
	#clean issue body text. leave only alphabetical and numerical characters and some specials such as +.,:/\
	#raw strings keep the backslashes readable and avoid invalid-escape warnings
	body = re.sub(r'[^A-Za-z0-9 /\\_+.,:\n]+', '', body_o)
	# replace URL separators with space so the parts of the url become separate words
	body = re.sub(r'[/\\]', ' ', body)
	# finally lemmatize all words for the analysis
	lemmatizer = WordNetLemmatizer()
	# text tokens are basis for the features
	text_tokens = [lemmatizer.lemmatize(word) for word in word_tokenize(body.lower()) if word not in stop_words]
	return text_tokens

The above code is intended to remove all but standard alphanumeric characters (and a few special characters) from the text, split URLs into parts as separate words, tokenize the text into separate words, and remove stop words. The lemmatization changes known words into their base forms (e.g., “car” and “cars” both become “car”). This just makes it easier for the machine learning algorithm to match words together. Another option is stemming, but lemmatization produces more human-friendly words so I use that.
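To see what the two regular expressions do on their own, here is a standard-library-only sketch. The helper name `clean_body` and the sample text are made up for illustration, and a plain whitespace split stands in for the NLTK tokenizer, stop-word removal and lemmatizer, so the tokens will differ somewhat from what `preprocess_report` produces.

```python
import re


def clean_body(body_o):
    """Apply only the regex cleaning steps, then split on whitespace."""
    # drop everything except letters, digits, space, / \ _ + . , : and newline
    body = re.sub(r'[^A-Za-z0-9 /\\_+.,:\n]+', '', body_o)
    # turn URL/path separators into spaces so the parts become separate words
    body = re.sub(r'[/\\]', ' ', body)
    return body.lower().split()


sample = "Build fails!! See https://example.com/docs/rtos_api (step #3)"
tokens = clean_body(sample)
print(tokens)
```

The exclamation marks, parentheses and hash sign disappear, and the URL falls apart into “https:”, “example.com”, “docs” and “rtos_api”.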

I stored the downloaded issues as JSON files (as Github API gives) in the data directory. To read all these filenames for processing:

import glob

#names of files containing closed and open issues (at time of download)
closed_files = glob.glob("data/*-closed-*")
open_files = glob.glob("data/*-open-*")

To process those files, I need to pick only the issues with an assigned “component” value. This is the training target label. The algorithm is trained to predict this “component” value from the issue description, so without this label, the piece of data is not useful for training.

import json

total = 0
#token lists and matching component labels, one entry per labeled issue
all_text_tokens = []
component_labels = []

def process_files(files):
	'''
	process the given set of files by collecting issue body text and labels.
	also cleans and lemmatizes the body text

	:param files: names of files to process
	:return: nothing
	'''
	global total

	for filename in files:
		with open(filename, encoding="utf-8") as json_data:
			all_issues = json.load(json_data)
			for issue in all_issues:
				labels = issue["labels"]
				for label in labels:
					if label["name"].startswith("component:"):
						name = label["name"]
						body_o = issue["body"]
						text_tokens = preprocess_report(body_o)
						all_text_tokens.append(text_tokens)
						#component_labels are prediction targets
						component_labels.append(name)
						total += 1

process_files(closed_files)
process_files(open_files)
print("total: ", total)

There is a limited number of such labeled data items, as many of the downloaded issues do not have this label assigned. The print at the end of the above code shows the total number of items with the “component” label given, and the number in this dataset is 1078.
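For reference, this is roughly the part of the issue JSON that the loop above relies on: each issue has a “body” string and a “labels” list of objects with a “name”. The sample issues below are invented, and `component_of` is a hypothetical helper, not part of the original code.

```python
import json

# invented sample in the shape the Github issues API returns (heavily trimmed)
sample_page = json.loads("""
[
  {"number": 1, "body": "Network socket hangs",
   "labels": [{"name": "component: networking"}, {"name": "type: bug"}]},
  {"number": 2, "body": "Typo in readme", "labels": []},
  {"number": 3, "body": "USB enumeration fails",
   "labels": [{"name": "component: usb"}]}
]
""")


def component_of(issue):
    """Return the first 'component: ...' label name of an issue, or None."""
    for label in issue["labels"]:
        if label["name"].startswith("component:"):
            return label["name"]
    return None


labeled = [(issue["number"], component_of(issue))
           for issue in sample_page if component_of(issue)]
print(labeled)
```

One difference to keep in mind: the loop in `process_files` adds an issue once for every component label it carries, while this helper only takes the first one.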

Besides removing stop-words and otherwise cleaning up the documents for NLP, combining words sometimes makes sense. Pairs, triplets, and so on are sometimes meaningful. A typical example is the words “new” and “york” appearing separately in a document, versus “new york”. This would be an example of a bi-gram, combining two words into “new_york”. To do this, I use the gensim package:

import gensim

# Build the bigram and trigram models
bigram = gensim.models.Phrases(all_text_tokens, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[all_text_tokens], threshold=100)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

#just to see it works
print(trigram_mod[bigram_mod[all_text_tokens[0]]])

#transform identified word pairs and triplets into bigrams and trigrams
text_tokens = [trigram_mod[bigram_mod[text]] for text in all_text_tokens]

#build whole documents from text tokens. some algorithms work on documents not tokens.
texts = [" ".join(tokens) for tokens in text_tokens]

The above code uses thresholds and minimum co-occurrence counts to avoid combining every possible word with every other possible word. So only word-pairs and triplets that commonly are found to occur together are used (replaced) in the document.
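To make the role of the minimum count and the threshold a bit more concrete, here is a simplified, standard-library-only sketch of phrase detection. It is not gensim’s implementation. The scoring formula (pair count minus the minimum count, scaled by vocabulary size and divided by the two word counts) is my reading of how the default scorer works, so treat it as an assumption. The function names and the toy documents are made up.

```python
from collections import Counter


def find_bigrams(docs, min_count=2, threshold=0.5):
    """Return the set of word pairs that co-occur often enough to be merged."""
    unigrams = Counter(word for doc in docs for word in doc)
    pairs = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    found = set()
    for (first, second), pair_count in pairs.items():
        if pair_count < min_count:
            continue  # too rare to trust
        score = (pair_count - min_count) * vocab_size / (unigrams[first] * unigrams[second])
        if score > threshold:
            found.add((first, second))
    return found


def merge_bigrams(doc, bigrams):
    """Replace each detected pair in a token list with a single joined token."""
    merged, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in bigrams:
            merged.append(doc[i] + "_" + doc[i + 1])
            i += 2
        else:
            merged.append(doc[i])
            i += 1
    return merged


toy_docs = [["new", "york", "is", "big"], ["i", "like", "new", "york"],
            ["new", "york", "city"], ["a", "new", "car"]]
toy_bigrams = find_bigrams(toy_docs)
print(toy_bigrams)
print(merge_bigrams(["a", "new", "york", "trip"], toy_bigrams))
```

Only “new york” occurs often enough here, so “new car” stays as two separate words.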

Use the Python data processing library Pandas to turn the data into a suitable format for the machine learning algorithms:

from pandas import DataFrame

df = DataFrame()

df["texts"] = texts
df["text_tokens"] = text_tokens
df["component"] = component_labels


First to have a look at the data:

#how many issues are there in our data for all the target labels, assigned component counts
value_counts = df["component"].value_counts()
#print how many times each component/target label exists in the training data
print(value_counts)
#remove all targets for which we have less than 10 training samples.
#stratified K-fold validation with 5 splits requires min 5 to have 1 in each split. a minimum of 10 makes it 2, which is still tiny but at least it sorta runs
#boolean mask: True for rows whose component has at least 10 samples
indices = df["component"].isin(value_counts[value_counts > 9].index)
#.loc with a boolean mask keeps only the rows where the mask is True (and all columns)
df = df.loc[indices, :]

The above code actually already does a bit more. It also filters the dataset to remove the rows whose component value has fewer than 10 items. So this is the unfiltered list:

component: tools              162
component: hal                128
component: export             124
component: networking         118
component: drivers            110
component: rtos                88
component: filesystem          80
component: tls                 78
component: docs                60
component: usb                 54
component: ble                 38
component: events              14
component: cmsis               10
component: stdlib               4
component: rpc                  4
component: uvisor               2
component: greentea-client      2
component: compiler             2

After filtering, the last five rows will have been removed. So in the end, the dataset will not have any rows with labels “stdlib”, “rpc”, “uvisor”, “greentea-client”, or “compiler”. This is because I will later use stratified 5-fold cross-validation, which requires a minimum of 5 items of each label. Filtering with a minimum of 10 instances per label makes it possible to have at least 2 of the least common “component” in each fold.
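A quick way to check the “2 per fold” reasoning is to run the stratified splitter on made-up labels where the rare class has exactly 10 samples. The label array and the all-zero features below are toy data just for this check.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy labels: 20 samples of a common class (0), 10 of a rare class (1)
labels = np.array([0] * 20 + [1] * 10)
dummy_features = np.zeros((len(labels), 1))  # the splitter only looks at the labels

kfold = StratifiedKFold(n_splits=5)
rare_per_fold = []
for train_idx, test_idx in kfold.split(dummy_features, labels):
    rare_per_fold.append(int((labels[test_idx] == 1).sum()))

print(rare_per_fold)
```

Each of the five test folds should end up with two samples of the rare class, which is about as small as it can get while still evaluating that class at all.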

In a more realistic case, much more data would be needed to cover all categories, and I would also look at possibly combining some of the different categories. I would also rebuild the model every now and then, depending on how much effort it is, how much new data comes in, etc.

To use the “component” values as target labels for machine learning, they need to be numerical (integers). This does the transformation:

from sklearn.preprocessing import LabelEncoder

# encode class values as integers
encoder = LabelEncoder()
encoded_label = encoder.fit_transform(df.component)

Just to see how the mapping of integer IDs to labels looks after label encoding:

import numpy as np

unique, counts = np.unique(encoded_label, return_counts=True)
print(unique) #the set of unique encoded labels
print(counts) #the number of instances for each label

The result (first line = id, second line = number of items):

[ 0  1  2  3  4  5  6  7  8  9 10 11 12]
[ 38  10  60 110  14 124  80 128 118  88  78 162  54]

Mapping the labels to integers:

#which textual label/component name matches which integer label
le_name_mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))

#which integer matches which textual label/component name
le_id_mapping = dict(zip(encoder.transform(encoder.classes_), encoder.classes_))

The first maps “label: id” pairs, and the second maps “id: label” pairs. The first one looks like this:

'component: ble': 0, 
'component: cmsis': 1, 
'component: docs': 2, 
'component: drivers': 3, 
'component: events': 4, 
'component: export': 5, 
'component: filesystem': 6, 
'component: hal': 7, 
'component: networking': 8, 
'component: rtos': 9, 
'component: tls': 10, 
'component: tools': 11, 
'component: usb': 12
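The IDs above follow the alphabetical order of the label names, which is how LabelEncoder assigns them. A tiny sketch with made-up labels shows this, and also that `inverse_transform` can be used instead of a hand-built id-to-name dictionary. The `encoder_demo` name is only used here to avoid clashing with the `encoder` above.

```python
from sklearn.preprocessing import LabelEncoder

sample_labels = ["component: tools", "component: ble",
                 "component: tools", "component: usb"]

encoder_demo = LabelEncoder()
encoded = encoder_demo.fit_transform(sample_labels)
print(list(encoded))                 # integer ID per input label
print(list(encoder_demo.classes_))   # position in this list is the ID

# map the integer IDs back to the original names
decoded = encoder_demo.inverse_transform(encoded)
print(list(decoded))
```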

Now, to turn the text into suitable input for a machine learning algorithm, I transform the documents into their TF-IDF representation. Well, if you go all deep learning with LSTM and the like, this may not be necessary. But don’t take my word for it, I am still trying to figure some of that out.

TF-IDF stands for term frequency (TF) – inverse document frequency (IDF). For example, if the word “bob” appears often in a document, it has a high term frequency for that document. Generally, one might consider such a word to describe that document well (or the concepts in the document). However, if the same word also appears commonly in all the documents (in the “corpus”), it is not really specific to that document, and not very representative of that document vs all other documents in the corpus. So IDF is used to modify the TF so that words that appear often in a document but less often in others in the corpus get a higher weight. And if the word appears often across many documents, it gets a lower weight. This is TF-IDF.

Traditional machine learning approaches also require a more fixed size set of input features. Since documents are of varying length, this can be a bit of an issue. Well, I believe some deep learning models also require this (e.g., CNN), while others less so (e.g., sequential models such as LSTM). Digressing. TF-IDF also (as far as I understand) results in a fixed length feature vector for all documents. Or read this on Stack Overflow and make your own mind up.

Anyway, to my understanding, the feature space (set of all features) after TF-IDF processing becomes the set of all unique words across all documents. Each of these is given a TF-IDF score for each document. For the words that do not exist in a document, the score is 0. And most documents don’t have all words in them, so this results in a very “sparse matrix”, where the zeroes are not really stored. That’s how you can actually process some reasonable sized set of documents in memory.
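A toy corpus makes this easier to see. In the sketch below (made-up documents, default vectorizer settings rather than the ones I use for the real data), “bob” appears in every document, “likes” in two, and “testing” in only one, so within the first document their weights should come out in the opposite order.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["bob likes testing", "bob likes networking", "bob writes drivers"]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# one row per document, one column per unique word in the corpus
print(toy_matrix.shape)
print(issparse(toy_matrix))

first_doc = toy_matrix[0].toarray()[0]
vocab = toy_vectorizer.vocabulary_  # word -> column index
for word in ["bob", "likes", "testing", "drivers"]:
    print(word, round(first_doc[vocab[word]], 3))
```

Every document gets a vector of the same length (six unique words here), and a word that is missing from the document, such as “drivers” in the first one, simply scores 0.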

So, in any case, to convert the documents to their TF-IDF representation:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5)

#the documents to vectorize: the space-joined token strings of the issues left after filtering
features = df["texts"]

#transform all documents into TFIDF vectors.
#TF-IDF vectors are all same length, same word at same index, value is its TFIDF for that word in that document
features_transformed = vectorizer.fit_transform(features)

The above code fits the vectorizer to the corpus and then transforms all the documents to their TF-IDF representations. To my understanding (from SO), the fit part counts the word occurrences in the corpus, and the transform part uses these overall counts to transform each document into TF-IDF.
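The split between fit and transform matters later when predicting for new documents: a fitted vectorizer keeps its vocabulary, and words it has never seen are just dropped. A minimal sketch with made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["bob likes testing", "alice likes drivers"]

demo_vectorizer = TfidfVectorizer()
demo_vectorizer.fit(train_docs)  # learns the vocabulary and document frequencies

# "zebras" was never seen during fit, so it gets no column
unseen = demo_vectorizer.transform(["bob likes zebras"])

print(sorted(demo_vectorizer.vocabulary_))
print(unseen.shape, unseen.nnz)
```

The new document still becomes a vector with one column per training word, and only “bob” and “likes” get non-zero values.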

It is possible also to print out all the words the TF-IDF found in the corpus:

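#the TFIDF feature names is a long list of all unique words found
#get_feature_names_out() is the name in scikit-learn 1.0 and later; older versions call it get_feature_names()
feature_names = np.array(vectorizer.get_feature_names_out())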

Now to train a classifier to predict the component based on a given document:

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

kfold = StratifiedKFold(n_splits=5) #5-fold cross validation

#the classifier to use, the parameters are selected based on a set i tried before
clf = RandomForestClassifier(n_estimators=50, min_samples_leaf=1, min_samples_split=5)

results = cross_val_score(clf, features_transformed, encoded_label, cv=kfold)

print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

#fit the classifier on the TFIDF transformed word features, train it to predict the component (encoded_label)
clf.fit(features_transformed, encoded_label)
#try it out on a single document
probabilities = clf.predict_proba(features_transformed[0])

In the above I am using a RandomForest classifier, with a set of parameters previously tuned. I am also using 5-fold cross validation, meaning the data is split into 5 different parts. The parts are “stratified”, meaning each fold has about the same percentage of each target label as the original set. This is why I removed the labels with fewer than 10 instances in the beginning, to have at least 2 for each class. Which is still super-tiny but that’s what this example is about.

The last part of the code above also runs a prediction on one of the transformed documents just to try it out.

Now, to run predictions on previously unseen documents:

import requests

def predict_component(issue):
	'''
	use this to get a set of predictions for a given issue.

	:param issue: issue id from github.
	:return: list of tuples (component name, probability)
	'''
	#first load text for the given issue from github
	#the base URL of the repository's issue API goes in the string below (it is blank in this listing)
	url = "" + str(issue)
	r = requests.get(url)
	url_json = json.loads(r.content)
	#process the loaded issue data to format matching what the classifier is trained on
	issue_tokens = preprocess_report(url_json["body"])
	issue_tokens = trigram_mod[bigram_mod[issue_tokens]]
	issue_text = " ".join(issue_tokens)
	features_transformed = vectorizer.transform([issue_text])
	#and predict the probability of each component type
	probabilities = clf.predict_proba(features_transformed)
	result = []
	for idx in range(probabilities.shape[1]):
		name = le_id_mapping[idx]
		prob = (probabilities[0, idx]*100)
		prob_str = "%.2f%%" % prob
		print(name, ":", prob_str)
		result.append((name, prob_str))
	return result

This code takes as a parameter an issue number for the ARM mBed Github repo. It downloads the issue data and preprocesses it the same way as the training data (clean, tokenize, lemmatize, TF-IDF). The result is then used as the set of features to predict the component, based on the model trained earlier.
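One way to make sure the prediction-time steps always match training is to bundle the vectorizer and the classifier into a single scikit-learn pipeline. This is an alternative arrangement, not what the code in this post does, and the texts and component names below are invented toy data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# invented toy data, repeated to have a few samples per class
toy_texts = ["socket timeout on connect", "tcp socket closed",
             "usb device not found", "usb endpoint stall"] * 3
toy_components = ["networking", "networking", "usb", "usb"] * 3

# the pipeline runs TF-IDF and then the classifier, for both fit and predict
model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(toy_texts, toy_components)

probs = model.predict_proba(["socket error on tcp connect"])
for name, prob in zip(model.classes_, probs[0]):
    print(name, "%.2f%%" % (prob * 100))
```

With this setup a function like `predict_component` would only need to do the text cleaning and then hand the resulting string to `model.predict_proba`.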

The “predict_component” method/function can then be called from elsewhere. In my case, I made a simple web page to call it. As noted in the beginning of this post, you can find that webserver code, as well as all the code above on my Github repository.

That’s pretty much it. Not very complicated to put some Python lines one after another, but knowing which lines and in which order is perhaps what takes the time to learn. Having someone else around to do it for you if you are a domain expert (e.g., testing, software engineering or similar in this case) is handy, but it can also be useful to have some idea of what happens, or how the algorithms in general work.

Something I left out in all the above was the code to try out different classifiers and their parameters. So I will just put it below for reference.

First a few helper methods:

import pandas as pd

def top_tfidf_feats(row, features, top_n=25):
	''' Get top n tfidf values in row and return them with their corresponding feature names.'''
	topn_ids = np.argsort(row)[::-1][:top_n]
	top_feats = [(features[i], row[i]) for i in topn_ids]
	df = pd.DataFrame(top_feats)
	df.columns = ['feature', 'tfidf']
	return df

#this prints it for the first document in the set
arr = features_test_transformed[0].toarray()
top_tfidf_feats(arr[0], feature_names)

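def show_most_informative_features(vectorizer, clf, n=20):
	#get_feature_names_out() is the scikit-learn 1.0+ name for get_feature_names()
	feature_names = vectorizer.get_feature_names_out()
	#linear models expose coef_; newer scikit-learn versions of MultinomialNB only expose feature_log_prob_
	weights = clf.coef_[0] if hasattr(clf, "coef_") else clf.feature_log_prob_[0]
	coefs_with_fns = sorted(zip(weights, feature_names))
	top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
	for (coef_1, fn_1), (coef_2, fn_2) in top:
		print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))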

In the above code, “top_tfidf_feats” returns the words with the highest TF-IDF scores for a document. So in a sense, it gives the words that TF-IDF has determined to be most uniquely representing that document.

The “show_most_informative_features” prints the top features that a given classifier has determined to be most descriptive/informative for distinguishing target labels. This only works for classifiers that expose such simple coefficients (feature weights), such as multinomial naive Bayes (MultinomialNB below).
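The listings in this part refer to `features_train_transformed`, `features_test_transformed`, `labels_train` and `labels_test`, and the code creating them is not shown in this post. A minimal sketch of how such a split could be produced follows. The 25% test size, the stratification, the fixed seed and the toy data are all my assumptions for the example, and `split_vectorizer` is named differently only to avoid clashing with the `vectorizer` above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# invented toy data standing in for the issue texts and component labels
toy_texts = ["socket timeout on connect", "tcp socket closed",
             "usb device not found", "usb endpoint stall"] * 5
toy_labels = ["networking", "networking", "usb", "usb"] * 5

features_train, features_test, labels_train, labels_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42)

# fit TF-IDF on the training part only, then reuse it for the test part
split_vectorizer = TfidfVectorizer(sublinear_tf=True)
features_train_transformed = split_vectorizer.fit_transform(features_train)
features_test_transformed = split_vectorizer.transform(features_test)

print(features_train_transformed.shape, features_test_transformed.shape)
```

Fitting the vectorizer on the training part only keeps the test documents out of the vocabulary and document frequencies the model is built on.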

Here is the code to actually try out the classifiers then:

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(features_train_transformed, labels_train)

from sklearn.metrics import accuracy_score

y_pred = clf.predict(features_test_transformed)
y_true = labels_test
acc_score = accuracy_score(y_true, y_pred)
print("MNB accuracy:"+str(acc_score))

show_most_informative_features(vectorizer, clf)

#try it out on a single document
probabilities = clf.predict_proba(features_test_transformed[0])

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

#set of parameters to try
estimators = [10, 20, 30, 40, 50]
min_splits = [5, 10, 20, 30, 40, 50]
min_leafs = [1, 2, 5, 10, 20, 30]

kfold = StratifiedKFold(n_splits=5) #5-fold cross validation

best_acc = 0.0
best_rf = None
for estimator in estimators:
	for min_split in min_splits:
		for min_leaf in min_leafs:
			print("estimators=", estimator, "min_split=", min_split, " min_leaf=", min_leaf)

			clf = RandomForestClassifier(n_estimators=estimator, min_samples_leaf=min_leaf, min_samples_split=min_split)

			results = cross_val_score(clf, features_transformed, encoded_label, cv=kfold)

			print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

			if results.mean() > best_acc:
				best_acc = results.mean()
				best_rf = clf
				print("found better:", best_acc, ", ", best_rf)

print("best classifier:")
print(best_rf)

best_acc = 0
best_rf = None
for estimator in estimators:
	for min_split in min_splits:
		for min_leaf in min_leafs:
			print("estimators=", estimator, "min_split=", min_split, " min_leaf=", min_leaf)

			clf = RandomForestClassifier(n_estimators=estimator, min_samples_leaf=min_leaf, min_samples_split=min_split).fit(features_train_transformed, labels_train)

			pred = clf.predict(features_test_transformed)

			accuracy = accuracy_score(labels_test, pred)


			if accuracy > best_acc:
				best_acc = accuracy
				best_rf = clf
				print("found better:", best_acc, ", ", best_rf)

In the code above, I use loops to run through the parameters. Scikit-learn also has GridSearchCV for this, as well as RandomizedSearchCV (for cases where trying all combos is expensive). But I prefer the ability to control the loops, print out whatever I like and all that.
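For comparison, here is roughly what the same kind of search looks like with GridSearchCV. This is a sketch on synthetic data from `make_classification` with a reduced parameter grid, not a rerun of my experiment, so the numbers it prints say nothing about the issue data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# synthetic stand-in for the TF-IDF features and encoded labels
X, y = make_classification(n_samples=120, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# a smaller grid than in the loops above, to keep the example quick
param_grid = {
    "n_estimators": [10, 30],
    "min_samples_split": [5, 10],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=5),  # same 5-fold stratified setup as before
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The per-combination scores are available afterwards in `search.cv_results_`, which covers most of what the print statements in the manual loops give.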

The above code also shows two ways I tried to train/evaluate the RandomForest parameters. The first is with k-fold, the second with a single test-train split. I picked MultinomialNB and RandomForest because some internet searching gave me the impression they might work reasonably well for unbalanced class sets such as this one. Of course the final idea is always to try and see what works. This worked quite fine for me. Or so it seems, machine learning seems to be always about iterating and learning and updating as you go. More data could change this all, or maybe finding some mistake, or having more domain or analytics knowledge, finding mismatching results, or anything really.

What “unbalanced” refers to is the number of instances of different components in this dataset: some “components” have many bug reports, while others have far fewer. For many learning algorithms this seems to be an issue. Some searches indicated RandomForest should be fairly robust for this type of data, so this is also one reason I used it.

Running the above code to experiment with the parameters also produced some slightly concerning results. The accuracy for the classifier ranged from 30% to 95% with smallish parameter changes. I would guess that also speaks for the small dataset causing potential issues. Also, re-running the same code would give different classifications for new (unseen) instances. Which is what you might expect when I am not setting the randomization seed. But then I would also expect the accuracy to vary somewhat, which it didn’t. So just don’t take this as more than an example of how you might apply ML for some SW testing related tasks. Take it to highlight the need to always learn more, try more things, and get a decent amount of data, evolve models constantly, etc. And post some comments on all the things you think are wrong in this post/code so we can verify the approach of learning and updating all the time :).
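On the randomization seed: RandomForestClassifier takes a `random_state` parameter, and fixing it should make repeated runs give the same model for the same data. A small sketch on synthetic data (the seed values and the `fit_and_predict` helper are arbitrary choices for the example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data, fixed so that only the forest's seed matters
X, y = make_classification(n_samples=100, n_features=20, n_informative=5,
                           n_classes=3, random_state=1)


def fit_and_predict(seed):
    """Train a forest with the given seed and return probabilities for 5 rows."""
    forest = RandomForestClassifier(n_estimators=50, min_samples_split=5,
                                    random_state=seed)
    forest.fit(X, y)
    return forest.predict_proba(X[:5])


same_seed_equal = np.array_equal(fit_and_predict(7), fit_and_predict(7))
print(same_seed_equal)
```

With the seed fixed, any remaining variation between runs would have to come from the data or the preprocessing rather than from the forest itself.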

In any case, I hope the example is useful for giving an idea of one way how machine learning could be applied in software testing related aspects. Now write me some nice LSTM or whatever is the latest trend in deep learning models, figure out any issues in my code, or whatever, and post some comments. Cheers.