
Testing Machine Learning Applications (or self-driving cars, apparently)

Introduction

Testing machine learning (ML) based applications, or ML models, requires a different approach from more "traditional" software testing. Traditional software is controlled by manually defined, explicit control-flow logic. Or to put it differently: someone wrote the code, and it can be looked at, read, and sometimes even understood (including what and how to test). ML models, and especially deep learning (DL) models, implement complex functions that are automatically learned from large amounts of data. The exact workings are largely a black box, the potential input space is huge, and knowing exactly what and how to test them is a challenge. In this post I look at what kind of approaches have been taken to test such complex models, and consider what it might mean in the context of broader systems that use such models.

As I was going through the different approaches, and looking for ways in which they address testing ML models in general, I found that most (recent) major works in the area are addressing self-driving / autonomous cars. I guess it makes sense, as the potential market for autonomous cars is very large, and the safety requirements very strict. And it is a hot market. Anyway. So most of the topics I cover will be heavily in the autonomous cars domain, but I will also try to find some generalization where possible.

The Problem Space

Staying with autonomous cars as the example, most of them use external cameras as (one of) their inputs. In fact, Tesla seems to use mainly cameras, and built a special chip just to be super-effective at processing camera data with (DL) neural nets. Consider the input space of such cameras and systems as an example of what to test. Here are two figures from near the office where I work:

Dry vs Snow

The figure-pair above is from the same spot on two days, about a week apart. Yes, it is in May (almost summer, eh?), and yes, I am waiting for all your epic offers for a nice job to work on all this in nice(r) weather. Hah. In any case, just in these two pictures there are many variations visible. Snow/no snow, shadows/no shadows, road markers or not, connecting roads, parking lots, other cars, and so on.

More generally, the above figure-pair is just two different options from a huge number of different possibilities. Just some example variants that come to mind:

environment related:

  • snowing
  • raining
  • snow on ground
  • puddles on the ground
  • sunny
  • cloudy
  • foggy
  • shadows
  • road blocked by construction
  • pedestrians
  • babies crawling
  • cyclists
  • wheelchairs
  • other cars
  • trucks
  • traffic signs
  • regular road markings
  • poor/worn markings
  • connecting roads
  • parking lots
  • strange markings (constructions, bad jokes, …)
  • holes in the road
  • random objects dropped on the road
  • debris flying (plastic bags etc)
  • animals running, flying, standing, walking, …

car related:

  • position
  • rotation
  • camera type
  • camera quality
  • camera position
  • other sensors combined with the camera

And so on. Then all the other different places, road shapes, objects, rails, bridges, whatever. This is all on top of the general difficulty of reliably identifying even the basic roads, pavements, paths, buildings, etc. in the variety of the real world. Now imagine all the possible combinations of all of the above, and more, combined with all the different places, roads, markings, road-signs, and movements. Testing all that to some reasonable level of assurance, so you would be willing to sign it off to "autonomously drive". Somewhat complicated.

Of course, there are other ML problems in a multitude of domains, such as natural language processing, cybersecurity, medical, and everything else. The input space might be slightly more limited in some cases, but typically it grows very big, with numerous elements, combinations, and their variants.

Neural Nets

In some of the topics later I refer to different types of neural network architectures. Here is a very brief summary of a few main ones, as background in case they are not familiar.

Fully Connected / Dense / MLP

Fully connected (dense) networks are simply neural nets where every node in each layer is connected to every node in the previous layer. This is a very basic type of network, and illustrated below:

Dense

I took this image off the Internet somewhere; trying to search for its source, there seem to be a huge number of hits on it, so no idea of the real origins, sorry. The "dense" part just refers to all neurons in the network being connected to all the neurons in their neighbouring layers. I guess this is also called a form of multi-layer perceptron (MLP). But I have gotten more used to calling it Dense, so I call it Dense.

The details of this model are not important for this post, so for more info, try the keywords in Internet searches. In this post, it is mainly important to understand the basic structure, for when neuron coverage or similar concepts are mentioned later.
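To make the structure concrete, here is a minimal sketch of such a dense network in Keras. The input size, layer widths, and the single regression output (e.g., a steering angle) are my own illustrative assumptions, not from any specific system.

# A minimal dense (fully connected / MLP) network sketch in Keras.
# Input size, layer widths, and the single output are illustrative assumptions only.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(100,)),  # every neuron connects to all 100 inputs
    layers.Dense(32, activation="relu"),                      # and to all 64 neurons of the previous layer
    layers.Dense(1),                                          # single output, e.g., a predicted value
])
model.compile(optimizer="adam", loss="mse")
model.summary()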

CNN

Convolutional neural networks (CNNs) aim to automatically identify local features in data, typically in images. These features start from low-level patterns (e.g., "corners", but can be any abstraction the CNN finds useful) and become more abstract and compressed as the network progresses from the raw image towards classification. CNNs work by sliding a window (called a "kernel" or "filter") across the image to identify those features. The filters (or their weight values) are learned during training. The following animation from the Stanford Deeplearning Wiki illustrates a filter window rolling over an image:

CNN

In the above animation, the filter is sliding across the image, which is represented in this case by the numeric values of its pixels. CNNs actually consist of multiple such layers along with other layer types. But I will not go into all the details here. For more info, try the Internet.

CNNs can be used for any data that can be expressed in a similar form and has local features. For example, they are commonly applied to text using 1-dimensional filters. The idea of filters comes up with some coverage criteria later, which is why this is relevant here.
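For comparison, here is a similarly minimal CNN sketch in Keras. The filter counts, sizes, and 10-class output are illustrative assumptions; the point is just the Conv2D layers sliding small filter windows over the image.

# A minimal CNN sketch in Keras for 64x64 RGB images. Filter counts/sizes and
# the 10-class output are illustrative assumptions only.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(16, kernel_size=(3, 3), activation="relu",
                  input_shape=(64, 64, 3)),                    # 16 filters, each a 3x3 sliding window
    layers.MaxPooling2D(pool_size=(2, 2)),                     # downsample the learned feature maps
    layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),  # deeper layer: more abstract features
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),                    # e.g., 10 image classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")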

RNN

Another common network type is the recurrent neural network (RNN). Whereas a common neural network does not consider the time aspect, recurrent neural nets loop over the data in steps, and feed some of each step's data as input to the following steps. In this sense, they have some "memory" of what they have seen. They typically work better when there is a relevant time-dimension in the data, such as analyzing objects in a driving path over time, or stock market trends. The following figure illustrates a very simplified view of an unrolled RNN with 5 time-steps.

RNN

This example has 5 time-steps, meaning it takes 5 values as input, each following one another in a series (hence, "time-series"). The X1-X5 in the figure are the inputs over time, the H1-H5 are intermediate outputs for each time-step. And as visible in the network, each time-step feeds to the next as well as producing the intermediate output.

Common, more advanced variants of this are Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. Again, try some tutorial searches for details.

As with CNNs, the relevance here is that some of these concepts (e.g., unrolling) come up with coverage criteria later, and maybe as some background to understand how they work. For example, how a car system might try to predict driving angles based on a recent history of visual images it observes over time, not just a single frame.
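As a concrete sketch, an LSTM over 5 time-steps could look like the following in Keras. Predicting a steering angle from a short history of per-frame feature vectors is my own illustrative assumption here, as are all the sizes.

# A minimal RNN (LSTM) sketch in Keras: 5 time-steps, each with a 128-value
# feature vector (e.g., features extracted from one camera frame), predicting
# a single value such as a steering angle. All sizes are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.LSTM(32, input_shape=(5, 128)),  # 5 time-steps, 128 features per step
    layers.Dense(1),                        # e.g., predicted steering angle
])
model.compile(optimizer="adam", loss="mse")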

GAN

A generative adversarial network (GAN) is a two-part network, where one part generates "fake" images, and another part tries to classify them as real or fake. The idea is simply to first build a classifier that does a good job of identifying certain types (classes) of images. For example, rainy driving scenes. Like so:

Classifier base

Once this classifier works well, you train another network that tries to generate images to fool this classifier. The network that generates the "fake" images is called the generator, and the one that tries to tell the fakes from the real ones is called the discriminator. In a sense, the discriminator network provides the loss function for the generator. Meaning it just tells how well the generator is doing, and how to do better. Like so:

Generator NW

So the generator learns to generate realistic fakes by learning to fool the discriminator. In a way, one could describe such a model as "styling" an image into a different style. But GANs are used for a variety of image generation and manipulation tasks. When the fake-generator becomes good enough to fool the good classifier, you have a working GAN model. The "adversarial" part refers to the fake-generator being an adversary for the classifier (discriminator) part.
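A very stripped-down sketch of this generator/discriminator setup in Keras could look like the following. The flattened 28x28 images, layer sizes, and training loop are my own simplifications just to show the structure, nothing like what UNIT or a real image-to-image GAN does.

# A minimal GAN training-step sketch in Keras, assuming flattened 28x28 images.
# All sizes and the training loop are illustrative, not from the post or any paper.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 100

# Discriminator: classifies images as real (1) or fake (0).
discriminator = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(784,)),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Generator: maps random noise to a fake image.
generator = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(784, activation="sigmoid"),
])

# Combined model: generator + frozen discriminator provides the generator's loss.
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_step(real_images, batch_size=32):
    noise = np.random.normal(size=(batch_size, latent_dim))
    fake_images = generator.predict(noise, verbose=0)
    # Train the discriminator on real vs. generated images.
    discriminator.train_on_batch(real_images, np.ones((batch_size, 1)))
    discriminator.train_on_batch(fake_images, np.zeros((batch_size, 1)))
    # Train the generator to fool the discriminator (labels flipped to "real").
    gan.train_on_batch(noise, np.ones((batch_size, 1)))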

The following image illustrates a real GAN transformation performed by UNIT from summery image (real) to wintery image (fake):

GAN

It is a nice looking transformation, very realistic. In the context of this post, this pair provides very useful test data, as we will see later. Here is what it might look like if it were used to generate the snowy version vs. the dry version of the road I put in the beginning of this post:

GAN

So the real world is still a bit more diverse, but the generated figures from UNIT are already very impressive. I have no doubt they can be of use in testing ML systems, as I will discuss later in this post. The same approach also works beyond images, as long as you can model the data and domain as a GAN problem. I believe it has been applied to generate sounds, music, text, .. probably much more too.

For this post, the GAN background is mainly relevant for having a basic understanding of what a GAN is, as it comes up later in works on generating different types of driving scenarios for testing. For more details on the topic, I refer to the mighty Internet again.

Getting (Realistic) Data

A fundamental aspect in applying ML, and especially DL, is getting as much high quality data as you can. Basically, the more the merrier. With more data, you can train bigger and better models, get higher accuracy, run more experiments and tests, and keep improving. Commonly this data also needs "labels" to tell when something is happening in the data, to enable the ML algorithms to learn something useful from it. Such as steering a car in response to the environment.

How to get such data? Autonomous car development is, as usual, a good example here.

Car companies have huge fleets of vehicles on the road, making for great data collection tools. Consider the Tesla Autopilot as an example. At the time of writing this (May 2019), the Tesla Autopilot website describes the system as having:

  • 8 surround cameras with up to 250m of visibility
  • 12 ultrasonic sensors
  • forward facing radar (advertised as "able to see through heavy rain, fog, dust and even the car ahead")

Imagine all this deployed in all their customers' cars all the time, recording "free" training data for you. Free as in the car manufacturer (or I guess "developer" these days) does not pay people to drive and collect data; rather, the customers pay the manufacturer for the privilege of doing so. Potentially millions of customers.

At the Tesla Autonomy Day, Andrej Karpathy (famous ML guy, now director of AI at Tesla) describes the Tesla approach as first starting with a manually collected and labeled training set, followed by using their fleet of cars to identify new scenarios, objects, and anything of interest that the system should learn to handle better. Requests are then dispatched to all their cars out there to provide images and data on such scenarios if/when identified.

They also describe how they use the sensor data across different sensors to train the system better. That is, using the radar measurements to give labels for how far identified objects are, to train the vision system to recognize the distance as well.

In more traditional software testing terminology, this might be called a form of "online testing", or more generally just continuous monitoring, learning, and improvement. As I noted above, Tesla even introduced their specialized DL processing chip for their cars, called "Tesla HW3", to enable much more efficient and continuous processing of such ML models. Such systems will also be able to continuously process the data locally to provide experiences on the usefulness of the ML model (compared to user input), and to help build even better "autonomy".

Taking this further, there is something called Tesla Shadow Mode, a system that is enabled all the time when a human driver is driving the car, either alone or with the help of the autopilot. The (ML) system is constantly trained on all the data collected from all the cars, and runs in parallel while the car is driving. It does not make actual decisions but rather "simulated" ones, and compares them to how the human driver actually performed in the situation. Tesla compares the differences and uses the data to refine and analyze the system to make it better.

This makes for an even more explicit "online testing" approach. The human driver provides the expected result for the test oracle, while the current autopilot version provides the actual output. So this test compares the human driver decisions to the autonomous guidance system decisions all the time, records differences, and trains the system to do better from the differences. Test input is all the sensor data from the car.
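As a thought sketch of this kind of online comparison (not Tesla's actual implementation, obviously), the oracle check could be as simple as the following. The model object, its predict() interface, and the tolerance value are hypothetical placeholders.

# A hedged sketch of the "shadow mode" style comparison: the human driver's
# action provides the expected output (oracle), the model prediction the actual
# output. The model object and the tolerance value are hypothetical placeholders.
def shadow_compare(model, sensor_frame, human_steering_angle, tolerance=2.0):
    predicted = model.predict(sensor_frame)  # simulated decision, never acted upon
    difference = abs(predicted - human_steering_angle)
    if difference > tolerance:  # tolerance in degrees, an arbitrary assumption
        # In a real system this would be recorded for analysis / retraining data.
        print(f"disagreement: predicted={predicted:.1f}, human={human_steering_angle:.1f}")
    return difference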

Beyond Tesla, there seems to be little information to be found on how other car companies do this type of data collection. I guess maybe because it can be one of the biggest competitive edges to actually have access to the data. However, I did find a description from a few years back on the Waymo system for training their autonomous cars. This includes both an advanced simulation system (called CarCraft), as well as a physical site for building various test environments. Data is collected from actual cars driving the roads; new and problematic scenarios are collected and analyzed, modelled, and experimented with extensively in the simulations and the physical test environment. Collected data is used to further train the systems, and the loop continues. In this case, the testing includes real-world driving, physical test environments, and simulations. Sounds very similar to the Tesla case, as far as I can tell.

Cybersecurity

What about the rest of the ML world besides autonomous cars? One example that I can provide for broader context is from the realm of cybersecurity. Previously I did some work also in this area, and the cybersecurity companies are also leveraging their information and data collection infrastructure in a similar way. In the sense of using it to build better data analysis and ML models as a basis for their products and services.

Car companies such as Tesla, Waymo, Baidu, etc. might use actual cars as endpoints to collect data. In a similar way, pretty much all cybersecurity companies are utilizing their deployed infrastructure, including various end-point agents such as RSA NetWitness, TrendMicro, Carbon Black, Sophos Intercept, and McAfee, to collect data and use that data as input to improve their analytics and ML algorithms.

These are further combined with their threat intelligence platforms, such as Carbon Black Threat Analysis, Webroot BrightCloud, AlienVault Open Threat Exchange, and the Internet Storm Center. These use the endpoint data as well as the cybersecurity vendors' own global monitoring points and networks to collect data and feed it to their ML algorithms for training and analysis.

The above relates the cybersecurity aspects to the autonomous car aspects of data collection and machine learning algorithms. What about simulation to augment the data in cybersecurity vs autonomous cars? I guess the input space is a bit different in nature here, as no form of cybersecurity simulation directly comes to mind. I suppose you could also just arrange cyber-exercises and hire hackers / penetration testers etc. to provide you with the reference data in attacks against the system. Or maybe my mind is just limited, and there are better approaches to be had? 🙂

Anyway, in my experience, all you need to do is to deploy a simple server or services on the internet and within minutes (or seconds?) it will start getting hammered with attacks. And it is not like the world is not full of Chinese hackers, Russian hackers, and all the random hackers to give you real data.

I guess this illustrates some difference in domains, where cybersecurity looks more for anomalies, which the adversary might try to hide in different ways in more (semi-)structured data. Autonomous cars mostly need to learn to deal with the complexity of real world, not just anomalies. Of course, you can always "simulate" the non-anomaly case by just observing your cybersecurity infrastructure in general. This is maybe a more realistic scenario (to observe the "normal" state of a deployed information system) to provide a reference than observing a car standing in a parking lot.

In summary, different domains, different considerations, even if the fundamentals remain.

Metamorphic Testing (MT)

Something that comes up all the time in ML testing is metamorphic testing, which is closely related to property-based testing. The general idea is to describe the software functionality in terms of relations between inputs and outputs, rather than exact specifications of input to output. You take one input which produces an observed output, modify this input, and instead of specifying an exact expected output, you specify a relationship between the original input and its output and the modified input and its output.

An example of a traditional test oracle might be to input a login username and password, check that it passes with correct inputs, and fails with a wrong password. Very exact. Easy and clear to specify and implement. The typical examples for metamorphic testing are not very exciting, such as properties of the sin function. A more intuitive example I have seen is about search engines (Zhou2016). Search engines are typically backed by various natural language processing (NLP) based machine learning approaches, so it is fitting in that regard as well.

As a search-engine example, input the search string "testing" to Google, and observe some set of results. I got 1.3 billion hits (matching documents) when I tried "testing" while writing this. Change the query string to the more specific "metamorphic testing", and the number of results should be fewer than for the original query ("testing" vs "metamorphic testing"), since we just made the query more specific.

In my case I got 25k results for the "metamorphic testing" query. This is an example of a metamorphic relation: a more specific search term should return a less-than-or-equal number of results compared to the original query. So if "metamorphic testing" had resulted in 1.5 billion hits instead of 25k, the test would have failed, since it broke the metamorphic relation we defined. With 25k, it passes, as the results are fewer than for the original search query, and the metamorphic relation holds.
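Expressed as a (pseudo-real) test, the relation could look like this. The search_hit_count() helper is hypothetical, standing in for a call to whatever search engine is under test.

# A sketch of the search-engine metamorphic relation as a test. The
# search_hit_count() helper is a hypothetical stand-in for the search engine
# under test.
def search_hit_count(query: str) -> int:
    ...  # call the search engine under test, return the number of results

def test_more_specific_query_returns_fewer_results():
    base_hits = search_hit_count("testing")
    specific_hits = search_hit_count("metamorphic testing")
    # Metamorphic relation: a more specific query must not return more results.
    assert specific_hits <= base_hits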

In the following examples, metamorphic testing is applied to many views of testing ML applications. In fact, people working on metamorphic testing must be really excited since this seems like the golden age of applying MT. It seems to be a great match to the type of testing needed for ML, which is all the rage right now.

Test Generation for ML applications

This section is a look at test generation for machine learning applications, focusing on testing ML models. I think the systems using these models as part of their functionality need to be looked at in a broader context. In the end, they use the models and their predictions for something, which is the reason the system exists. However, I believe looking at testing the ML models is a good start, and complex enough already. Pretty much all of the works I look at next use variants of metamorphic testing.

Autonomous cars

DeepTest (Tian2018) uses transformations on real images captured from driving cars to produce new images. These are given as input to a metamorphic testing process, comparing the system prediction / model output (e.g., steering angle) across images. The metamorphic relation is to check that the model prediction does not significantly change across transformations of the same basic image. For example, the driving path in sun and rain should be about the same for the same road, in the same environment, with the same surrounding traffic (cars etc.). The relation is given some leeway (the outputs do not have to be a 100% match, just close), to avoid minimal changes causing unnecessary failures.

The transformations they use include (a small sketch of applying one follows the list):

  • brightness change (add/subtract a constant from pixels)
  • contrast change (multiply pixels by constant)
  • translation (moving/displacing the image by n pixels)
  • scaling the image bigger or smaller
  • shear (tilting the image)
  • rotation (around its center)
  • blur effect
  • fog effect
  • rain effect
  • combinations of all above
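Here is a rough sketch of how one of these transformations and the metamorphic relation could be applied. The model, its predict_steering_angle() interface, and the 5-degree threshold are all hypothetical assumptions of mine; real implementations like DeepTest do considerably more.

# A sketch of a single metamorphic test with a brightness transformation,
# loosely in the spirit of (Tian2018). The model, its predict_steering_angle()
# interface, and the 5-degree threshold are hypothetical assumptions.
import numpy as np

def adjust_brightness(image: np.ndarray, delta: int) -> np.ndarray:
    # Add a constant to all pixels, clipping to the valid 0-255 range.
    return np.clip(image.astype(int) + delta, 0, 255).astype(np.uint8)

def check_brightness_relation(model, image, delta=40, max_angle_diff=5.0):
    original_angle = model.predict_steering_angle(image)
    transformed_angle = model.predict_steering_angle(adjust_brightness(image, delta))
    # Metamorphic relation: the predicted path (here, steering angle) should
    # stay roughly the same for the same scene under a brightness change.
    return abs(original_angle - transformed_angle) <= max_angle_diff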

The following illustrates the idea of one such transformation on the two road images (from near my work office) from the introduction:

Tilted

For this, I simply rotated the phone a bit while taking the pics. The arrows are there just as made-up examples of driving paths that a ML algorithm could predict. Not based on any real algorithm, just an illustrative drawing in this case. In metamorphic testing, you apply such transformations (rotation in this case) on the images, run the ML model and the system using its output to produce driving (or whatever else) instructions, and compare whether the path stays similar before and after the transformation. For real examples, check (Tian2018). I just like to use my own when I can.

My example from above also illustrates a problem I have with the basic approach. This type of transformation would also change the path the car should take, since if the camera is rotated, the car would be rotated too, right? I expect the camera(s) to be fixed in the car, and not changing positions by themselves.

For the other transformations listed above this would not necessarily be an issue, since they do not affect the camera or car position. This is visible in a later GAN based example below. Maybe I missed something, but that is what I think. I believe a more complex definition of the metamorphic relation would be needed than just "path stays the same".

Since there are typically large numbers of input images, applying all the transformations above and their combinations to all of the images would produce very large input sets with potentially diminishing returns. To address this, (Tian2018) applies a greedy search strategy. Transformations and their combinations are tried on images, and when an image or a transformation pair is found to increase neuron activation coverage, it is selected for further rounds. This iterates until a defined ending threshold (number of experiments) is reached.

Coverage in (Tian2018) is measured in terms of activations of different neurons for the input images. For dense networks this is simply the activation value of each neuron. For CNNs they use averages over the convolutional filters. For RNNs (e.g., LSTM/GRU), they unroll the network and use the values for the unrolled neurons.

A threshold of 0.2 is used in their case for the activation measure (neuron activation values range from 0 to 1), even if no rationale is given for this value. I suppose some empirical estimates are in order, in whatever case, and this is what they chose. How these are summed up across the images is a bit unclear from the paper, but I guess it would be some kind of combinatorial coverage thing. This is described as defining a set of equivalence partitions based on the similar neuron activations, so maybe there is more to it, but I did not quite find it in the paper. An interesting thought in any case, to apply equivalence partitioning based on some metrics over the activation thresholds.
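To make the neuron coverage idea a bit more concrete, here is a rough sketch of how one could measure it over the dense layers of a Keras model. The layer selection and per-layer scaling are my own simplifications, not necessarily the paper's exact definition.

# A rough sketch of neuron activation coverage with a 0.2 threshold, in the
# spirit of (Tian2018). Layer selection and per-layer scaling are simplifying
# assumptions of mine.
import numpy as np
from tensorflow import keras

def neuron_coverage(model, inputs, threshold=0.2):
    dense_outputs = [l.output for l in model.layers
                     if isinstance(l, keras.layers.Dense)]
    probe = keras.Model(inputs=model.input, outputs=dense_outputs)
    activations = probe.predict(inputs, verbose=0)
    if not isinstance(activations, list):
        activations = [activations]
    covered, total = 0, 0
    for act in activations:
        # Scale each layer's activations to [0, 1] before applying the threshold.
        scaled = (act - act.min()) / (act.max() - act.min() + 1e-9)
        # A neuron counts as covered if any input activates it above the threshold.
        covered += int(np.sum(scaled.max(axis=0) > threshold))
        total += scaled.shape[1]
    return covered / total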

The results (Tian2018) get in terms of coverage increases, issues found, etc. are shown as good in the paper. Of course, I always take that with a grain of salt, since that is how academic publishing works, but it does sound like a good approach overall.

Extending this with metamorphic test generation using generative adversarial networks (GANs) is studied in (Zhang2018(1)). GANs are trained to generate images with different weather conditions. For example, taking a sunny image of a road and transforming it into a rainy or foggy image. Metamorphic relations are again defined as the driving guidance remaining the same (or very close) after this transformation. The argument is that GAN-generated weather effects and manipulations are more realistic than more traditional synthetic manipulations of the same kind (such as in DeepTest from above, and cousins). They use the UNIT (Liu2017) toolkit to train and apply the GAN models, using inputs such as YouTube videos for training.

The following images illustrate this with two images from the UNIT website:

UNIT

On the left in both pairs is the original image, on the right the GAN (UNIT) transformed one. The pair on the left is transformed into a night scene, the pair on the right into a rain scene. Again, I drew the arrows myself, and this time they would also make more sense. The only thing changed is the weather; the shape of the road, other cars, and other environmental factors remain the same. So the metamorphic test relation, where the path should not change, should hold true.

The test oracle in this work is similar to other works (e.g., (Tian2018) above). The predictions for the original and transformed image are compared and should closely match.

Input Validation

A specific step of input validation is also presented in (Zhang2018(1)). Each image analyzed by the automated driving system is first verified in relation to known input data (images it has seen before and was trained on). If the image is classified as significantly different from all the training data previously fed into the network, the system is expected to advise the human driver to take over. The approach for input validation in (Zhang2018(1)) is to project the high-dimensional input data into a smaller dimension, and measure the distance of this projection from the projections of the used training data. If the distance crosses a set threshold, the system alerts the human driver about potentially unknown/unreliable behaviour. Various other approaches to input validation could probably be taken; in general the approach seems to make a lot of sense to me.
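As a sketch of the general idea (the PCA projection and the percentile-based threshold here are my own choices; (Zhang2018(1)) may use something different):

# A hedged sketch of the input-validation idea: project inputs into a lower
# dimension and flag those whose projection is far from the training-data
# projections. PCA and the percentile threshold are assumptions of mine.
import numpy as np
from sklearn.decomposition import PCA

class InputValidator:
    def __init__(self, train_data, n_components=10, percentile=99.0):
        self.pca = PCA(n_components=n_components).fit(train_data)
        self.train_proj = self.pca.transform(train_data)
        # Calibrate: distance of each training point to the projection centroid,
        # threshold set so that e.g. 99% of training data falls below it.
        self.centroid = self.train_proj.mean(axis=0)
        dists = np.linalg.norm(self.train_proj - self.centroid, axis=1)
        self.threshold = np.percentile(dists, percentile)

    def is_known(self, x):
        proj = self.pca.transform(np.asarray(x).reshape(1, -1))
        dist = np.linalg.norm(proj - self.centroid)
        # False here would mean: advise the human driver to take over.
        return dist <= self.threshold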

LIDAR analysis

Beyond image-based approaches only, a study of applying metamorphic testing on the Baidu Apollo autonomous car system is presented in (Zhou2019). In this case, the car has a LIDAR system in place to map its surroundings in detail. The system first identifies a region of interest (ROI), the "drivable" area. Which sounds like a reasonable goal, although I found no description of how it is identified in (Zhou2019). This system consists of multiple components:

  • Object segmentation and bounds identification: Find and identify obstacles in ROI
  • Object tracking: Tracking the obstacles (perhaps their movement? hard to say from paper)
  • Sequential type fusion: To smooth the obstacle types over time (I guess to reduce misclassification or lost objects during movement over time)

Their study focuses on the object identification component. The example of a metamorphic relation given in (Zhou2019) is to add new LIDAR points outside the ROI area to an existing point cloud, and run the classifier on both the before- and after-state. The metamorphic relation is again that the same obstacles should be identified both before and after adding small amounts of "noise". In relation to the real world, such noise is described as potentially insects, dust, or sensor noise. The LIDAR point cloud data they use is described as having over 100000 samples per point cloud, and over a million per second. The amount of noise injected is varied with values of 10, 100, and 1000 points. These are considered small numbers compared to the total of over 100k points in each point cloud (thus "reasonable" noise). The experiments produce errors (violations of the metamorphic relation) at rates of 2.7%, 12.1%, and 33.5%, respectively.
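A simplified sketch of this kind of test could look like the following. The detect_obstacles() function, the ROI bounds format, and how obstacles are compared are hypothetical placeholders for the actual Apollo components.

# A simplified sketch of the LIDAR metamorphic test idea in (Zhou2019).
# detect_obstacles(), the ROI bounds, and the point-cloud format are
# hypothetical placeholders.
import numpy as np

def add_noise_outside_roi(point_cloud, roi_bounds, n_points=100):
    # Generate random points placed outside the region of interest (ROI).
    low, high = roi_bounds  # hypothetical: ROI given as scalar coordinate bounds
    noise = np.random.uniform(high, high + 10.0, size=(n_points, point_cloud.shape[1]))
    return np.vstack([point_cloud, noise])

def test_obstacles_stable_under_noise(detect_obstacles, point_cloud, roi_bounds):
    before = detect_obstacles(point_cloud)
    after = detect_obstacles(add_noise_outside_roi(point_cloud, roi_bounds))
    # Metamorphic relation: adding a small amount of noise outside the ROI
    # should not change which obstacles are detected inside it.
    assert set(before) == set(after)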

The point cloud is illustrated by this image from the paper:

LIDAR

In the figure above, the green boxes represent detected cars, and the small pink one a pedestrian. The object detection errors found were related to missing some of these types of objects when metamorphic tests were applied. The results they show also classify the problems to different numbers of mis-classifications per object type, which seems like a useful way to report these types of results for further investigation.

(Zhou2019) report discussing their findings with the Baidu Apollo team, getting acknowledgement for the issues, and that the MT approach was considered a potentially useful way to augment their training and test data. This is in line with what I would expect: ML algorithms are trained and evaluated on the training dataset. If testing finds misclassifications, it makes sense to use the generated test data as a way to improve the training dataset.

This is also visible in the data acquisition part I discussed earlier, in using different means to augment data collected from real-world experiments, real-world test environments, and simulations. MT can function as another approach to this data augmentation, but with an integrated test oracle and data generator.

Other Domains

Besides image classifiers for autonomous cars, ML classifiers have been applied broadly to image classification in other domains as well. A metamorphic testing related approach to these types of more generic images is presented in (Dwarakanath2018). The image transformations applied in this case are:

  • rotate by 90, 180, 270
  • flip images vertically or horizontally
  • switching RGB channels, e.g., RGB becomes GRB
  • normalizing the (test) data
  • scaling the (test) data

The main difference I see in this case is the domain-specific limitations of autonomous cars vs general images. If you look to classify random images on the Internet or on Facebook, Instagram, or similar places, you may expect them to arrive in any sort of form or shape. You might not expect the car to be driving upside down or to have strangely manipulated color channels. You might expect the car system to use some form of input validation, but that would happen before running the input images through the classifier/DL model. So maybe rotation by 90 degrees is not so relevant to test for the autonomous car classifier (except as part of input validation before the classifier), but it can be of interest to identify a rotated boat in Facebook posts.

Another example of similar ideas is in (Hosseini2017) (probably many others too, …). In this case they apply transformations as negations of the images. Which, I guess, again could be useful to make general classifiers more robust.

An example of applying metamorphic testing to machine learning in the medical domain is presented in (Ding2017). This uses MT to generate variants of existing high-resolution biological cell images. A number of metamorphic relations are defined to check, related to various aspects of the cells (mitochondria etc.) in the images, and the manipulations done to the image. Unlike the autonomous car-related examples, these seem to require in-depth domain expertise and as such I will not go very deep into those (I am no expert in biological cell structures). Although I suppose all domains would need some form of expertise, the need just might be a bit more emphasized in the medical domain.

Generating Test Scenarios

An approach combining model-based testing, simulation, and metamorphic testing for testing an "AI" (so machine learning..) based flight guidance system of autonomous drones is presented in (Lindvall2017).

The drone is defined as having a set of sensors (or subset of these):

  • inertial measurement unit
  • barometer
  • GPS
  • cameras
  • LIDAR
  • ultrasonic range finder

Metamorphic relations:

  • behaviour should be similar across similar runs
  • rotation of world coordinates should have no effect
  • coordinate translation: same scenario in different coordinates should have no effect
  • obstacle location: same obstacle in different locations should have same route
  • obstacle formation: similar to location but multiple obstacles together

The above are related to modifying the environment. Relations that are used as general test oracle checks, where values should always stay within certain bounds:

  • obstacle proximity
  • velocity
  • altitude

MBT is typically used to represent the behaviour of a system as a state-machine. The test generator traverses the state-machine, generating tests as it goes. In this case, the model first generates test steps that create different elements (landing pads, obstacles, etc.) in the test environment. The model state contains the generated environment. After this initial phase, the model generates test steps related to flying: lifting off, returning to home base, landing on a landing pad, etc.
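A very small sketch of what such a model-based test generator might look like; this is just my own illustration of the general MBT idea, with made-up step names and random action choices, not the actual generator in (Lindvall2017).

# A tiny sketch of the general MBT idea: a model first generates environment
# setup steps, then flight steps, keeping the generated state as it goes.
# Step names and the random choice of actions are my own illustration.
import random

class DroneTestModel:
    def __init__(self):
        self.environment = []   # generated landing pads, obstacles, ...
        self.flying = False

    def environment_steps(self, n=3):
        for _ in range(n):
            element = random.choice(["landing_pad", "obstacle", "waypoint"])
            self.environment.append(element)
            yield f"create {element}"

    def flight_steps(self, n=5):
        self.flying = True
        yield "lift off"
        for _ in range(n):
            yield random.choice(["fly to waypoint", "return to home base"])
        self.flying = False
        yield "land on landing pad"

model = DroneTestModel()
test_case = list(model.environment_steps()) + list(model.flight_steps())
print(test_case)  # a generated scenario: environment setup followed by flight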

I guess the usefulness of this depends on the accuracy of the simulation, as well as how the sensors are used. Overall, it does seem like a good approach to generate environments and scenarios, with various invariant-based checks for generic test oracles. For guiding a car based on camera input, maybe this is not so good. For some sensor types where the input space is well known, realistic models of the environment and real sensor data may be less complex and easier to create.

Testing ML Algorithms

A large part of the functionality of ML applications depends on ML frameworks and their implementations of algorithms. Thus their correctness is quite related to the applications, and there are a lot of similar aspects to consider. As noted many times above, it is difficult to know exactly what the expected result of an ML model is, when the "prediction" is correct, and so on. This is exactly the same scenario with the implementation of those learning algorithms, as their output is just those models, their training, the predictions they produce, and so on. How do you know what they produced is correct? That the implementation of those models and their training is "correct"?

A generic approach to test anything when you have multiple implementations is to pit those implementations against each other. Meaning to run them all with the same input and configurations, and compare the outputs. In (Pham2019) such an approach is investigated by using Keras as the application programming interface (API), and various backends for it. The following figure illustrates this:

Keras Layers

The basic approach here is to run the different backends for the same models and configurations, and compare results. There are often some differences in the results of those algorithms, even if the implementation is "correct", due to some randomness in the process, the GPU operations used, or some other factor. Thus, (Pham2019) rather compares the distributions and "distances" of the results to see if there are large deviations in one of the multiple implementations compared to others.
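A simplified sketch of this cross-implementation comparison idea follows. The two predict functions are hypothetical placeholders, and the mean-absolute-difference metric is simpler than the actual measures used in CRADLE (Pham2019).

# A simplified sketch of the differential-testing idea behind (Pham2019):
# run the same inputs through two implementations and flag large deviations.
# The predict functions are hypothetical placeholders; the metric is simplified.
import numpy as np

def compare_backends(predict_a, predict_b, inputs, tolerance=1e-3):
    out_a = np.asarray(predict_a(inputs))
    out_b = np.asarray(predict_b(inputs))
    # Small numeric differences are expected (GPU ops, randomness); only large
    # deviations indicate a likely implementation bug in one of the backends.
    deviation = np.mean(np.abs(out_a - out_b))
    return deviation, deviation > tolerance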

This topic is explored a bit further in (Zhang2018(2)), where issues (bug reports etc.) against Tensorflow are explored. The issues are accessed via the project's Github issue tracker, and related questions on StackOverflow. Some of these remind me of the topics listed in Karpathy's recent post on model training "gotchas".

One issue from (Zhang2018(2)) that I have found especially troublesome myself is "stochastic execution", referring to different runs with the same configurations, models, and data producing differing results. In my experience, even with fixing all the seed values, exact reproducibility can be a challenge. At least in Kaggle competitions, where one is aiming for every slightest gain possible :). In any case, when testing the results of trained ML algorithms across iterations, I have found exact reproducibility can be an issue in general, which this refers to.

Robustness

In the above sections, I have discussed the general testing of ML applications (models). This is the viewpoint of using techniques such as metamorphic testing to generate numerous variants of real scenarios, with the metamorphic relations as a test oracle. This does not address the functionality of the system from the viewpoint of a malicious adversary, or more generally of "stranger" inputs. In machine learning research, these are usually called adversarial inputs (Goodfellow2018).

An example from (Goodfellow2018) is to fool an autonomous car (surprise) into misclassifying a stop sign, potentially leading to an accident or other issues. The following example illustrates this, from the web version of the (Goodfellow2018) article:

Adversarial example

The stop sign on the left is the original, the one on the right is the adversarial one. I don’t see much of a difference, but a ML classifier can be confused by embedding specific tricky patterns into the image.

Such inputs are designed to look the same to a human observer, but modified just slightly in ways that fool the ML classifier. Typically this is done by embedding into the data some patterns invisible to the human eye that the classifier picks up. These are types of patterns not present in the regular training data, which is why this is not visible during normal training and testing. Many such attacks have been found, and ways to design them developed. A taxonomy in (Goodfellow2018) explores the possible attack types and difficulty levels vs adversarial goals. The goals range from reducing classification confidence (sometimes enough to satisfy adversarial needs) to full misclassification (can't go more wrong than that).
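For context, one widely known white-box technique for crafting such inputs is the fast gradient sign method (FGSM); a rough TensorFlow sketch is below. This is not the constraint-solving approach described next, just an example of using full model knowledge (its gradients) to perturb an input. The model and input format are placeholder assumptions.

# A rough sketch of FGSM (fast gradient sign method), one common white-box
# adversarial input technique. The model (assumed to output class probabilities)
# and the input image are placeholders; epsilon controls perturbation size.
import tensorflow as tf

def fgsm_perturb(model, image, true_label, epsilon=0.01):
    image = tf.convert_to_tensor(image[None, ...], dtype=tf.float32)
    label = tf.convert_to_tensor([true_label])
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    with tf.GradientTape() as tape:
        tape.watch(image)
        prediction = model(image)
        loss = loss_fn(label, prediction)
    # Step the input in the direction that increases the loss the most.
    gradient = tape.gradient(loss, image)
    adversarial = image + epsilon * tf.sign(gradient)
    return tf.clip_by_value(adversarial, 0.0, 1.0)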

The "bullet-proof" way to create such inputs as described in (Goodfellow2018) is to have full knowledge of the model, its (activation) weights, and input space. To create the desired misclassification, the desired output (misclassification) is first defined, followed by solving the required constraints in the model backwards to the input. This reminds me of the formal methods vs testing discussions that have been ongoing for decades on the "traditional" software engineering side. I have never seen anyone apply formal methods in practice (I do know that for safety-critical systems they are used and very important), but can see their benefit in really safety-critical applications. I guess self-driving (autonomous) cars would qualify as one, and as such this type of approach to provide further assurance on the systems would be very beneficial.

I would expect that wide-spread practical adoption of such an approach would, however, require this type of analysis to be provided as a "black-box" solution for most to use. I don't see myself writing a solver that solves a complex DL network backwards. But I do see the benefits, and with a reasonable level of required expertise, could use it. I had trouble following the explanation and calculations in (Goodfellow2018), and believe many could be willing to pay for such tools and services where available (and if having the funding, which I guess should be there if developing something like autonomous cars). As long as I could trust and believe in the tool/service quality.

This topic is quite extensively studied, and just searching for adversarial machine learning on the Internets gives plenty of hits. One recent study on the topic is (Elsayed2018) (also with Goodfellow..). It shows some interesting numbers on how many of the adversarial inputs fool the classifier and how many fool a human. The results do not show the attacks succeeding 100% of the time, so perhaps that, together with the countermeasures discussed in (Goodfellow2018) and similar papers (such as verifying expected properties of the input), could make for plausible input validation strategies against this, similar to the ones I discussed earlier in this post.

Test Coverage for ML applications

Test coverage is one of the basic building blocks of software testing. If you ever want to measure something about your test efforts, test coverage is likely to be at the very top of the list. It is then quite natural to start thinking about measuring test coverage in terms of ML/DL models and their applications. However, similar to ML testing, test coverage in ML is a more complicated topic.

Traditional software consists of lines of code, written to explicitly implement a specific logic. It makes sense to set a goal of having high test coverage to cover (all) the lines of code you write. The code is written for a specific reason, and you should be able to write a test for that functionality. Of course, the real world is not that simple, and personally, I have never seen 100% code coverage achieved. There are also other criteria, such as requirements coverage, that can be mapped in different ways to code and other artefacts.

As discussed throughout this post, the potential input space for ML models is huge. How do you evaluate the coverage of any tests over such models? For example, two common models referenced in (Ma2018) are VGG-19 and ResNet-50, with 16000 and 94000 neurons over 25 and 176 layers, respectively. Other models, such as recent NLP models from OpenAI, have many more. Combined with the potential input space, that is quite a few combinations to cover everything.

Coverage criteria for neural nets are explored in (Ma2018), (Sun2018), (Li2019) (and I am sure other places..). As far as I can see, they all take quite a similar approach to coverage measurement and analysis. They take a set of valid input data and use it to train and test the model. Then they measure various properties of the neural nets, such as neuron activations, layer activations, etc., as a basis for coverage. Some examples of coverage metrics applied (a small sketch of one follows the lists):

(Ma2018)

  • k-multisection: splits the neuron activation value range into k sections and measures how many are covered
  • boundary: measures explicitly boundary (top/bottom) coverage of the activation function values
  • strong boundary: same as boundary but only regarding the top value
  • top-k: how many neurons have been the top-k active ones in a layer
  • top-k patterns: pair-wise combination patterns over top-k

(Sun2018)

  • neurons: neurons that have achieved their activation threshold
  • modified condition/decision coverage: different inputs (neurons) causing activation value of this neuron to change
  • boundary: activation values below/above set top/bottom boundary values
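As a small sketch of one of these, k-multisection coverage over a single layer's activations could be computed roughly like this. This is my own simplified reading of the metric in (Ma2018), not their reference implementation.

# A simplified sketch of k-multisection neuron coverage from (Ma2018):
# split each neuron's activation range (observed on training data) into k
# sections and count how many sections the test inputs actually hit.
import numpy as np

def k_multisection_coverage(train_activations, test_activations, k=10):
    # Per-neuron activation ranges, taken from the training data.
    lows = train_activations.min(axis=0)
    highs = train_activations.max(axis=0)
    n_neurons = train_activations.shape[1]
    hit = np.zeros((n_neurons, k), dtype=bool)
    for neuron in range(n_neurons):
        width = (highs[neuron] - lows[neuron]) / k or 1e-9
        sections = ((test_activations[:, neuron] - lows[neuron]) // width).astype(int)
        # Only count activations that fall inside the training-observed range.
        for s in sections[(sections >= 0) & (sections < k)]:
            hit[neuron, s] = True
    return hit.sum() / (n_neurons * k)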

The approaches typically use the results to guide where to generate more inputs. Quite often this just seems to be another way to generate further adversarial inputs, or to identify where the adversarial inputs could be improved (increased coverage).

The results in (Ma2018) also indicate how different ML models for the same data require different adversarial inputs to trigger the desired output (coverage). This seems like a potentially useful way to guard against such adversarial inputs as well, by using such "coverage" measures to help select diverse sets of models to run and compare.

The following figure is a partial snippet from (Ma2018), showing some MNIST data:

MNIST

MNIST is a dataset with images of handwritten digits. In this case, a number of adversarial input generators have been applied on the original input. As visible, the adversarial inputs are perturbed from the original, and in ways which actually do not seem too strange, simply somewhat "dirty" in this case. I can see how robustness could be improved by including these, as in the real world the inputs are not perfect and clean either.

Another viewpoint is presented in (Kim2019), using the term surprise adequacy to describe how "surprised" the DL system would be by specific inputs. This surprise is measured in terms of new coverage added over the model. The goal is to produce as surprising an input as possible, to increase coverage. I assume within the limits of realistic inputs, of course. In practice this seems to translate to traversing new "paths" in the model (activations), or distance of activation values from previous activation values.

Some criticism on these coverage criteria is presented in (Li2019). Mainly with regards to whether such coverage criteria can effectively capture the relevant adversary inputs due to the extensive input and search space. However, I will gladly take any gain I can get.

It seems to me that DNN coverage analysis as well as its use are still work in progress, as the field is finding its way. Similarly to metamorphic testing, this is also probably a nice time to publish papers, as shown by all the ones I listed from just the past year here. Hopefully in the following years we will see nice practical tools and methods to make this easier from an application perspective.

Final Thoughts

Many of the techniques I discussed in the above sections are not too difficult to implement. The machine learning and deep learning frameworks and tools have come a long way and provide good support for building classifiers with combinations of different models (e.g., CNN, RNN, etc.). I have played with these using Keras with Tensorflow as a backend, and there are some great online resources available to learn from. Of course, the bigger datasets and problems do require extensive efforts and resources to collect and fine-tune.

As a basis for considering the ML/DL testing aspects more broadly, some form of a machine learning / deep learning defect model would seem like a good start. There was some discussion on different defect types in (Zhang2018(2)), from the viewpoint of classifying existing defects. From the viewpoint of this post, it would be interesting to see one related to:

  • testing and verification needs,
  • defect types in actual use,
  • how the different types such as adversarial inputs relate to this,
  • misclassifications,
  • metamorphic relations,
  • domain adaptations.

This type of model would help more broadly analyze and apply testing techniques to real systems and their needs.

Since I have personally spent time working on model-based testing, the idea of applying MBT with MT for ML as in (Lindvall2017) to create new scenarios in combination with MT techniques seems interesting to me. Besides testing some ML algorithms on their own, this could help build more complex overall scenarios.

Related to this, the approaches I described in this post seem to be mostly from a relatively static viewpoint. I did not see consideration for changing sensor (data collection) environments, which seems potentially important. For example:

  • What happens when you change the cameras in your self-driving car? Add/Remove?
  • The resolution of other sensors? Radar? LIDAR? Add new sensors?
  • How can you leverage your existing test data to provide assurance with a changed overall "data environment"?

The third one would be a more general question, the first two would change with domains. I expect testing such changes would benefit from metamorphic testing, where the change being tested is from the old to the new configuration, and relations measure the change in the ultimate goal (e.g., driving guidance).

Data augmentation is a term used in machine learning to describe new, possibly synthetic, data added to existing model training data to improve model performance. As I already discussed in some parts of this post above, it can be tempting to just consider the data generated by metamorphic testing as more training data. I think this can make sense, but I would not just blindly add all possible data, but rather keep the testing part focused on evaluating the robustness and overall performance of the model. Keeping that in mind when adding more data, to still be able to independently (outside the training data) perform as much verification as possible. And not add all possible data if it shows no added benefit in testing.

Overall, I would look forward to getting more advanced tools to apply the techniques discussed here. Especially in areas of:

  • metamorphic testing and related application to ML models and related checks
  • GAN application frameworks, on top of other frameworks like Keras or otherwise easily integratable in other tools
  • adversarial input generation tools using different techniques
  • job offers 🙂

It's an interesting field to see how ML applications develop, autonomous cars and everything else. Now I should get back to Kaggle and elsewhere to build some mad skills in this constantly evolving domain, eh.. 🙂

References

A. Dwarakanath et al., "Identifying implementation bugs in machine learning based image classifiers using metamorphic testing", 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2018

J. Ding, X-H. Hu, V. Gudivada, "A Machine Learning Based Framework for Verification and Validation of Massive Scale Image Data", IEEE Transactions on Big Data, 2017.

G. F. Elsayed et al., "Adversarial Examples that Fool both Computer Vision and Time-Limited Humans", 32nd Conference on Neural Information Processing Systems (NeurIPS), 2018.

I. Goodfellow, P. McDaniel, N. Papernot, "Making Machine Learning Robust Against Adversarial Inputs", Communications of the ACM, vol. 61, no. 7, 2018.

H. Hosseini et al., "On the Limitation of Convolutional Neural Networks in Recognizing Negative Images", 6th IEEE International Conference on Machine Learning and Applications, 2017.

J. Kim, R. Feldt, S. Yoo, "Guiding Deep Learning System Testing using Surprise Adequacy", International Conference on Software Engineering (ICSE), 2019.

Z. Li et al., "Structural Coverage Criteria for Neural Networks Could Be Misleading", International Conference on Software Engineering (ICSE), 2019.

M. Lindvall et al., "Metamorphic Model-based Testing of Autonomous Systems", IEEE/ACM 2nd International Workshop on Metamorphic Testing (MET), 2017.

M-Y. Liu, T. Breuel, J. Kautz, "Unsupervised Image-to-Image Translation Networks", Advances in Neural Information Processing Systems (NIPS), 2017.

L. Ma et al., "DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems", 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE), 2018.

H.V. Pham, T. Lutellier, W. Qi, L. Tan, "CRADLE: Cross-Backend Validation to Detect and Localize Bugs in Deep Learning Libraries", International Conference on Software Engineering (ICSE), 2019.

Y. Sun et al., "Concolic testing for deep neural networks", 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE), 2018

Y. Tian et al., "DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars", 40th International Conference on Software Engineering (ICSE), 2018.

M. Zhang (1) et al., "DeepRoad: GAN-Based Metamorphic Testing and Input Validation Framework for Autonomous Driving Systems", 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE), 2018.

Y. Zhang (2) et al., "An Empirical Study on TensorFlow Program Bugs", 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2018.

Z. Q. Zhou, S. Xiang, T. Y. Chen, "Metamorphic Testing for Software Quality Assessment: A Study of Search Engines", IEEE Transactions on Software Engineering, vol. 42, no. 3, 2016.

Z. Q. Zhou, L. Sun, "Metamorphic Testing of Driverless Cars", Communications of the ACM, vol. 62, no. 3, 2019.

Machine Learning in Software Testing

Early 2019 Edition

Introduction

Software testing has not really changed all that much in the past decades. Machine learning on the other hand is a very rapidly evolving technology being adopted all over the place. So what can it bring to software testing?

Back in 2018 (so about a year ago from now) I did a review of machine learning (ML) (and deep learning (DL)) applications in Network Analysis and Software Testing. After spending some more time learning and trying ML/DL in practice, this is an update on the ML-for-testing part, reflecting on my own learnings and some developments over the past year. Another interesting part would be testing ML systems; I will get to that in another post.

In last year's review, I focused on several topics. A recent academic study (Durelli2019) in this area also lists a number of topics. This includes topics such as "learning test oracles", which basically translates to learning a model of a system's behaviour based on some observations or other data about the software behaviour. Last year I included this under the name of specification mining. In practice, I have found such learned behavioral models to be of limited use, and have not seen general uptake anywhere in practice. In this review I focus on fewer topics that I find more convincing for practical use.

I illustrate these techniques with this nice pic I made:

Topics

In this pic, the "Magic ML Oracle" is just a ML model, or a system using such a model. It learns from a set of inputs during the training phase. In the figure above this could be some bug reports linked to components (file handling, user interface, network, …). In the prediction phase it runs as a classifier, predicting something such as which component a reported issue should be assigned to, how fault-prone an area is (e.g., to focus testing), or how tests and specs are linked (in case of missing links).

The topics I cover mainly relate to using machine learning to analyze various test related artefacts as in the figure above. One example of this is the bug report classifier I built previously. Since most of these ML techniques are quite general, just applied to software testing, ideas from broader ML applications could be useful here as well.

Specifically, software testing is not necessarily that different from other software engineering activities. For example, Microsoft performed an extensive study (Kim2017) on their data scientists and their work in software engineering teams. This work includes bug and performance analysis and prioritization, as well as customer feedback analysis, and various other quality (assurance) related topics.

As an example of concrete ML application to broader SW engineering, (Gu2018) maps natural language queries to source code to enable code search. To train a DL model for this, one recurrent neural network (RNN) based model is built for the code description (from source comments), and another one for the source code. The output of these two is a numerical feature vector. Cosine similarity is a measure used to compare how far apart two such vectors are, and here it is used as the training loss function. This is a nice trick to train a model to map source code constructs to natural language "constructs", enabling mapping short queries to new code in similar ways. It is also nicely described in the morning paper. I see no reason why such queries and mappings would not work for test related searches and code/documents as well. In fact, many of the techniques I list in following sections use quite similar approaches.

Like I said, I am focusing on a smaller set of more practical topics than last year in this still-overly-long post. The overall idea of how to apply these types of techniques in testing in general should not be too different though. This time, I categorize topics into test prioritization, bug report localization, defect prediction, and traceability analysis. One section to go over each, one to rule over them all.

Test Prioritization

As (software) organizations and projects grow over time, their codebase tends to grow with them. This would lead to also having more tests to cover that codebase (hopefully…). Running all possible test cases all the time in such a scenario is not always possible or cost-efficient, as it starts to take more and more time and resources. The approach to selecting a subset of the tests to execute has different names: test prioritization, test suite optimization, test minimization, …

The goal with this is to cover as much of the fault-prone areas as possible with fewer tests, such as in this completely made-up image of mine to illustrate the topic:

Prioritized coverage

Consider the coverage % in the above to reflect, for example, covering changes since the last time tests were run. The aim is to check that the changes did not break anything, as fast and efficiently as possible.

An industrial test prioritization system used at Google (in 2017) is described in (Memon2017). This does not directly discuss using machine learning (although it mentions it as a future plan for the data). However, I find it interesting as general data analysis of testing-related data as a basis for test prioritization. It also works as a basis for a set of features for ML algorithms, as understanding and tuning your data is really the basis for all ML applications.

The goal in this (Memon2017) case is two-fold: better utilizing test resources (focusing on potentially failing tests) and providing feedback to developers about their commits. The aim is not 100% accurate predictions but rather focusing automated test execution and providing the developer with feedback such as "this commit is 95% more likely to cause breakage due to the code being touched by 5 developers in the past 10 days, and being written in Java". The developers can use this feedback to seek further assurance in additional reviews, more testing, static analysis, and so on.

Some of the interesting features/findings described in (Memon2017):

  • Only about 1% of the tests ever failed during their lifetime.
  • Thus about a 99% speedup would be possible if the right tests could be identified.
  • Use of dependency distance as a feature: which other components depend on the changed component, and through how many intermediate components (a small sketch of this idea follows the list).
  • Test targets further away from the change are much less likely to fail. So dependency distance seems like a useful prediction (feature) metric. They used a threshold of 10 for their codebase, which might vary by project but the idea likely holds.
  • Files/targets modified more often are more likely to cause breakages.
  • File type affects likelihood of breakage.
  • User/tool id affects likelihood of breakage.
  • The number of developers having worked on a single file in a short time affects the likelihood of breakage. More developers means a higher likelihood of breaking something.
  • The number of test targets affected by a change varies greatly, maybe requiring different treatment.
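
To make the dependency distance idea a bit more concrete, here is a rough sketch of my own interpretation (not Google's code): a breadth-first search over a dependency graph, starting from the changed target and recording how many hops away each dependant (and thus each test target) is.

from collections import deque

def dependency_distances(dep_graph, changed_target, max_depth=10):
    # dep_graph: dict mapping a target to the targets that depend on it
    distances = {changed_target: 0}
    queue = deque([changed_target])
    while queue:
        current = queue.popleft()
        if distances[current] >= max_depth:
            continue
        for dependant in dep_graph.get(current, []):
            if dependant not in distances:
                distances[dependant] = distances[current] + 1
                queue.append(dependant)
    return distances

# test targets farther than the threshold (10 in the Google case) could then be
# skipped or given a low priority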

A similar set of features is presented in (Bhagwan2018):

  • Developer experience: Developer time in the organization and project
  • Code ownership: More developers changing files/components cause more bugs
  • Code hotspots: Specific parts of code that cause issues when changed
  • Commit complexity: Number of changes, changed files, review comments in a single commit. More equals more bugs.

A test prioritization approach taken at Salesforce.com is described in (Busjaeger2016). They use five types of features:

  • code coverage
  • text path similarity
  • text content similarity
  • test failure history
  • test age

In (Busjaeger2016), the similarity scores are based on TF-IDF scores and their cosine similarity. TF-IDF simply weights the frequency of a word in a document against the frequency of the same word in all other documents, to identify the most specific terms for a document. The features are fed into a support-vector model to rank which tests to execute first. In their (Busjaeger2016) experience, about 3% of all tests are required to reach about 75% coverage, and of the predicted tests, about every 5th is found to cause a failure.
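
As a small toy illustration of the TF-IDF + cosine-similarity part (my own sketch with made-up file lists, not code from the paper), here character n-grams are used so that path similarity also works:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

changed_files = ["src/net/http_client.py", "src/ui/dialog.py"]
test_files = ["test/net/test_http_client.py", "test/ui/test_dialog.py"]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
matrix = vectorizer.fit_transform(changed_files + test_files)
changed_vecs = matrix[:len(changed_files)]
test_vecs = matrix[len(changed_files):]

# similarity[i, j]: how similar changed file i is to test file j,
# usable as one ranking feature among the five listed above
similarity = cosine_similarity(changed_vecs, test_vecs)
print(similarity)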

I find the above provide good examples of data analysis, and basis for defining ML features.

In the (Durelli2019) review, several studies are listed under test prioritization, but most of these do not strike me as very realistic examples of ML applications. However, one interesting approach is (Spieker2017), which investigates using reinforcement learning for test prioritization. It uses only three features: execution time (duration), last execution (whatever that means..), and failure history. This seems like a rather simple set of features for building a complex model, and it seems likely to me that a simpler model would also work here. The results in (Spieker2017) are presented as good, but they are not investigated in depth, so it is hard to judge from the paper alone. However, I did find the approach to present some interesting ideas in relation to this:

  • Continuous integration systems constantly execute the test suites so you will have a lot of constantly updating data about test suites, execution, results available
  • Continuously updating the model over time based on a last N test runs from past
  • Using a higher exploration rate over the full suite to bootstrap the model, lowering it over time as the model learns, but never setting it to zero
  • Using "test case" as model state, and assigning it a priority as an action
  • Listing real "open-data" industry-based datasets to evaluate prioritization ML models on

I would be interested to see how well a simple model, such as Naive Bayes, weighting the previous pass/fail results and some pattern over their probabilities, would work. From the paper it is hard to tell. In any case, the points above would be interesting to explore further.
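
As a toy sketch of that simple-model idea (my own code, not from the paper; the feature rows and labels are made up), a Naive Bayes classifier over duration, recency, and failure history could be used to order the tests:

import numpy as np
from sklearn.naive_bayes import GaussianNB

# made-up rows: [duration_s, runs_since_last_execution, failures_in_last_10_runs]
X = np.array([[12.0, 1, 0], [3.5, 1, 2], [45.0, 5, 0], [7.2, 2, 1]])
y = np.array([0, 1, 0, 1])   # 1 = test failed on its latest run

model = GaussianNB().fit(X, y)
fail_prob = model.predict_proba(X)[:, 1]
priority_order = np.argsort(-fail_prob)   # run the most failure-prone tests first
print(priority_order)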

A Thought (maybe Two)

I assume ML has been applied to test prioritization, just not so much documented. For example, I expect Google would have taken their studies further and used ML for this as discussed in their report (Memon2017). Test prioritization seems like a suitably complex problem, with lots of easily accessible data, and with a clear payoff in sight, to apply ML. The more tests you have, the more you need to execute, the more data you get, the more potential benefit.

In this, as in many advanced research topics, I guess the "killer app" might come from integrating all this into a test system / product as a black box. This would let everyone make use of it without having to learn all the "ML in test" details outside their core business. The same, I guess, applies to the other topics I cover in the following sections.

Bug Report Localization

Bug report localization (in this case anyway) involves taking a bug report and finding the component or other part of the software that the report most likely concerns. Various approaches aim to automate this process using machine learning algorithms. The bug report classifier I mentioned earlier is one example of building such a system.

I made a pretty picture to illustrate this:

Localization Oracle

Typically a bug report is expressed in natural language (at least partially, with code snippets embedded). It is fed to the machine learning classifier (the magic oracle in the pic above), which assigns it to 1-N potential components. Component granularity and other details may vary, but this is the general idea.

For this, code structural elements used as features include (Tufano2018):

  • sequences of abstract syntax tree (AST) nodes
  • sequences of control-flow graph (CFG) nodes
  • bytecode representations. This seems interesting in mapping the code to fewer shared elements (opcodes)

Other features include (Lam2017):

  • camel-case splitting source code (n-grams would seem a natural fit too)
  • time since a file was previously changed when fixing a bug
  • how many bugs overall have been fixed in a file
  • similarity between a bug report and previous bug reports (and what were they assigned to)

Besides using such specific code structures as inputs, specific pre-processing steps are also taken. These include (Tufano2018, Li2018), with a small sketch of a couple of them after the list:

  • replacing constant values with their types
  • splitting camel-case identifiers
  • removing low-level, detailed abstract syntax tree (AST) nodes
  • filtering out methods shorter than 10 lines
  • using regular expressions to remove code format characters, and to identify code snippets embedded in the bug report
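
A rough sketch of a couple of these steps (my own regexes, not taken from the papers; the Jira-style {code} markers are an assumption about the report format):

import re

def split_camel_case(identifier):
    # "FileHandlerTest" -> ["File", "Handler", "Test"]
    return re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+', identifier)

def extract_code_snippets(report_text):
    # pull out code snippets wrapped in {code} ... {code} blocks
    return re.findall(r'\{code.*?\}(.*?)\{code\}', report_text, flags=re.DOTALL)

print(split_camel_case("QNetworkAccessManagerTest"))
print(extract_code_snippets("crash here: {code}delete ptr; ptr->run();{code}"))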

An industrial case study on bug localization from Ericsson is presented in (Johnson2016). Topic models built with Latent Dirichlet Allocation (LDA) are learned from the set of historical bug reports, and are used to assign topic weights to bug reports based on the report text. The assigned weights are compared to the learned topic distributions of the components, and the closer a report's distribution matches a component's learned distribution, the higher the probability of assigning the bug report to that component.
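
A hedged sketch of the LDA idea as I understand it (sklearn-based toy code with made-up reports, not the Ericsson implementation): learn topics from historical reports, build a per-component topic profile, and assign a new report to the component with the most similar topic distribution.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

reports = ["crash when opening file dialog", "font rendering broken on resize",
           "file chooser freezes the app", "text glyphs overlap after update"]
components = ["Widgets", "GUI: Text handling", "Widgets", "GUI: Text handling"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(reports)
lda = LatentDirichletAllocation(n_components=2, random_state=42)
report_topics = lda.fit_transform(counts)   # topic weight distribution per report

# component "profile" = mean topic distribution over its historical reports
profiles = {c: report_topics[[i for i, comp in enumerate(components) if comp == c]].mean(axis=0)
            for c in set(components)}

# assign a new report to the component with the closest topic distribution
new_topics = lda.transform(vectorizer.transform(["file dialog crashes on open"]))
scores = {c: cosine_similarity(new_topics, p.reshape(1, -1))[0, 0]
          for c, p in profiles.items()}
print(max(scores, key=scores.get))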

Vector Space Model (VSM) was used as a baseline comparison in many cases I found. This is based on TF-IDF scores (vectors) calculated for a document. Similarity between a bug report and source code files in VSM is calculated as a cosine similarity between their TF-IDF vectors. Revised Vector Space Model (rVSM) (Zhou2012) is a refinement of VSM that weights larger documents more, reasoning that bugs are more often found in larger source files. (Zhou2012) also adds weighting from similarity with previous bug reports.

Building on rVSM, (Lam2017) uses an auto-encoder neural network on TF-IDF weighted document terms to map different terms with similar meaning together for more accurate bug localization. Similarly, the "DeepSum" work (Li2018) uses an auto-encoder to summarize bug reports, and to compare their TF-IDF distance with cosine similarity. To me this use of auto-encoders seems like trying to re-invent word-embeddings for no obvious gain, but probably I am missing something. After auto-encoding, (Lam2017) combines a set of features using a deep neural network (a multi-layer perceptron (MLP), it seems) for the final probability evaluation. In any case, word-embedding style mapping of words into a smaller shared dimension is found in these works as in others.
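
For reference, a minimal sketch of such an auto-encoder over TF-IDF vectors (my own simplification of the idea, with made-up dimensions): compress the sparse TF-IDF representation into a dense vector, which can then be compared with cosine similarity.

from keras.layers import Input, Dense
from keras.models import Model

tfidf_dim = 20000   # size of the TF-IDF vocabulary (assumption)
code_dim = 128      # size of the compressed representation

inputs = Input(shape=(tfidf_dim,))
encoded = Dense(code_dim, activation='relu', name='encoder')(inputs)
decoded = Dense(tfidf_dim, activation='sigmoid', name='decoder')(encoded)

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# autoencoder.fit(X_tfidf, X_tfidf, epochs=10, batch_size=32)

# after training, the encoder part alone maps documents to the dense representation
encoder = Model(inputs, encoded)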

A Thought or Two

I am a bit surprised not to see much work applying RNN-type networks such as LSTMs and GRUs to these problems, since they are a great fit for processing textual documents. In my experience they are also quite powerful compared to traditional machine learning methods.

I think this type of bug report localization has practical relevance mainly for big companies with large product teams and customer bases, and complex processes (support levels, etc.). That means domains like telcos, which is also where the only clear industry report I listed here comes from (Ericsson). Something I have found limiting these types of applications in practice is the need for cross-domain vision to combine these topics and the related expertise; people often seem quite narrowly focused on specific areas. Black-box integration with common tools might help, again.

Defect Prediction

Software defect prediction refers to predicting which parts of the software are most likely to contain faults. Sometimes this is also referred to as fault-proneness analysis. The aim is to provide additional information to help focus testing efforts. This is very similar to the bug report localization discussed above, but with the goal of predicting where currently unknown bugs might appear (vs. localizing existing issue reports).

An overall review of this area is presented in (Malhotra2015), showing extensive use of traditional ML approaches (random forests, decision trees, support vector machines, etc.) and traditional source code metrics (lines of code, cyclomatic complexity, etc.) as features. The reviewed studies report reasonably good accuracies, from about 75% up to 93%.

However, another broad review of these approaches and their effectiveness is presented in (Zhou2018). It shows how simply using larger module size to predict higher fault proneness would give equal or better accuracy in many cases. This matches my experience from many contexts: keeping it simple is often very effective. On the other hand, finding that simplicity can be the real challenge, and you can learn a lot by trying different approaches.

More recently, deep learning based approaches have also been applied to this domain. Deep Belief Nets (DBN) are applied in (Wang2018) to generate features from the source code AST, combined with more traditional source code metrics. The presentation of the DBN part is a bit unclear to me, but it seems very similar to an MLP. The output of this layer is then termed (as far as I understand) a "semantic feature vector". I looked a bit into the difference between a DBN and an MLP, and found some practical discussion in a Keras issue. Make of it what you will. Do let me know if you figure it out better than I did (i.e., what the difference is between using an MLP-style fully connected dense layer here vs. a DBN).

An earlier version of the (Wang2018) work is refined and further explored using convolutional neural networks (CNNs) in (Li2017). In this case, a word2vec word-embedding layer is used as the first layer and trained on the source code and AST vocabulary. This is fed into a 1-dimensional CNN, one of the popular deep learning network types for text processing. The feature vector coming out of the CNN is then merged with a set of more traditional source metrics (lines of code, etc.), and the merged features are passed through the final network layers to a single-node output layer that gives the defect probability prediction.
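
My rough Keras sketch of this style of architecture (not the actual (Li2017) model; the layer sizes and input dimensions are made up): an embedding plus a 1D CNN over the AST token sequence, concatenated with the traditional metrics before the final prediction.

from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense, concatenate
from keras.models import Model

vocab_size, seq_length, n_metrics = 5000, 500, 20   # assumptions

tokens_in = Input(shape=(seq_length,), name="ast_tokens")
metrics_in = Input(shape=(n_metrics,), name="code_metrics")

x = Embedding(input_dim=vocab_size, output_dim=30)(tokens_in)   # word2vec-initialized in the paper
x = Conv1D(filters=10, kernel_size=5, activation='relu')(x)
x = GlobalMaxPooling1D()(x)

merged = concatenate([x, metrics_in])
merged = Dense(64, activation='relu')(merged)
output = Dense(1, activation='sigmoid', name="defect_probability")(merged)

model = Model(inputs=[tokens_in, metrics_in], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])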

Illustration of this network:

Metrics based model

To address class imbalance (more "clean" than "buggy" files), (Li2017) duplicates the minority class instances. They also compare against traditional metrics, the DBN from (Wang2018), and DBN+, which combines the traditional features with the DBN "semantic" features. As usual for research papers, they report getting better results with their CNN+ version. Everyone seems to do that, yet perfection never seems to be achieved, or even approached. Strange.
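
The duplication trick itself is simple enough; a minimal sketch (my own code, assuming a dataframe with a binary "buggy" label column):

import pandas as pd
from sklearn.utils import resample

def oversample_minority(df, label_col="buggy"):
    # randomly re-sample "buggy" rows with replacement until the classes are balanced
    majority = df[df[label_col] == 0]
    minority = df[df[label_col] == 1]
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=42)
    return pd.concat([majority, minority_up]).sample(frac=1, random_state=42)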

A Thought

The evolution in defect prediction seems to go from traditional classifiers with traditional "hand-crafted" features (source metrics) to deep-learning-generated and AST-based features combined with traditional metrics. Again, I did not see any RNN-based deep-learning classifier applications, although I would expect them to be quite well suited for this type of analysis. Perhaps next time.

Traceability Analysis

Despite everyone being all Agile now, heavier processes such as requirements traceability can still be needed, especially for complex enough systems and for ones with heavy regulatory or standards-based compliance requirements: medical, telco, automotive, … In the real world, such traces may not always be documented, and sometimes it is of interest to find them.

A line of work exploring the use of deep learning for automating the generation of traceability links between software artefacts is found in (Guo2017, Rath2018). These are from the same major software engineering conference (ICSE) in two consecutive years, with some author overlap. So there is some traceability in this work too, heh-heh (such joke, much non-academic). The first one (Guo2017) aims to link requirements to design and test artefacts in the train-control domain. The second aims to link code submissions to issues/bug reports on Github.

Requirements documents

Using recurrent neural networks (RNNs) to link requirements documents to other documents is investigated in (Guo2017). I covered this work to some extent already last year, but let's see if I can add something with what I have learned since.

Use cases for this, as mentioned in (Guo2017):

  • Finding new, missing (undocumented) links between artefacts.
  • Training on a set of existing data from existing projects, and applying the result to find links within a new project. This seems like a form of transfer learning, and is not explored in the paper, which focuses on the first bullet.

I find the approach used in (Guo2017) interesting, linking together two recurrent neural network (RNN) layers from parallel input branches for natural language processing (NLP):

Requirements linking NN

There are two identical input branches (top of the figure above): one for the requirements document, and one for the target document for which the link is assessed. Let's pretend the target is a test document, to stay relevant. A pair of documents is fed to the two input branches, and the network outputs the probability of these two documents being linked.

In ML you typically try different model configurations and hyperparameters to find what works best. In (Guo2017) they tried different types of layers and parameters, and the figure above shows what they found to work best for their task. See (Guo2017) for the experiment details on the other parameter values. Here, a bi-directional gated recurrent unit (bi-GRU) layer is used to process each document into a feature vector.

When the requirements document and the target document have been transformed into vector representations, they are fed into a pointwise multiplication layer (mul) and a vector subtraction layer (sub). In Keras these would seem to correspond to Merge layers of type "mul" and "sub". The merge layers are intended to describe the vector difference direction (mul) and distance (sub) across dimensions. A dense layer with sigmoid activation integrates these two merges, and the final output is given by a 2-neuron softmax layer (linked/not-linked probability).
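
My rough Keras sketch of this architecture as I understand it (the layer sizes and the vocabulary/sequence parameters are made up, and the embeddings would be initialized with the domain-specific word vectors):

from keras.layers import (Input, Embedding, Bidirectional, GRU,
                          Multiply, Subtract, Concatenate, Dense)
from keras.models import Model

vocab_size, seq_length, embedding_dim = 10000, 200, 50   # assumptions

def doc_branch(name):
    doc_in = Input(shape=(seq_length,), name=name)
    emb = Embedding(vocab_size, embedding_dim)(doc_in)
    vec = Bidirectional(GRU(64))(emb)   # bi-GRU -> one feature vector per document
    return doc_in, vec

req_in, req_vec = doc_branch("requirement")
tgt_in, tgt_vec = doc_branch("target_doc")

mul = Multiply()([req_vec, tgt_vec])    # element-wise "direction" comparison
sub = Subtract()([req_vec, tgt_vec])    # element-wise "distance" comparison
merged = Dense(64, activation='sigmoid')(Concatenate()([mul, sub]))
output = Dense(2, activation='softmax', name="linked_or_not")(merged)

model = Model(inputs=[req_in, tgt_in], outputs=output)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])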

For word-embeddings they try both a domain specific (train-control systems in this case) embedding with 50-dimensions, and a 300-dimensional one combining the domain-specific data with a Wikipedia text dump. They found the domain specific one works better, speculating it to be due to domain-specific use of words.

Since this prediction can produce many different possibilities in ranked order, simple accuracy of the top choice is not a very informative evaluation metric on its own. For evaluating the results, (Guo2017) uses mean average precision (MAP) as a metric. The MAP achieved in (Guo2017) is reported as up to 83.4%. The numbers seem relatively good, although I would need to play with the results to really understand everything in relation to this metric.

An interesting point from (Guo2017) is the way they address class imbalance. The set of document pairs with valid links is a small fraction of all possible pairs, so the imbalance between the true and false labels is big. They address this by selecting an equally sized set of true and false labels for each epoch, and switching the set of false-label items at the start of each epoch. So all the training data gets processed over time, while balance is held within each epoch. Nice.
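
A sketch of how that per-epoch rebalancing could be done with a Keras Sequence (my own code, not from the paper; the real model takes document pairs as two inputs, here the input is simplified to a single array):

import numpy as np
from keras.utils import Sequence

class BalancedLinkSequence(Sequence):
    def __init__(self, X_pos, X_neg, batch_size=32):
        self.X_pos, self.X_neg, self.batch_size = X_pos, X_neg, batch_size
        self.on_epoch_end()

    def on_epoch_end(self):
        # fresh random subset of negatives, same size as the positive set
        idx = np.random.choice(len(self.X_neg), size=len(self.X_pos), replace=False)
        X = np.concatenate([self.X_pos, self.X_neg[idx]])
        y = np.concatenate([np.ones(len(self.X_pos)), np.zeros(len(idx))])
        order = np.random.permutation(len(X))
        self.X, self.y = X[order], y[order]

    def __len__(self):
        return int(np.ceil(len(self.X) / self.batch_size))

    def __getitem__(self, i):
        batch = slice(i * self.batch_size, (i + 1) * self.batch_size)
        return self.X[batch], self.y[batch]

# used with e.g. model.fit_generator(BalancedLinkSequence(X_pos, X_neg), epochs=10)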

Github Issues

Traceability for linking code commits to bug tracker issues and improvement tickets ("bugs" and "improvement" in project Jira) is presented in (Rath2018). The studied projects are 6 open-source projects written in Java. Unlike the previous study on requirements linking, this study does not use deep-learning based approaches but rather manual feature engineering and more traditional ML classifiers (decision trees, naive bayes, random forest).

This is about mapping issue reports to commits that fix those issues:

Github Issue Linking

Besides more traditional features, (Rath2018) also makes use of time-related aspects as extra filtering features. A training set is built by finding commit messages that reference affected issue IDs. The features used include the following (a rough feature-assembly sketch follows the list):

  • Timestamp of the commit. It has to be later than the creation timestamp of any issue it could be linked to, and inside a given timeframe after the issue was marked resolved.
  • Closest commit before analyzed commit, and its linked issues.
  • Closest commit after analyzed commit, and its linked issues.
  • Committer id
  • Reporter id
  • Textual similarity of the commit source/message and the issue text: TF-IDF weighted word- and n-gram counts.
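
A hedged sketch of assembling such features and training a traditional classifier on them (the feature names are my own, not from the paper):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# hypothetical features for (commit, issue) candidate pairs
pairs = pd.DataFrame({
    "commit_after_issue_created": [1, 1, 0],
    "hours_since_issue_resolved": [2.0, 300.0, 12.0],
    "committer_is_reporter": [1, 0, 0],
    "neighbour_commit_links_issue": [1, 0, 0],
    "text_tfidf_cosine": [0.71, 0.05, 0.33],
    "is_true_link": [1, 0, 0],   # label from commits referencing the issue ID
})

X = pairs.drop(columns="is_true_link")
y = pairs["is_true_link"]
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# candidate issues per commit can then be ranked by the predicted link probability
print(clf.predict_proba(X)[:, 1])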

The study in (Rath2018) looks at two different types of analysis for the accuracy of the trained ML classifier. In the first case they try to "reconstruct known links", and in the second case to "construct unknown links". They further consider two different scenarios: recommending links for commits, and fully automated link generation. For the recommendation scenario, their goal is to have the correct link tag in the top 3 suggestions. The automated tagging scenario requires the first predicted tag to be correct.

Not surprisingly, the top-3 approach has better results, as it gives the classifier more freedom and leeway. Their results are reported with up to 95%+ recall but with a precision of around 30%. This seems to be in line with what I saw when I built my issue categorization classifier: the first result may not always be correct, but many good ones are near the top (and with so many possibilities, even the "correct" one might be a bit ambiguous).

The second use case of constructing previously unknown links sounds like it should be very similar in implementation to the first one, but it appears not. The main difference comes from there being large numbers of commits that do not actually address any specific Jira issue or ticket; the canonical example given is a refactoring commit. The obvious (in hindsight) result seems to be that you are more likely to find a link when one is known to exist (case 1) than when one might not exist at all (case 2) :).

A Thought or Two

The finding that domain-specific word-embeddings work better for requirements linking is interesting. In my previous LSTM bug predictor, I found that domain-specific training helps in a similar way, although in that case combining it with pre-trained word-embeddings also worked nicely. Of course, I used a large pre-trained Glove embedding for that and did not train on Wikipedia myself. And I used Glove vs Word2Vec, but I would not expect a big difference.

However, the domain-specific embedding performance sounds similar in spirit to ELMo, BERT, and other recent developments in context-sensitive embeddings. By training only on the domain-specific corpus, you likely get embeddings that are more tuned to that domain. Maybe the train-control domain in (Guo2017) has more specific vocabulary, or some other attributes that make the smaller domain-specific embedding alone better? Or maybe the type of embedding and its training data makes the difference? No idea. Here's hoping ELMo-style contextual embeddings are made easy to add to Keras models soon, so I can experiment with those more broadly as well. My obvious summary: I guess it is always better to try different options for your specific data and models..

Parting Notes

I tried to cover some different aspects of ML applications in software testing. The ones I covered seem to have quite a lot in common; in some sense, they are all mapping documents together. The sets of features are also quite common: "traditional" source code metrics along with NLP features. Many specific metrics have also been developed, as I listed above, such as modification and modifier (commit author) counts. Deep learning approaches are used to some extent, but they still seem to be making their way into this domain.

Besides what I covered, there are of course other approaches to applying ML to SW testing. I covered some last year, and (Durelli2019) covers much more from an academic perspective. But I found the ones covered here to be a rather representative set of those I consider closest to practical use today. If you have further ideas, I am happy to hear them.

In general, I have not seen much ML applied in meaningful ways to software testing. One approach I have used in the past is using ML as a tool for learning about a test network and its services (Kanstren2017). I am not sure if that really qualifies as an ML application to software testing, since it investigated properties of the test network itself and its services, not the process of testing. Perhaps the generalization of that is "using machine learning with testing technologies". This would be different from applying ML to testing itself, as well as from testing ML applications. Have to think about that.

Next I guess I will see if/when I have some time to look at the testing ML applications part. With all the hype on self-driving cars and everything else, that should be interesting.

See, I made this nice but too-small text picture of the three facets of ML and SW testing I listed above:

Test vs ML facets

References

R. Bhagwan et al., "Orca: Differential Bug Localization in Large-Scale Services", 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018.

B. Busjaeger, T. Xie, "Learning for Test Prioritization: An Industrial Case Study", 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE), 2016.

N. DiGiuseppe, J.A. Jones, "Concept-Based Failure Clustering", ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE), 2012.

V. H. S. Durelli et al., "Machine Learning Applied to Software Testing: A Systematic Mapping Study", IEEE Transactions on Reliability, 2019.

X. Gu, H. Zhang, S. Kim, "Deep code search", 40th International Conference on Software Engineering (ICSE), 2018.

J. Guo, J. Cheng, J. Cleland-Huang, "Semantically Enhanced Software Traceability Using Deep Learning Techniques", 39th IEEE/ACM International Conference on Software Engineering (ICSE), 2017.

L. Johnson, et al., "Automatic Localization of Bugs to Faulty Components in Large Scale Software Systems Using Bayesian Classification", IEEE International Conference on Software Quality, Reliability and Security (QRS), 2016.

T. Kanstren, "Experiences in Testing and Analysing Data Intensive Systems", IEEE International Conference on Software Quality, Reliability and Security (QRS, industry track), 2017

M. Kim, et al., "Data Scientists in Software Teams: State of the Art and Challenges", IEEE Transactions on Software Engineering, vol. 44, no. 11, 2018.

A. N. Lam, A. T. Nguyen, H. A. Nguyen, T. N. Nguyen, "Bug Localization with Combination of Deep Learning and Information Retrieval", IEEE International Conference on Program Comprehension, 2017.

J. Li, P. He, J. Zhu, M. R. Lye, "Software Defect Prediction via Convolutional Neural Network", IEEE International Conference on Software Quality, Reliability and Security (QRS), 2017.

X. Li et al., "Unsupervised deep bug report summarization", 26th Conference on Program Comprehension (ICPC), 2018.

R. Malhotra, "A systematic review of machine learning techniques for software fault prediction, Applied Soft Computing, 27,2015.

A. Memon et al., "Taming Google-scale continuous testing", 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), 2017.

M. Rath, J. Rendall, J.L.C Guo, J. Cleland-Huang, P. Mäder, "Traceability in the wild: Automatically Augmenting Incomplete Trace Links", 40th IEEE/ACM International Conference on Software Engineering (ICSE), 2018.

M. Tufano et al., "Deep learning similarities from different representations of source code", 15th International Conference on Mining Software Repositories (MSR), 2018.

S. Wang, T. Liu, J. Nam, L. Tan, "Deep Semantic Feature Learning for Software Defect Prediction", IEEE Transactions on Software Engineering, 2018.

J. Zhou, H. Zhang, D. Lo, "Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports", International Conference on Software Engineering (ICSE), 2012.

Y. Zhou et al, "How Far We Have Progressed in the Journey? An Examination of Cross-Project Defect Prediction", ACM Transactions on Software Engineering and Methodology, no. 1, vol. 27, 2018.

Attention for Bug Predictions

Previously I wrote about building an issue category predictor using LSTM networks on Keras. This was a two-layer bi-directional LSTM network. A neural network architecture that has been gaining some "attention" recently in NLP is Attention. This is simply an approach to have the network pay some more "attention" to specific parts of the input. That’s what I think anyway. So the way to use Attention layers is to add them to other existing layers.

In this post, I look at adding Attention to the network architecture of my previous post, and how this impacts the resulting accuracy and training of the network. Since Keras still does not have an official Attention layer at this time (or I cannot find one anyway), I am using one from CyberZHG’s Github. Thanks for the free code!

Network Configurations

I tried a few different (neural) network architectures with Attention, including the ones from my previous post, with and without Glove word embeddings. In addition to these, I tried adding a dense layer before the final output layer, after the last attention layer. Just because I heard bigger is better :). The maximum model configuration of this network looks like this:

attention model

With a model summary as:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
Input (InputLayer)           (None, 1000)              0         
_________________________________________________________________
embedding (Embedding)        (None, 1000, 300)         6000000   
_________________________________________________________________
lstm-bi1 (Bidirectional)     (None, 1000, 256)         440320    
_________________________________________________________________
drop1 (Dropout)              (None, 1000, 256)         0         
_________________________________________________________________
seq_self_attention_3 (SeqSel (None, 1000, 256)         16449     
_________________________________________________________________
lstm-bi2 (Bidirectional)     (None, 1000, 128)         164864    
_________________________________________________________________
drop2 (Dropout)              (None, 1000, 128)         0         
_________________________________________________________________
seq_weighted_attention_3 (Se (None, 128)               129       
_________________________________________________________________
mid_dense (Dense)            (None, 128)               16512     
_________________________________________________________________
drop3 (Dropout)              (None, 128)               0         
_________________________________________________________________
output (Dense)               (None, 165)               21285     
=================================================================
Total params: 6,659,559
Trainable params: 659,559
Non-trainable params: 6,000,000
_________________________________________________________________

I reduced this down to a few different configurations to see the impact on the loss and accuracy of the predictions.

I used 6 variants of this. Each had the 2 bi-directional LSTM layers. The variants:

  • 2-att-d: each bi-lstm followed by attention. dropout of 0.5 and 0.3 after each bi-lstm
  • 2-att: each bi-lstm followed by attention, no dropout
  • 2-att-d2: each bi-lstm followed by attention, dropout of 0.2 and 0.1 after each bi-lstm
  • 2-att-dd: 2-att-d2 with dense in the end, dropouts 0.2, 0.1, 0.3
  • 1-att-d: 2 bi-directional layers, followed by single attention. dropout 0.2 and 0.1 after each bi-lstm.
  • 1-att: 2 bi-directional layers, followed by single attention. no dropout.

The code for the model definition with all the layers enabled:

    # imports assumed: standard Keras layers, plus the attention layers from
    # CyberZHG's keras-self-attention package linked above; other variables
    # (sequence_length, vocab_size, embedding_matrix, attention_width, cat_count, ...)
    # come from the surrounding setup
    from keras.layers import Input, Embedding, Bidirectional, CuDNNLSTM, Dropout, Dense
    from keras.models import Model
    from keras_self_attention import SeqSelfAttention, SeqWeightedAttention

    input = Input(shape=(sequence_length,), name="Input")
    embedding = Embedding(input_dim=vocab_size, 
                          weights=[embedding_matrix],
                          output_dim=embedding_dim, 
                          input_length=sequence_length,
                          trainable=embeddings_trainable,
                          name="embedding")(input)
    lstm1_bi1 = Bidirectional(CuDNNLSTM(128, return_sequences=True, name='lstm1'), name="lstm-bi1")(embedding)
    drop1 = Dropout(0.2, name="drop1")(lstm1_bi1)
    attention1 = SeqSelfAttention(attention_width=attention_width)(drop1)
    lstm2_bi2 = Bidirectional(CuDNNLSTM(64, return_sequences=True, name='lstm2'), name="lstm-bi2")(attention1)
    drop2 = Dropout(0.1, name="drop2")(lstm2_bi2)
    attention2 = SeqWeightedAttention()(drop2)
    mid_dense = Dense(128, activation='sigmoid', name='mid_dense')(attention2)
    drop3 = Dropout(0.2, name="drop3")(mid_dense)
    output = Dense(cat_count, activation='sigmoid', name='output')(drop3)
    model = Model(inputs=input, outputs=output)

Results

The 3 tables below summarize the results for the different model configurations, using the different embeddings versions of:

  • Glove initialized embeddings, non-trainable
  • Glove initialized embeddings, trainable
  • Uninitialized embeddings, trainable

Non-trainable glove:

model epoch max-accuracy min-loss
2-att-d 9 0.434 2.3655
2-att 8 0.432 2.4323
2-att 7 0.430 2.3974
2-att-d2 8 0.435 2.3466
2-att-dd 9 0.423 2.4046
1-att 8 0.405 2.3132
1-att-d 8 0.410 2.4817

Trainable glove:

model epoch max-accuracy min-loss
2-att-d 6 0.537 1.9735
2-att 5 0.540 1.9940
2-att 4 0.533 1.9643
2-att-d2 6 0.529 2.0279
2-att-d2 5 0.528 1.9852
2-att-dd 7 0.505 2.2565
1-att 7 0.531 2.0869
1-att 5 0.530 1.9927
1-att-d 6 0.526 2.0180

Trainable uninitialized:

model epoch max-accuracy min-loss
2-att-d 7 0.510 2.1198
2-att-d 6 0.502 2.106
2-att 7 0.522 2.0915
2-att-d2 8 0.504 2.2104
2-att-d2 6 0.490 2.1684
2-att-dd 9 0.463 2.4869
1-att 8 0.510 2.1413
1-att 7 0.505 2.1138
1-att-d 8 0.517 2.1552

The training curves in general look similar to this (picked from one of the best results):

attention results

So not too different from my previous results.

At least with these types of results, it is nice to see realistic-looking training + validation accuracy and loss curves, with the training curve going up and crossing the validation curve at some point close to where overfitting starts. I have recently done a lot of Kaggle with the intent to learn how to use all these things in practice, and I think Kaggle really is a great place for that. However, the competitions seem to be geared to trick you somehow, the given training vs test sets are often weirdly skewed, and the competition is about squeezing out tiny fractions of accuracy. Compared to that, this is a nice, fresh view of the real world being more sane for ML applications. But I digress..

Summary insights from the above:

  • The tables above show how adding Attention does increase the accuracy by about 10%, from a bit below 50% to about 54% in the best case.
  • Besides the impact of adding Attention, the rest of the configurations seem to be merely fiddling with it, without much impact on accuracy or loss.
  • 2 Attention layers are slightly better than one
  • Dropout quite consistently has a small negative impact on performance. As I did not try that many configurations, maybe it could be improved by tuning.
  • Final dense layer here mainly has just a negative impact
  • The Kaggle kernels I ran this with have a nasty habit of sometimes cutting the output for some parts of the code. In this case they consistently cut the non-trainable Glove version output at around the 9th epoch, which is why all of those in the tables above are listed as best around the 8th or 9th epoch. They might have shown small gains for one or two more epochs still. However, the curves were already plateauing so strongly that I do not think it is a big issue for these results.
  • Due to training with a smaller batch size taking longer, I had to limit the epochs to 10 from the previous post's 15. On the other hand, Attention seems to converge faster, so not that bad a tradeoff.

Attention: My Final Notes

I used the Attention layer from the Github account I linked. Very nice work in that Github, in many related repos, BTW. Definitely worth checking out. This layer seems very useful. However, it seems to be tailored to the repo owner's specific needs and is not documented in much detail. There seem to be several different variants of Attention layers floating around the internets: some only work on previous versions of Keras, others require 3D input, others only 2D.

For example, the Attention layers used above work on 3D inputs: SeqSelfAttention takes 3D sequences as input and outputs 3D sequences, while SeqWeightedAttention takes 3D input and outputs 2D. There is at least one implementation being copy-pasted around in Kaggle kernels that uses 2D inputs and outputs. Some other custom Keras layers seem to have gone stale. Another one I found on Github seems promising but has not been updated; one of its issues links to a patched version, though. In any case, my goal was not to compare different custom implementations, so I will just wait for the official one and play with this for now.

As noted, I ran these experiments on Kaggle kernels. At the time they were running on NVidia P100 GPUs, which are intended as datacenter-scale products. These have 16GB of GPU memory, which at this time is a lot. Using the two attention layers I described above, I managed to exhaust this memory quite easily. This is maybe because I used a rather large sequence length of 1000 timesteps (words). The model summary I printed above shows the Attention layers having only 16449 and 129 parameters to train, so the implementation must otherwise require plenty of space. Not that I understand the details at such depth, but something to consider.

Some of the errors I got when setting up these Attention layers also seemed to indicate it was building a 4D representation, by adding another dimension (of size 1000) on top of the layer it was paying attention to (the bi-LSTM in this case). This sort of makes sense, considering it takes a 3D input (the LSTM sequence output) and pays attention to it. The attention window is just one parameter that could be tuned in this Attention implementation, so a better understanding of the implementation and its tuning parameters/options/impacts would likely be useful, and might help with many of my issues.

Overall, as far as I understand, using a smaller number of timesteps is quite common. Likely using fewer would still give very good results, while allowing more freedom to experiment with other parts of the model architecture without running out of memory. The memory issue required me to run with a much smaller batch size of 48, down from 128 and higher before. This again has the effect of slowing training, as with a smaller batch size it takes longer to process the whole dataset.

"Official" support for Attention has been a long time coming (well, in terms of DL frameworks anyway..), and seems to be quite awaited feature (so the search-engines tell me). The comments I link above (in the Keras issue tracker on Github) also seem to contain various proposals for implementation. Perhaps the biggest issue still being the need to figure out how the Keras team wants to represent Attention to users, and how to make it as easy to use (and I suppose effective) as possible. Still, over these years of people waiting, maybe it would be nice to have something and build on that? Of course, as a typical OSS customer, I expect to have all this for free, so that is my saltmine.. 🙂

Some best practices / easy to understand documentation I would like to see:

  • Tradeoffs in using different types of Attention: 3D, 2D, attention windows, etc.
  • Attention in multi-layer architectures, where does it make the most sense and why (intuitively too)
  • Parameter explanations and tuning experiences / insights (e.g., attention window size)
  • Other general use types in different types of networks

Bug Report Classifier with LSTM on Keras

I previously did a review of applications of machine learning in software testing and network analysis. I was looking at updating that, maybe with some extra focus. As usual, I got distracted. This time into building an actual system to do some of the tasks discussed in those reviews. This post discusses how I built a bug report classifier based on bug report descriptions. Or more generally, they are issues listed in a public Jira, but nevermind..

The classifier I built here is based on bi-directional LSTM (long short-term memory) networks using Keras (with Tensorflow). So deep learning, recurrent neural networks, word embeddings. Plenty of trendy things to see here.

Getting some data

The natural place to go looking for this type of data is open source projects and their bug databases. I used the Qt project bug tracker (see, even the address has the word "bug" in it, not "issue"). It seems to be based on the commonly used Jira platform. You can go to the web site, select a project, fill in filters, click export, and directly get a CSV-formatted output file that can simply be imported into Pandas and thus into the Python ML and data analytics libraries. This is what I did.

Since the export interface only allows downloading data for 1000 reports at a time, I scripted it. Using Selenium Webdriver, I automated filling in the download filters one month at a time. This script is stored in my ML-experiments Github, along with a script that combines all the separate downloads into one CSV file. Hopefully I don't move these around the repo too much and keep breaking the links.

Some points in building such a downloader (a hedged configuration sketch follows the list):

  • Disable save dialog in browser via Selenium settings
  • Autosave on
  • Wait for download to complete by scanning partial files or otherwise
  • Rename latest created file according to filtered time
  • Check filesizes, bug creation dates and index continuity to see if something was missing
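
A hedged sketch of the browser configuration part, using Chrome as an example (the download directory and the .crdownload check are assumptions for illustration, not necessarily exactly what my script does):

import glob
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

download_dir = "/tmp/qt_bug_exports"   # made-up path
options = Options()
options.add_experimental_option("prefs", {
    "download.default_directory": download_dir,
    "download.prompt_for_download": False,   # disable the save dialog
})
driver = webdriver.Chrome(options=options)

def wait_for_download(timeout=120):
    # Chrome writes partial downloads as .crdownload files; wait until none remain
    end = time.time() + timeout
    while time.time() < end:
        if not glob.glob(download_dir + "/*.crdownload"):
            return True
        time.sleep(1)
    return False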

Exploring the data

Before running at full speed into building a classifier, it is generally good to explore the data a bit and see what it looks like, what can be learned, etc. Commonly this is called exploratory data analysis (EDA).

First, read the data and set dates to date types to enable date filters:

import pandas as pd

df_bugs = pd.read_csv("bugs/reduced.csv", 
             parse_dates=["Created", "Due Date", "Resolved"])
print(df_bugs.columns)

This gives 493 columns:

Index(['Unnamed: 0', 'Affects Version/s', 'Affects Version/s.1',
       'Affects Version/s.10', 'Affects Version/s.11', 
       'Affects Version/s.12',
       'Affects Version/s.13', 'Affects Version/s.14', 
       'Affects Version/s.15', 'Affects Version/s.16',
       ...
       'Status', 'Summary', 'Time Spent', 'Updated', 'Votes',
       'Work Ratio', 'Σ Original Estimate', 'Σ Remaining Estimate',
       'Σ Time Spent'], dtype='object', length=493)

This is a lot of fields for a bug report. The large count is because Jira seems to dump multi-valued items into multiple columns. The above snippet shows an example of "Affects Version/s" being split into multiple columns: if some bug has at most 20 affected versions, then all exported rows will have 20 columns for it. So, one item per column when there are many. A simple way I used to combine them was to count the non-null values:

#this dataset has many very sparse columns,
#where each comment is a different field, etc.
#this just sums all these comment1, comment2, comment3, ...
#as a count of such items
#big_df / all_cols: the dataframe being combined and its full column list
def sum_columns(old_col_name, new_col_id):
    old_cols = [col for col in all_cols if old_col_name in col]
    olds_df = big_df[old_cols]
    olds_non_null = olds_df.notnull().sum(axis=1)
    big_df.drop(old_cols, axis=1, inplace=True)
    big_df[new_col_id] = olds_non_null

#just showing two here as an example
sum_columns("Affects Version", "affects_count")
sum_columns("Comment", "comment_count")
...
print(df_bugs.columns)
Index(['Unnamed: 0', 'Assignee', 'Created', 'Creator', 
       'Description',
       'Due Date', 'Environment', 'Issue Type', 'Issue id', 
       'Issue key',
       'Last Viewed', 'Original Estimate', 'Parent id', 'Priority',
       'Project description', 'Project key', 'Project lead', 
       'Project name', 'Project type', 'Project url', 
       'Remaining Estimate', 'Reporter',
       'Resolution', 'Resolved', 'Security Level', 'Sprint', 
       'Sprint.1',
       'Status', 'Summary', 'Time Spent', 'Updated', 'Votes', 
       'Work Ratio', 'Σ Original Estimate', 
       'Σ Remaining Estimate', 'Σ Time Spent',
       'outward_count', 'custom_count', 'comment_count', 
       'component_count',
       'labels_count', 'affects_count', 'attachment_count',
       'fix_version_count', 'log_work_count'],
      dtype='object')

So, that would be 45 columns after combining several of them into counts. Down from 493, and it makes it easier to find the bugs with the most votes, comments, etc. This enables views such as:

(df_bugs.sort_values(by="Votes", ascending=False)
        [["Issue key", "Summary", "Issue Type",
          "Status", "Votes"]].head(10))

most voted

In a similar way, bug priority counts:

order = ['P0: Blocker', 'P1: Critical', 'P2: Important',
         'P3: Somewhat important','P4: Low','P5: Not important',
         'Not Evaluated']

df_2019["Priority"].value_counts().loc[order]
       .plot(kind='bar', figsize=(10,5))

issue priorities

In addition, I ran various other summaries and visualizations on it to get a bit more familiar with the data.

The final point was to build a classifier and see how well it does. A classifier needs a classification target; I went with the assigned component. So my classifier tries to predict the component to assign a bug report to, using only the bug report's natural-language description.

To start with, a look at the components. A bug report in this dataset can be assigned to multiple components. Similar to the "Affects version" above.

The distribution looks like this:

df_2019["component_count"].value_counts().sort_index()
1     64499
2      5616
3       596
4        64
5        10
6         5
7         1
8         3
9         1
10        3
11        1

This shows that having an issue assigned to more than 2 components is rare, and to more than 3 very rare. For this experiment, I only collected the first two components the bugs were assigned to (if any). Of those, I simply used the first assigned component as the training target label. Some further training data could be had by adding items labeled with the second and third components as well, or with all of them if feeling like it. But the first component served well enough for this experiment.
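
As a side note on how those comp1/comp2 columns could be picked out: a small sketch of the idea, where the exact column names ("Component/s", "Component/s.1") are an assumption based on how Jira named the other multi-valued columns, and this would be done before the column-combination step drops the originals.

# first and second assigned component, if any (column names are an assumption)
df_2019["comp1"] = df_2019["Component/s"]
df_2019["comp2"] = df_2019["Component/s.1"]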

How many unique ones are in those first two?

values = set(df_2019["comp1"].unique())
values.update(df_2019["comp2"].unique())
len(values)
172

So there would be 172 components to predict. And what does their issue count distribution look like?

counts = df_2019["comp1"].value_counts()
counts
1.  QML: Declarative and Javascript Engine     5260
2.  Widgets: Widgets and Dialogs               4547
3.  Documentation                              3352
4.  Quick: Core Declarative QML                2037
5.  Qt3D                                       1928
6.  QPA: Other                                 1898
7.  Build tools: qmake                         1862
8.  WebEngine                                  1842
9.  Packaging & Installer                      1803
10. Build System                               1801
11. Widgets: Itemviews                         1534
12. GUI: Painting                              1480
13. Multimedia                                 1478
14. GUI: OpenGL                                1462
15. Quick: Controls                            1414
16. GUI: Text handling                         1378
17. Core: I/O                                  1265
18. Device Creation                            1255
19. Quick: Controls 2                          1173
20. GUI: Font handling                         1141
...
155. GamePad                                     14
156. KNX                                         12
157. QPA: Direct2D                               11
158. ODF Writer                                   9
157. Network: SPDY                                8
158. GUI: Vulkan                                  8
159. Tools: Qt Configuration Tool                 7
160. QPA: KMS                                     6
161. Extras: X11                                  6
162. PIM: Versit                                  5
163. Cloud Messaging                              5
164. Testing: QtUITest                            5
165. Learning/Course Material                     4
166. PIM: Organizer                               4
167. SerialBus: Other                             3
168. Feedback                                     3
169. Systems: Publish & Subscribe                 2
170. Lottie                                       2
171. CoAP                                         1
172. Device Creation: Device Utilities            1

The above shows that the issue count distribution is very unbalanced.

To summarize the above, there are 172 components with largely uneven distributions. Imagine trying to predict the correct component out of 172 options, given that for some of them there is very limited data available. It would seem very difficult to learn to distinguish the ones with very little training data. I guess this skewed distribution might be due to new components having little data on them yet. In a more realistic scenario, this would merit some additional consideration, maybe collecting all of these into a new category like "Other, check manually". And updating the training data constantly, re-training the model as new issues/bugs are added. Well, that is probably a good idea anyway.

Besides these components with very few linked issues, there are several in the dataset marked as "Inactive". It would likely also be beneficial to remove these from the training set, since we would not expect to see any new issues for them. I did not do it, as for my experiment this is fine even without. In any case, this is what it looks like:

df_2019[df_2019["comp1"].str.contains("Inactive")]["comp1"].unique()
array(['(Inactive) Porting from Qt 3 to Qt 4',
       '(Inactive) GUI: QWS Integration (Qt4)', '(Inactive) Phonon',
       '(Inactive) mmfphonon', '(Inactive) Maemo 5', 
       '(Inactive) OpenVG',
       '(Inactive) EGL/Symbian', '(Inactive) QtQuick (version 1)',
       '(Inactive) Smart Installer ', '(Inactive) JsonDB',
       '(Inactive) QtPorts: BB10', '(Inactive) Enginio'], dtype=object)

I will use the "description" column for the features (the words in the description), and the filtered "comp1" column shown above for the target.

Creating an LSTM Model for Classification

This classifier code is available on Github. As shown above, some of the components have very few values (issues to train on). Longish story shorter, I cut out the targets with 10 or fewer values:

min_count = 10
df_2019 = df_2019[df_2019['comp1']
              .isin(counts[counts >= min_count].index)]

This enabled me to do a 3-way train-validation-test split and still have some data in each of the 3 splits for each target component. A stratified 3-way split, that is. Code:

from sklearn.model_selection import train_test_split

def train_val_test_split(X, y):
    X_train, X_test_val, y_train, y_test_val = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)
    X_val, X_test, y_val, y_test = train_test_split(
        X_test_val, y_test_val, test_size=0.25,
        random_state=42, stratify=y_test_val)
    return X_train, y_train, X_val, y_val, X_test, y_test

Before using that, I need to get the data to train, that is the X (features) and y (target).

To get the features, tokenize the text. For an RNN the input data needs to be a fixed-length vector (of tokens), so cut the document at seq_length if longer, or pad it to that length if shorter. This uses the Keras tokenizer, which I would expect to produce suitable output for Keras..

# Tokenizer and pad_sequences come from keras.preprocessing
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

def tokenize_text(vocab_size, texts, seq_length):
    tokenizer = Tokenizer(num_words=vocab_size)
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)

    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))

    X = pad_sequences(sequences, maxlen=seq_length)
    print('Shape of data tensor:', X.shape)

    return texts, X, tokenizer

That produces the X. To produce the y:

from sklearn.preprocessing import LabelEncoder

df_2019.dropna(subset=['comp1', "Description"], inplace=True)
# encode class values as integers 
# so they work as targets for the prediction algorithm
encoder = LabelEncoder()
df_2019["comp1_label"] = encoder.fit_transform(df_2019["comp1"])

The "comp1_label" in above now has the values for the target y variable.

To put these together:

data = df_2019["Description"]
vocab_size = 20000
seq_length = 1000
data, X, tokenizer = tokenize_text(vocab_size, data, seq_length)

y = df_2019["comp1_label"]
X_train, y_train, X_val, y_val, X_test, y_test = \
                            train_val_test_split(X, y)

The 3 sets of y_xxxx variables still need to be converted to Keras format, which is a one-hot encoded 2D-matrix. To do this after the split:

y_train = to_categorical(y_train)
y_val = to_categorical(y_val)
y_test = to_categorical(y_test)

Word Embeddings

I am using Glove word vectors, in this case the relatively small set based on 6 billion tokens (words), with 300 dimensions. The vectors are stored in a text file, with one word per line along with its vector values: the first item on a line is the word, followed by its 300-dimensional vector. The following loads this into the embeddings_index dictionary, with words as keys and the vectors as values.

def load_word_vectors(glove_dir):
    print('Indexing word vectors.')

    embeddings_index = {}
    f = open(os.path.join(glove_dir, 'glove.6B.300d.txt'),
             encoding='utf8')
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()

    print('Found %s word vectors.' % len(embeddings_index))
    return embeddings_index

With these loaded, convert the embedding index into matrix form that the Keras Embedding layer uses. This simply puts the embedding vector for each word at a specific index in the matrix. So if word "bob" is in index 10 in word_index, the embedding vector for "bob" will be in embedding_matrix[10].

def embedding_index_to_matrix(embeddings_index, vocab_size,
                              embedding_dim, word_index):
    print('Preparing embedding matrix.')

    # prepare embedding matrix
    num_words = min(vocab_size, len(word_index))
    embedding_matrix = np.zeros((num_words, embedding_dim))
    for word, i in word_index.items():
        if i >= vocab_size:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

Build the bi-LSTM model

I use two model versions here. The first one uses the basic LSTM layer from Keras. The second uses the CUDA-optimized CuDNNLSTM layer. I used the CuDNNLSTM version to train the model on a GPU, saved the weights after training, and then loaded the weights into the plain LSTM version. I then used the plain LSTM version to do the predictions on my laptop when developing and demoing this (a small sketch of the save/load step follows the two model definitions below).

Plain LSTM version:

def build_model_lstm(vocab_size, embedding_dim, 
                     embedding_matrix, sequence_length, cat_count):
    input = Input(shape=(sequence_length,), name="Input")
    embedding = Embedding(input_dim=vocab_size, 
                          weights=[embedding_matrix],
                          output_dim=embedding_dim,
                          input_length=sequence_length,
                          trainable=False,
                          name="embedding")(input)
    lstm1_bi1 = Bidirectional(LSTM(128, return_sequences=True,
                      name='lstm1'), name="lstm-bi1")(embedding)
    drop1 = Dropout(0.2, name="drop1")(lstm1_bi1)
    lstm2_bi2 = Bidirectional(LSTM(64, return_sequences=False,
                      name='lstm2'), name="lstm-bi2")(drop1)
    drop2 = Dropout(0.2, name="drop2")(lstm2_bi2)
    output = Dense(cat_count, 
                   activation='sigmoid', name='sigmoid')(drop2)
    model = Model(inputs=input, outputs=output)
    model.compile(optimizer='adam', 
          loss='categorical_crossentropy', metrics=['accuracy'])
    return model

CuDNNLSTM version:

def build_model_lstm_cuda(vocab_size, embedding_dim, 
                 embedding_matrix, sequence_length, cat_count):
    input = Input(shape=(sequence_length,), name="Input")
    embedding = Embedding(input_dim=vocab_size,
                          output_dim=embedding_dim, 
                          weights=[embedding_matrix],
                          input_length=sequence_length,
                          trainable=False,
                          name="embedding")(input)
    lstm1_bi1 = Bidirectional(CuDNNLSTM(128, 
                              return_sequences=True, name='lstm1'),
                              name="lstm-bi1")(embedding)
    drop1 = Dropout(0.2, name="drop1")(lstm1_bi1)
    lstm2_bi2 = Bidirectional(CuDNNLSTM(64, 
                              return_sequences=False, name='lstm2'),
                              name="lstm-bi2")(drop1)
    drop2 = Dropout(0.2, name="drop2")(lstm2_bi2)
    output = Dense(cat_count, 
                   activation='sigmoid', name='sigmoid')(drop2)
    model = Model(inputs=input, outputs=output)
    model.compile(optimizer='adam', 
          loss='categorical_crossentropy', metrics=['accuracy'])
    return model

The structure of the above models visualizes to this:

model structure

The first layer is an embedding layer, and uses the embedding matrix from the pre-trained Glove vectors. This is followed by the two bi-LSTM layers, each with a Dropout layer behind it. The bi-LSTM layer looks at each word along with its context (as I discussed previously). The dropout layers help avoid overfitting. Finally, a dense layer is used to make the prediction from "cat_count" categories. Here cat_count is the number of categories to predict. It is actually categories and not cats, sorry about that.

The "weights=[embedding_matrix]" parameter given to the Embedding layer is what can be used to initialize the pre-trained word-vectors. In this case, those would be the Glove word-vectors. The current Keras Embedding docs say nothing about this parameter, which is a bit weird. Searching for this on the internet also seems to indicate it would be deprecated etc. but it also seems difficult to find a simple replacement. But it works, so I go with that..

In a bit more detail, the model summarizes to this:

model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
Input (InputLayer)           (None, 1000)              0         
_________________________________________________________________
embedding (Embedding)        (None, 1000, 300)         6000000   
_________________________________________________________________
lstm-bi1 (Bidirectional)     (None, 1000, 256)         440320    
_________________________________________________________________
drop1 (Dropout)              (None, 1000, 256)         0         
_________________________________________________________________
lstm-bi2 (Bidirectional)     (None, 128)               164864    
_________________________________________________________________
drop2 (Dropout)              (None, 128)               0         
_________________________________________________________________
sigmoid (Dense)              (None, 165)               21285     
=================================================================
Total params: 6,626,469
Trainable params: 626,469
Non-trainable params: 6,000,000
_________________________________________________________________

This shows how the embedding layer turns the input into a suitable shape for LSTM input as I discussed in my previous post. That is, 1000 timesteps, each with 300 features. Those being the 1000 tokens for each document (issue report description) and 300 dimensional word-vectors for each token.

Another interesting point is the text at the end of the summary: "Non-trainable params: 6,000,000". This matches the number of parameters in the summary for the embedding layer. When the embedding layer is given the parameter "trainable=False", all the parameters in it are fixed. If this is set to True, then all these parameters will be trainable as well.
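
A small sketch of checking this by flipping the flag on the built model (re-compiling is needed for the change to take effect in training):

model = build_model_lstm(vocab_size=vocab_size, embedding_dim=embedding_dim,
                         embedding_matrix=embedding_matrix,
                         sequence_length=seq_length,
                         cat_count=len(encoder.classes_))
model.get_layer("embedding").trainable = True
model.compile(optimizer='adam',
              loss='categorical_crossentropy', metrics=['accuracy'])
model.summary() #"Non-trainable params" should now show 0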

Training it

Training the model is simple now that everything is set up:

checkpoint_callback = ModelCheckpoint(filepath=
   "./model-weights-issue-pred.{epoch:02d}-{val_loss:.6f}.hdf5",
   monitor='val_loss', verbose=0, save_best_only=True)

model = build_model_lstm_cuda(vocab_size=vocab_size,
                    embedding_dim=embedding_dim,
                    sequence_length=seq_length,
                    embedding_matrix=embedding_matrix,
                    cat_count=len(encoder.classes_))

history = model.fit(X_train, y_train,
          batch_size=128,
          epochs=15,
          validation_data=(X_val, y_val),
          callbacks=[checkpoint_callback])

model.save("issue_model_word_embedding.h5")

score, acc = model.evaluate(x=X_test,
                            y=y_test,
                            batch_size=128)
print('Test loss:', score)
print('Test accuracy:', acc)

Notice that I use the build_model_lstm_cuda() version here. That is to train on the GPU environment, to have some sensible training time.

The callback given will just save the model weights when the validation score improves. In this case it monitors the validation loss getting smaller (no mode="max" as in my previous version).
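
The GPU-to-CPU hand-over I described earlier is not shown in code anywhere above, so here is a sketch of it, with the weight file name as a placeholder. The plain LSTM model built with the same configuration can load the weights saved from the CuDNNLSTM version, and that CPU model is then the one used for the predictions below:

#save the weights of the trained CuDNNLSTM model
model.save_weights("lstm_cuda_weights.h5")

#build the plain LSTM version with the same configuration and load the weights
model = build_model_lstm(vocab_size=vocab_size,
                         embedding_dim=embedding_dim,
                         embedding_matrix=embedding_matrix,
                         sequence_length=seq_length,
                         cat_count=len(encoder.classes_))
model.load_weights("lstm_cuda_weights.h5")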

Predicting

Prediction with a trained model:

def predict(bug_description, seq_length):
    #texts_to_sequences vs text_to_word_sequence?
    sequences = tokenizer.texts_to_sequences([bug_description])

    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))

    X = pad_sequences(sequences, maxlen=seq_length)

    probs = model.predict(X)
    result = []
    for idx in range(probs.shape[1]):
        name = le_id_mapping[idx]
        prob = (probs[0, idx]*100)
        prob_str = "%.2f%%" % prob
        #print(name, ":", prob_str)
        result.append((name, prob))
    return result

Running this on the bug QTBUG-74496 gives the following top predictions:

Quick: Other: 0.3271
(Inactive) QtQuick (version 1): 0.4968
GUI: Drag and Drop: 0.7292
QML: Declarative and Javascript Engine: 2.6450
Quick: Core Declarative QML : 5.8533

The bigger number signifies a higher likelihood given by the classifier. This highlights many of the topics I mentioned above. There is one inactive component in the list, which suggests it might be better to remove all inactive components from the training set. The top one presented (Quick: Core Declarative QML) is not the one assigned to the report at this time, but the second highest is (QML: Declarative and Javascript Engine). They both seem to be associated with the same top-level component (QML) and I do not have the expertise to say why one might be better than the other in this case.

In most of the issue reports I tried this on, it seemed to get the "correct" one as marked on the issue tracker. In the ones that did not match, the suggestion always seemed to make sense (sometimes more than what had been set by whoever sets the value), and in case of mismatch, the "correct" one was commonly still in the top suggestions. But overall, component granularity might be useful to consider as well when building these types of classifiers and their applications.

Usefulness of pre-trained word-embeddings

When doing the training, I started with using the Glove embeddings, and sometimes just accidentally left them out and trained without them. This reminded me to do an experiment to see how much effect using the pre-trained embeddings actually has, and how the accuracy etc. is affected with or without them. So I tried to train the model with different options for the Embedding layer (sketched in code after the list below):

  • fixed Glove vectors (trainable=False)
  • trainable Glove vectors (trainable=True)
  • trainable, non-initialized vectors (trainable=True, no Glove)
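
A minimal sketch of the three Embedding configurations (the build_model_lstm() above hardcodes the first one; variable names are the same as in the setup earlier):

from keras.layers import Embedding

#1: fixed Glove vectors
emb_glove_fixed = Embedding(vocab_size, embedding_dim,
                            weights=[embedding_matrix], trainable=False)
#2: Glove vectors as initialization, trained further
emb_glove_tuned = Embedding(vocab_size, embedding_dim,
                            weights=[embedding_matrix], trainable=True)
#3: no pre-trained initialization, trained from scratch
emb_from_scratch = Embedding(vocab_size, embedding_dim, trainable=True)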

The training accuracy/loss curves look like this:

embedding accuracies

The figure is a bit small but you can hopefully expand it. At least I uploaded it bigger :).

These results can be summarized as:

  • Non-trainable Glove alone improves all the way for the 15 iterations (epochs) and reaches validation accuracy of about 0.4 and loss around 2.55. The improvements got really small there so I did not try further epochs.
  • Trainable, uninitialized (no Glove) got best validation accuracy/loss at epoch 11 for 0.488 accuracy and 2.246 loss. After this it overfits.
  • Trainable with Glove initialization reaches best validation accuracy/loss at epoch 8 for 0.497 accuracy and 2.154 loss. After this it overfits.

Some interesting points I gathered from this:

  • Overall, non-trainable Glove gives quite poor results compared to the trainable embeddings in the two other options (but I guess still quite usable).
  • Glove initialized but further trained embeddings converge much faster and get better scores.
  • I guess further trained Glove embeddings would be a form of "transfer" learning. Cool, can I put that in my CV now?

My guess is that the bug descriptions have many terms not commonly used in general domains, which means the general Glove embeddings need to be trained further to be effective for them. When I did some exploratory analysis of the data (e.g., TF-IDF across components) these types of terms were actually quite visible. However, they are intermixed with the general terms, which benefit from Glove, and mixing the two gives the best results. Just my "educated" guess. My other guess is that the result would be quite similar in other domains with domain-specific terminology.

General Notes (or "Discussion" in academic terms..)

The results from the training and validation showed about 50% accuracy. In a binary classification problem this would not be so great. More like equal to random guessing. But with 160+ targets to choose from, this seems to me to be very good. The loss is maybe a better metric, but I am not that good at interpreting the loss against 160+ categories. Simply smaller is better, and it should overall signify how far off the predictions for categories are. (but how to measure and interpret the true distance of all, when you just give one as correct target label, you tell me?)

Also, as noted earlier, an issue can be linked to several components. And from my tries of running this with new data and comparing results, the mapping is not always very clear, and there can be multiple "correct" answers. This is also shown in my prediction example above. The results given by Keras predict() are actually listing the probabilities for each of the 160+ potential targets. So if accuracy just measures the likelihood of getting it exactly right, it misses the ones that are predicted at position 2, 3, and so on. One way I would see using this would be to provide assistance on selections to an expert analyzing the incoming bug reports. In such case, having "correct" answers even in the few top predictions would seem useful. Again, as shown in my prediction example above.
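
As a sketch of what such a measure could look like (not something I ran for this post), here is a top-k "hit rate" over the predicted probabilities, where probs is the model.predict() output and y_true holds the integer category indices (both names are assumptions here):

import numpy as np

def top_k_hit_rate(probs, y_true, k=5):
    #indices of the k highest-probability categories for each sample
    top_k = np.argsort(probs, axis=1)[:, -k:]
    #a hit = the true category is somewhere in those top k suggestions
    hits = [y in row for y, row in zip(y_true, top_k)]
    return np.mean(hits)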

With that, the 50% chance of getting it 100% correct for the top prediction actually seems very good. Consider that there are over 160 possible targets to predict. So not nearly as simple as a binary classifier, where 50% would match random guessing. Here this is much better than that.

Besides the LSTM, I also tried a more traditional classifier on this same task. I tried several versions, including multinomial naive bayes, random forest, and lgbm. The features I provided were heavily pre-processed word tokens in TF-IDF form. Those classifiers were pretty poor, perhaps even uselessly poor. Largely perhaps due to my lack of skill in feature engineering for this domain, and the lack of hyperparameter optimization. But that was the result. With that background, I was surprised to see good performance from the LSTM version.

Overall, the predictions given by this classifier I built are not perfect, but much closer than I expected to get. Getting the correct one out of 160+ categories most of the time, and often being close to the correct one, based only on the natural language description, was a very nice result for me.

In cases where the match is not perfect, I believe it still provides valuable input. Either the match is given in the other top suggestions, or maybe we can learn something by considering why some of the others are suggested, and whether there is some real meaning behind it. All the mismatches I found made sense when considering the reported issue from different angles. I would guess the same might hold for other domains and similar classifiers as well.

Many other topics to investigate would include:

  • different types of model architectures (layers, neuron counts, GRU, 1D CNN, …)
  • Attention layers (Keras still does not include support but they are very popular now in NLP)
  • Different dimensions of embeddings
  • Different embeddings initializers (Word2Vec)
  • effects of more preprocessing
  • N-way cross-validation
  • training the final classifier on the whole training data at once, after finishing the model tuning etc.

Learning to LSTM

What is this

This is about time-series prediction/classification with neural networks using Keras. I will not go into theory or description of recurrent neural nets or LSTM itself; rather, there are plenty of tutorials out there, and search engines give plenty more. Try some if you are not already familiar. I just try to focus on what I found confusing after reading those, and how that went for me.

EDIT: Links Kaggle kernel, Github

Overly long intro

Recently I have been trying to learn to use LSTM (Long Short Term Memory) networks. I picked up the Kaggle VSB Power Line Fault Detection competition. The point of this competition was to classify power line signals as faulty or not faulty. The given data looks like this:

signal plot

Just based on the raw signal data, the challenge in this competition was to classify each signal as faulty/not faulty.

As shown in the figure above, there are 800k values per signal in the data. There are 8712 signals in the training set. These signals are for 3-phase power measures. Always 3 signals together to form a single 3-phase "measurement" or observation. The total number of such 3-phase groups is 8712/3 = 2904. The data looks something like this:

train_meta = pd.read_csv("../input/metadata_train.csv")
#There are 3 rows/signals per measurement, so 6 prints first 2
train_meta.head(6)

The shape of the (training) data is 800k rows and 8712 columns, one column for each signal, and one row for each measured signal value (the 800k values of a signal cover a 20 millisecond period).

df_sig.shape

(800000, 8712)

A few rows for the first two 3-signal sets (columns 1-3 and 4-6):

The goal is to use this raw data to identify fault patterns. LSTM networks were very popular in this competition, as the data is a set of 8712 time-series instances. So it seemed a good place to learn how to use LSTM.

I like Kaggle in general for this, as there are good kernels to get started with, and discussion on what works. Of course, I always do poorly in the competitions, but I find it good practice anyway.

Input shape

One thing I found confusing (and, based on my internet searches, so do many others) is how to shape and form the input for the LSTM network, both to train it and to run the "predictions".

The input shape should be a 3-dimensional array of (number of observations, number of timesteps, number of features). Using the data from above, what would that be?

Each signal would be one observation. In the way the above data is structured, this would be 8712 observations, 800000 timesteps, and 1 feature (the raw value). So the input shape would be (8712, 800000, 1).

This is because we have:

  • 8712 separate signal observations,
  • 800000 measurements for each signal (covering a 20 millisecond period, so a time-series),
  • 1 feature, which is just the raw measurement value for the signal.

Number of timesteps

I initially tried to train the network with this setup. It didn't go very well. I found some links saying the number of timesteps should be much smaller, such as below 200, max around 250-500, 10 steps or less, or up to 1000 timesteps. Not really sure which of those to believe, but certainly much less than 800k.

One of the most popular kernels in the competition was using a 160 timestep version. How do you go from 800k timesteps to 160 timesteps? By "binning" the 800k into 160 buckets. The size of such bucket is 800000 / 160 = 5000. In Python this looks something like this:

bkt_count = 160
data_size = 800000
bkt_size = int(data_size/bkt_count)
for i in range(0, data_size, bkt_size):
    # cut data to bkt_size (bucket size)
    bkt_data_raw = data_sig[i:i + bkt_size]
    #now here bkt_data_raw needs to be summarized

The above would process the data into a set of 160 buckets, so producing the 160 timesteps.

Features

We can then calculate various summary statistics for these "buckets" as features, for example:

bkt_data_raw = data_sig[i:i + bkt_size]
bkt_avg_raw = bkt_data_raw.mean() #1
bkt_sum_raw = bkt_data_raw.sum() #1
bkt_std_raw = bkt_data_raw.std() #1
bkt_std_top = bkt_avg_raw + bkt_std_raw #1
bkt_std_bot = bkt_avg_raw - bkt_std_raw #1
bkt_percentiles = np.percentile(bkt_data_raw, [0, 1, 25, 50, 75, 99, 100]) #7

The above gives 1+1+1+1+1+7 = 12 features. I had a few more features, and experimented with various others, but this works to illustrate the concept.
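
To collect those 12 values into a single feature row for the bucket, something like this works (a small sketch using the variable names from the snippet above):

import numpy as np

bkt_features = np.concatenate([
    [bkt_avg_raw, bkt_sum_raw, bkt_std_raw, bkt_std_top, bkt_std_bot],
    bkt_percentiles])
#5 single summary values + 7 percentiles = 12 features
assert bkt_features.shape == (12,)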

So, assume we have the 160 timesteps with 12 summary features each. This would mean the LSTM input shape is now: (8712, 160, 12).

Transforming the input to 3-dimensions

Assume we loop through all the data (800k values) for a signal, and turn each into 160 rows with 12 features. How do we turn that into (8712, 160, 12) for the LSTM? The key (for me) is in the numpy.reshape() function. We can do something like this:

X = df_train.values.reshape(measures, timesteps, total_features)

This works nicely because underneath most things in Python ML libraries rely on Numpy arrays (or arrays of arrays etc.). Assuming that df_train in the code above has a number of values that is divisible by the given reshape sizes, Numpy will reshape (as the name says..) it into a set of multi-dimensional arrays as requested.

In the above, the number of elements for one signal is 160*12 = 1920, because a signal now has 160 timesteps (buckets), each with 12 features. Since we know the number of signals we expect to have, the total number of values can be calculated as 8712*1920 = 16727040.

There are many ways to build the initial set of features. I started with putting all the features for all timesteps of a signal on a single row. So I had rows with 1920 columns, and 8712 such rows. Reshaping in the end worked fine and produced the correct shape of (8712, 160, 12).

However, with such a shape, it was a bit difficult to apply Pandas operations on the data, such as scaling all the features using sklearn scalers. Pandas operations expect a dataframe to have 2-dimensional data, where each column has data for a single feature. For example, the scalers scale one column at a time. With 1920 columns, the data for each feature was in every 12th column.

To address this, I ended up with the data in a format of 160 * 8712 = 1393920 rows. Each row has N columns, where N equals the number of features I had, in this case 12. The 160 timesteps for each signal are each on their own row, with the 160 rows for one signal following each other, immediately followed by another 160 rows for the next signal, and so on. The final result is a dataframe with a shape of (1393920, 12), which is the 160 timesteps for all 8712 signals.

This would look something like this for the first signal:

and for the second signal:

and so on for all 8712 signals in chunks of 160 rows.. So each feature in its own column, each timestep on its own row.
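
Putting the pieces together, a hedged sketch of building that long-format dataframe. Here bucket_features() is a simplified stand-in for my actual feature code, and df_sig, data_size, and bkt_size are as in the earlier snippets:

import numpy as np
import pandas as pd

def bucket_features(bkt):
    #simplified stand-in: 5 summary values + 7 percentiles = 12 features
    percentiles = np.percentile(bkt, [0, 1, 25, 50, 75, 99, 100])
    return np.concatenate([[bkt.mean(), bkt.sum(), bkt.std(),
                            bkt.mean() + bkt.std(), bkt.mean() - bkt.std()],
                           percentiles])

data_all = df_sig.values                      #raw signals, shape (800000, 8712)
rows = []
for sig in range(data_all.shape[1]):          #one column per signal
    data_sig = data_all[:, sig]
    for i in range(0, data_size, bkt_size):   #160 buckets of 5000 values each
        rows.append(bucket_features(data_sig[i:i + bkt_size]))
df_train = pd.DataFrame(rows)                 #shape (160 * 8712, 12) = (1393920, 12)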

Since all the features are in the same column for all the signals and all timesteps, general transformation tools such as sklearn scalers can be applied very simply:

minmax = MinMaxScaler(feature_range=(-1,1))
df_train_scaled = pd.DataFrame(minmax.fit_transform(df_train))

This scales all the features for all timesteps and all observations (signals here) at once correctly. Well, I think it does anyway, so correct me if wrong :).

The scaled version for the 1st signal:

And to reshape it into the 3-dimensional array shaped (8712, 160, 12):

observations = 8712
timesteps = 160
total_features = 12
X = df_train_scaled.values.reshape(observations, timesteps, total_features)

dataframe.values is a Numpy array, so the Numpy reshape function is available on it.

X would now be suitable to use as input for an LSTM network. The one I have not shown how to build yet. Only a million words and stuff so far but no LSTM and this was to be all about LSTM. Nice. Good job.

Parallelizing pre-processing

On Kaggle, I eventually built my own kernel for this pre-processing to create features and scale them, and to save the results for later model experiments. It uses Python multiprocessing to create the buckets and features. I found this to have multiple benefits:

  • The transformation of the raw input data can be done once and used many times to experiment with different LSTM and other ML models and NN architectures. Since pre-processing large datasets takes time, this speeds up the process immensely.

  • Pandas is built to be single-core (as is Python itself..), so you have to do tricks like this or else your multiple cores are just sitting idly and wasted.

  • Kaggle specific: By running preprocessing in a separate kernel, I can run it in parallel in one kernel while experimenting with models in other kernels.

  • Kaggle specific: Kaggle CPU kernels have 4 CPU cores, allowing roughly 2x faster preprocessing than GPU kernels, which have only 2 CPU cores. But you need the GPU kernels to build LSTM models.

In some parallel architectures like PySpark this would be less of a problem, but I do not have access to such systems, so I work with what I have, huh. Maybe someday if I am rich enough or whatever. Plz send food, beer, monies, and epic computer parts, thx.
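
As a rough sketch of what that multiprocessing looks like (not the actual kernel code; toy random signals and a reduced feature set, just to show the Pool usage):

from multiprocessing import Pool
import numpy as np
import pandas as pd

BKT_COUNT = 160
BKT_SIZE = 800000 // BKT_COUNT

def bucket_signal(data_sig):
    #summarize one raw signal into 160 rows of (here simplified) features
    rows = []
    for i in range(0, len(data_sig), BKT_SIZE):
        bkt = data_sig[i:i + BKT_SIZE]
        rows.append([bkt.mean(), bkt.std(), bkt.min(), bkt.max()])
    return pd.DataFrame(rows, columns=["mean", "std", "min", "max"])

if __name__ == "__main__":
    signals = [np.random.randn(800000) for _ in range(8)]  #toy stand-in data
    with Pool(4) as pool:  #4 workers to match the 4 CPU cores of a Kaggle CPU kernel
        parts = pool.map(bucket_signal, signals)
    df_features = pd.concat(parts, ignore_index=True)
    print(df_features.shape)  #(8 * 160, 4)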

Building the LSTM model

I made several attempts at building a model. Here is one quite directly based on the popular Kaggle kernel I linked at the beginning:

def create_model4(input_data):
    input_shape = input_data.shape
    inp = Input(shape=(input_shape[1], input_shape[2],), name="input_signal")
    x = Bidirectional(CuDNNLSTM(128, return_sequences=True, name="lstm1"), name="bi1")(inp)
    x = Bidirectional(CuDNNLSTM(64, return_sequences=False, name="lstm2"), name="bi2")(x)
    x = Dense(128, activation="relu", name="dense1")(x)
    x = Dense(64, activation="relu", name="dense2")(x)
    x = Dense(1, activation='sigmoid', name="output")(x)
    model = Model(inputs=inp, outputs=x)
    #matthews_correlation is a custom metric function (not a Keras built-in), defined in the referenced kernel
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[matthews_correlation])
    return model

Few points here:

  • CuDNNLSTM is a specialized LSTM layer optimized for NVidia GPUs. I always thought Keras would automatically pick the optimal GPU implementation based on the backend. Seems not.

  • CuDNN is NVIDIA’s Deep Neural Network library.

  • Bidirectional LSTM refers to a network using information about "past and future". This is only useful if you actually know the future, as in translating text from one language to another and you know what words will follow the current word. In this case, I know the whole signal, so the LSTM can also look at what values follow the current "bin" value in the following bins (timesteps).

  • The "return_sequences" flag tells the layer to give out a similar 3D array as the original input. So each of the 128 cells in the first layer will not just output on value for the sequence but also output an intermediate value for each timestep. So instead of shape (128), the output would be (160, 128).

  • The second LSTM layer outputs a shape of (64), which is just the final output of the LSTM processing the timesteps. This is suitable shape for the following dense layer, since it is just an array.

But let’s see what the model actually looks like.

Inspecting the model

First, a one-directional model to keep it a bit simpler still:

def create_model3(input_data):
    input_shape = input_data.shape
    inp = Input(shape=(input_shape[1], input_shape[2],), name="input_signal")
    x = CuDNNLSTM(128, return_sequences=True, name="lstm1")(inp)
    x = CuDNNLSTM(64, return_sequences=False, name="lstm2")(x)	    
    x = Dense(128, activation="relu", name="dense1")(x)
    x = Dense(64, activation="relu", name="dense2")(x)
    x = Dense(1, activation='sigmoid', name="output")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[matthews_correlation])
    return model

Summarizing the model:

model.summary()

Gives:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_signal (InputLayer)    (None, 160, 12)           0         
_________________________________________________________________
lstm1 (CuDNNLSTM)            (None, 160, 128)          72704     
_________________________________________________________________
lstm2 (CuDNNLSTM)            (None, 64)                49664     
_________________________________________________________________
dense1 (Dense)               (None, 128)               8320      
_________________________________________________________________
dense2 (Dense)               (None, 64)                8256      
_________________________________________________________________
output (Dense)               (None, 1)                 65        
=================================================================
Total params: 139,009
Trainable params: 139,009
Non-trainable params: 0
_________________________________________________________________

And to visualize this (in Jupyter):

from keras.utils import plot_model
plot_model(model, show_shapes=True, to_file="lstm.png")
from IPython.display import Image
Image("lstm.png")

Giving us:

This shows how the LSTM parameters affect the basic structure of the network.

  • input_signal: layer shows the shape of the input data: 160 time steps, 12 features. None for any number of rows (observations).

  • lstm1: 128 LSTM units, with return_sequences=True. Now the output is (None, 160, 128), where 128 matches the number of LSTM units, and replaces the number of features in the input. So 128 features, each one produced by a single LSTM "unit".

  • lstm2: 64 LSTM units, with return_sequences=False. Output shape is (None, 64). Outputting values at every timestep is disabled (return_sequences=False), so this is just 2 dimensional output. 64 features, one for each LSTM "unit". This for each row in the input as it comes through lstm1.

  • dense1, dense2, output: regular dense network layers.

And the same for a bi-directional LSTM:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_signal (InputLayer)    (None, 160, 60)           0         
_________________________________________________________________
bi1 (Bidirectional)          (None, 160, 256)          194560    
_________________________________________________________________
bi2 (Bidirectional)          (None, 128)               164864    
_________________________________________________________________
dense1 (Dense)               (None, 128)               16512     
_________________________________________________________________
dense2 (Dense)               (None, 64)                8256      
_________________________________________________________________
output (Dense)               (None, 1)                 65        
=================================================================
Total params: 384,257
Trainable params: 384,257
Non-trainable params: 0
_________________________________________________________________

So bi-directional doubles the number of features the LSTM layers produce, as it goes both forward and backward across the timesteps. (This summary is also from a run with a larger feature set, 60 features per timestep instead of the 12 used in the examples above, hence the (None, 160, 60) input shape.)

Training and predicting

Training and predicting with the LSTM model is no different from others, but a quick look anyway.

Training:

splits = list(StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=2019).split(X, y))
preds_val = []
y_val = []
for idx, (train_idx, val_idx) in enumerate(splits):
    K.clear_session()
    print("Beginning fold {}".format(idx+1))
    train_X, train_y, val_X, val_y = X[train_idx], y[train_idx], X[val_idx], y[val_idx]
    model = model_lstm(train_X.shape)
    ckpt = ModelCheckpoint('weights_{}.h5'.format(idx), save_best_only=True, save_weights_only=True, verbose=1, monitor='val_matthews_correlation', mode='max')
    model.fit(train_X, train_y, batch_size=128, epochs=50, validation_data=[val_X, val_y], callbacks=[ckpt])
    model.load_weights('weights_{}.h5'.format(idx))

Prediction:

preds = []
for i in range(N_SPLITS):
    model.load_weights('weights_{}.h5'.format(i))
    pred = model.predict(X_test_input, batch_size=300, verbose=1)
    pred_3 = []
    for pred_scalar in pred:
        for i in range(3):
            pred_3.append(pred_scalar)
    preds.append(pred_3)
threshold = 0.5
preds_test = (np.squeeze(np.mean(preds, axis=0)) > threshold).astype(np.int)
submission['target'] = preds_test
submission.to_csv('submission_{}.csv'.format(seed), index=False)
submission.head()

With this, I managed to train the LSTM and have it produce actually meaningful results. Not that I got anywhere in the competition, but it was a good experience for me, since in my first tries I produced some completely broken input data, formatted wrong and all. The training with that broken data delivered zero results, which might be expected since the features were all wrong and the data for them was mixed up. I feel I mostly learned to use an LSTM, even if I failed the Kaggle competition by the smallest fractions of the target metric.

From all this, I wandered a bit to look at other things, such as what the actual structure of an LSTM network is when built with Keras, how to combine other types of data and layers alongside LSTM in the same network, and what adversarial validation is. Maybe I will manage to write a bit about them next.

My Pandas Cheat Sheet

https://stackoverflow.com/questions/36794433/python-using-multiprocessing-on-a-pandas-dataframe http://www.racketracer.com/2016/07/06/pandas-in-parallel/

  import pandas as pd

  pd.__version__

Creating Dataframes

  df = pd.DataFrame()

select multiple columns as a dataframe from a bigger dataframe:

  df2 = df[['Id', 'team', 'winPlacePerc']]

select a single column as a dataframe:

  df2 = df[['name']] 
#double square brackets make the results dataframe, 
#single makes it series

pandas axis:

  axis 1 = columns, axis 0 = rows

get a series from a dataframe column filtered by another column:

  zero_names = df[df["weights"] < 10000]["names"]

turn it into a list:

  zn_list = zero_names.tolist()

N-way split:

  X_train, X_test, y_train, y_test, r_train, r_test = 
   model_selection.train_test_split(X, y, r, 
      test_size=0.25, random_state=99)

load only part of data: nrows
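
for example (the file name here is just a placeholder):

  df = pd.read_csv("train.csv", nrows=10000)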

Accessing Values

find row at 1111th index in dataframe:

  df.iloc[1111]

find row with value 1111 for index in dataframe:

  df.loc[1111]

set value in cell (stackoverflow):

  children_df.at[555, "visited"] = True 
    #555 here refers to row index

modify row while iteration over it:

  for index, row in df.iterrows():
    df.at[index,'open_change_p1'] = day_change_op1

drop last 5 columns

  df_train.drop(df_train.columns[-5:], axis=1)

print numpy data types in multidimensional array (in jupyter return value is printed):

[type(row) for row in values[0]]

Aggregates

calculate mean for ‘team’ and ‘winPlacePerc’ columns, after grouping them by match id and group id:

  agg_cols = ['groupId', 'matchId', 'team', 'winPlacePerc']
  df_mean = df_test[agg_cols].groupby(["groupId", "matchId"],
    as_index=False).mean().add_suffix("_mean")

run multiple such aggregations at once:

  agg = df_train.groupby(['matchId'])
    .agg({'players_in_team': ['min', 'max', 'mean', 'median']})

specify the name suffixes of the generated aggregation column names:

  agg_train = df_train[agg_cols].groupby(["groupId", "matchId"], 
                        as_index=False)
    .agg([('_min', 'min'), ('_max', 'max'), ('_mean', 'mean')])

multiple aggregations will create a 2-level column header (see df.head()). to change it into one level (for merge etc):

  mi = agg_train.columns #mi for multi-index
  ind = pd.Index([e[0] + e[1] for e in mi.tolist()])
  agg_train.columns = ind

custom aggregation function:

  def q90(x):
      return x.quantile(0.9)

  agg = df_train.groupby(['matchId'])
    .agg({'players_in_team': 
       ['min', 'max', 'mean', 'median', q90]})

create new column as number of rows in a group:

  agg = df_train.groupby(['matchId'])
          .size().to_frame('players_in_match')

group and sum column for groups:

  revenues = df_train.groupby("geoNetwork_subContinent")
     ['totals_transactionRevenue'].sum()
     .sort_values(ascending=False)

Merge

merge two dataframes (here df_train and agg) by a single column:

  df_train = df_train.merge(agg, how='left', on=['groupId'])

merge on multiple columns:

  dfg_train = dfg_train.merge(agg_train, how='left',
     on=["groupId", "matchId"])

set merge suffixes = ["", "_rank"] <- left and right side on overlapping columns

  dfg_test = dfg_test.merge(agg_test_rank, suffixes=["", "_rank"],
        how='left', on=["groupId", "matchId"])

above sets columns from dfg_test to have no suffix ("") and columns from agg_test_rank to have suffix "_rank"

merge columns by value:

  df_train['rankPoints'] = np.where(df_train['rankPoints'] < 0,
       df_train['winPoints'], df_train['rankPoints'])

->above sets value as winpoints if rankpoints is below zero, else rankpoints

merge by defining the column names to match on left and right:

  pd.merge(left, right, left_on='key1', right_on='key2')

merge types (the how=merge_type in pd.merge)(link):

  • inner: keep rows that match in both left and right
  • outer: keep all rows in both left and right
  • left: keep all rows from left and matching ones from right
  • right: keep all rows from right and matching ones from left

Rename, Delete, Compare Dataframes

rename columns to remove a suffix (here remove _mean):

  df_mean = df_mean.rename(columns={'groupId_mean': 'groupId',
               'matchId_mean': 'matchId'}) 

delete dataframes to save memory:

  del df_agg

find all columns in one dataframe but not in another

  diff_set = set(train_df.columns).difference(set(test_df.columns))
  print(diff_set)

find the columns of a dataframe that are also in a given list (here an exclude list) and drop them

  common_set = set(train_df.columns).intersection(set(FEATS_EXCLUDED))
  train_df = train_df.drop(common_set, axis=1)

drop rows with index in list (stackoverflow):

  df.drop(df.index[[1,3]])

replace nulls/nans with 0:

  X.fillna(0)
  X_test.fillna(0)

only for specific columns:

  df[['a', 'b']] = df[['a','b']].fillna(value=0)

drop rows with null values for specific column:

  df_train.dropna(subset=['winPlacePerc'], inplace=True)

drop columns B and C:

  df.drop(['B', 'C'], axis=1)
  df.drop(['B', 'C'], axis=1, inplace = True)

Dataframe Statistics

this df has 800k rows (values) and 999 columns (features):

  df_train_subset.shape
  (800000, 999)

data types:

  df_train.dtypes

number of rows, columns, memory size (light, fast):

  df.info()

statistics per feature/column (cpu/mem heavy):

  df.describe()

number of unique values:

  df.nunique()

bounds:

  df.min()
  df.max()

replace infinity with 0. esp scalers can have issues with inf:

    X[col] = X[col].replace(np.inf, 0)

replace positive and negative inf with nan:

  df_pct.replace([np.inf, -np.inf], np.nan)

number of non-nulls per row in dataframe:

  df_non_null = df.notnull().sum(axis=1)

find the actual rows with null values:

  df_train[df_train.isnull().any(axis=1)]

number of times different values appear in column:

  df_train["device_operatingSystem"].value_counts()

sorted by index (e.g., 0,1,2,…)

  phase_counts = df_tgt0["phase"].value_counts().sort_index()

number of nans per column in dataframe:

  df.isnull().sum()

find columns with only one unique value

  const_cols = [c for c in train_df.columns 
                if train_df[c].nunique(dropna=False)==1 ]

drop them

  train_df = train_df.drop(const_cols, axis=1)

replace outlier values over 3std with 3std:

  upper = df_train['totals_transactionRevenue'].mean()
            +3*df_train['totals_transactionRevenue'].std()
  mask = np.abs(df_train['totals_transactionRevenue']
           -df_train['totals_transactionRevenue'].mean()) >
           (3*df_train['totals_transactionRevenue'].std())
  df_train.loc[mask, 'totals_transactionRevenue'] = upper

or use zscore:

  df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)

from scipy (stackoverflow)

  from scipy import stats
  import numpy as np
  z = np.abs(stats.zscore(boston_df))
  print(z)

show groupby object data statistics for each column by grouped element:

  grouped.describe()

create dataframe from classifier column names and importances (where supported), sort by weight:

  df_feats = pd.DataFrame()
  df_feats["names"] = X.columns
  df_feats["weights"] = clf.feature_importances_
  df_feats.sort_values(by="weights")

lag columns to show absolute value they changed over rows:

  for col in input_col_names:
    df_diff[col] = df_sig[col].diff().abs()

unique datatypes in dataframe:

  train_df.dtypes.unique()

show columns of selected type:

  train_df.select_dtypes(include=['float64'])

Numpy

Get numpy matrix from dataframe:

  data_diffs = df_diff.values

get column from numpy matrix as row

  for sig in range(0, 3):
      data_sig = data_measure[:,sig]

slice numpy array:

  bin_data_raw = data_sig[i:i + bin_size]

calculate statistics over numpy array:

  bin_avg_raw = bin_data_raw.mean()
  bin_sum_raw = bin_data_raw.sum()
  bin_std_raw = bin_data_raw.std()
  bin_percentiles = np.percentile(bin_data_raw, [0, 1, 25, 50, 75, 99, 100])
  bin_range = bin_percentiles[-1] - bin_percentiles[0]
  bin_rel_perc = bin_percentiles - bin_avg_raw

count outliers at different scales:

  outliers = []
  for r in range(1, 7):
      t = bin_std_diff * r
      o_count = np.sum(np.abs(bin_data_diff-bin_avg_diff) >= t)
      outliers.append(o_count)

concatenate multiple numpy arrays into one:

  bin_row = np.concatenate([raw_features, diff_features, bin_percentiles, bin_rel_perc, outliers])

limit values between min/max:

  df_sig.clip(upper=127, lower=-127)

value range in column (ptp = peak to peak):

  df.groupby('GROUP')['VALUE'].agg(np.ptp)

replace nan:

  my_arr[np.isnan(my_arr)] = 0

Time-Series

create values for percentage changes over time (row to row):

  df_pct = pd.DataFrame()
  for col in df_train_subset.columns[:30]:
      df_pct['pct_chg_'+col] =
          df_train_subset[col].pct_change().abs()

pct_change() gives the change in percentage over time, abs() makes it absolute if just looking for overall change as in negative or positive.

also possible to set number of rows to count pct_change over:

  df_pct['pct_chg_'+col] =
          df_train_subset[col].pct_change(periods=10)

pct_change on a set of items with specific values:

  df['pct_chg_open1'] = df.groupby('assetCode')['open']
           .apply(lambda x: x.pct_change())

TODO: try different ways to calculate ewm to see how this all works (EWMA, stackoverflow: https://stackoverflow.com/questions/37924377/does-pandas-calculate-ewm-wrong):

  df["close_ewma_10"] = df.groupby('assetName')['pct_chg_close1']
            .apply(lambda x: x.ewm(span=10).mean())

shift by 10 backwards

  y = mt_df["mket_close_10"].shift(-10)

Date Types

convert seconds into datetime

  start_col = pd.to_datetime(df_train.visitStartTime, unit='s')

parse specific columns as specific date types:

  df_train = pd.read_csv("../input/train-flat.csv.gz",
                         parse_dates=["date"], 
                         dtype={'fullVisitorId': 'str',
                              'date': 'str'
                             },
                        )

or if need to parse specific format:

  pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')

access elements of datetime objects:

  data_df["weekday"] = data_df['date'].dt.weekday
  data_df["hour"] = data_df['date'].dt.hour

Data Type Manipulations

one-hot encode / convert categorical:

  dfg_train = pd.get_dummies( dfg_train, columns = ['matchType'] )
  dfg_test = pd.get_dummies( dfg_test, columns = ['matchType'] )

set category value by range:

  here 1 = solo game, 2 = duo game, 3 = squad, 4 = custom
  df_train['team'] = 
     [1 if i == 1 else 2 if i == 2 else 4 if i > 4 else 3 
              for i in df_train["players_in_team_q90"]]

calculate value over two columns and make it a new column:

  dfg_train['killsNorm'] = 
      dfg_train['kills_mean']*
      ((100-dfg_train['players_in_match'])/100 + 1)

  data_df['hit_view_ratio'] =
      data_df['totals_pageviews']/data_df['totals_hits']

mark a set of columns as category type:

  for col in X_cat_cols:
      df_train[col] = df_train[col].astype('category')

set category value based on set of values in column:

  X_test['matchType'] = X_test['matchType'].astype('str')
  X_test.loc[X_test['matchType'] == "squad-fpp", 'matchType'] =
     "squad"
  X_test['matchType'] = X_test['matchType'].astype('category')

how to get the indices from a dataframe as a list (e.g., collect a subset):

  list(outlier_collection.index.values)

to see it all on one line in Jupyter (easier copying, e.g. to drop all in list):

print(list(outlier_collection.index.values))

Jupyter

show dataframe as visual table in notebook (stackoverflow):

  from IPython.display import display, HTML

  display(df1)
  display(HTML(df2.to_html()))

pycharm notebook (jetbrains help):

correlations, unstacking correlation matrix link
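
a small sketch of that unstacking (df being any numeric dataframe):

  corr_pairs = df.corr().unstack().sort_values(ascending=False)
  #drop the trivial 1.0 self-correlations
  corr_pairs = corr_pairs[corr_pairs < 1]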

Memory Reducer (from a Kaggle kernel):

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object and col_type.name != 'category':
            #print(col_type)
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

Investigating ROC/AUC

Executive Summary

ROC and AUC are terms that often come up in machine learning, in relation to evaluating models. In this post, I try to examine what ROC curves actually are, how they are calculated, what the threshold in a ROC curve is, and how changing it impacts the classification. The results show that using low threshold values leads to classifying more objects into the positive category, while high thresholds lead to more negative classifications. The Titanic dataset is used here as an example to study the threshold effect. A classifier gives probabilities for people to survive.

By default, the classifier seems to use 50% probability as a threshold to classify one as a survivor (positive class) or non-survivor (negative class). But this threshold can be varied. For example, using a threshold of 1% leads to giving a survivor label to anyone who gets a 1% or higher probability to survive by the classifier. Similarly, a threshold of 95% requires people to get at least 95% probability from the classifier to be given a survivor label. This is the threshold that is varied, and the ROC curve visualizes how the change in this probability threshold impacts classification in terms of getting true positives (actual survivors predicted as survivors) vs getting false positives (non-survivors predicted as survivors).

For the loooong story, read on. I dare you. 🙂

Introduction

Area under the curve (AUC) and receiver operating characteristic (ROC) are two terms that seem to come up a lot when learning about machine learning. This is my attempt to get it. Please do let me know where it goes wrong..

The ROC is typically drawn as a curve, and the AUC is the area under it, using some figure like this (from Wikipedia):

Roccurves
FIGURE 1. Example ROC/AUC curve.

It uses true positive rate (TPR) and false positive rate (FPR) as the two measures to compare. A true positive (TP) being a correct positive prediction and a false positive (FP) being a wrong positive prediction.

There is an excellent introduction to the topic in the often cited An introduction to ROC analysis article by Tom Fawcett. It is actually a very understandable article (for the most part), unlike most academic articles I read. Brilliant. But still, I wondered, so let's see..

To see ROC/AUC in practice, instead of just reading about it all over the Internet, I decided to implement a ROC calculation myself and see if I could match it with the existing implementations. If so, that should at least give me some confidence that my understanding of the topic is correct.

Training a classifier to test ROC/AUC

To do this, I needed a dataset and a fitting of some basic classifier on that data to run my experiments. I have recently been going through some PySpark lectures on Udemy, so I could maybe learn some more big data stuffs and get an interesting big data job someday. Woohoo. Anyway. The course was using the Titanic dataset, so I picked that up, and wrote a simple classifier for it in Pandas/Scikit. Being a popular dataset, there is also a related Kaggle for it. Which is always nice, allowing me to create everything a bit faster by using the references, and focus on the ROC investigation faster.

The classifier I used is Logistic Regression, and the notebook is available on my GitHub (GitHub sometimes seems to have issues rendering it, but it's there; I will look into putting it also on Kaggle later). Assuming some knowledge of Python, Pandas, sklearn, and the usual stuff, the training of the model itself is really the same as usual:

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=2)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

The above is of course just a cut off into the middle of the notebook, where I already pre-processed the data to build some set of features into the features variable, and took the prediction target (survived) into the target variable. But that is quite basic ML stuff you find in every article on the topic, and on many of the “kernels” on the Kaggle page I linked. For more details, if all else fails, check my notebook linked above.

The actual training of the classifier (after preprocessings..) is as simple as the few lines above. That’s it. I then have a Logistic Regression classifier I can use to try and play with ROC/AUC. Since ROC seems to be about playing with thresholds over predicted class probabilities, I pull out the target predictions and their given probabilities for this classifier on the Titanic dataset:

predictions = logreg.predict(X_test)
predicted_probabilities = logreg.predict_proba(X_test)

Now, predicted_probabilities holds the classifier predicted probability values for each row in the dataset. 0 for never going to survive, 1 for 100% going to survive. Just predictions, of course, not actual facts. That’s the point of learning a classifier, to be able to predict the result for instance where you don’t know in advance, as you know.. 🙂

Just to see a few things about the data:

predicted_probabilities.shape

>(262, 2)

X_test.shape

>(262, 9)

This shows the test set has 262 rows (data items), there are 9 feature variables I am using, and the prediction probabilities are given in 2 columns. The classifier is a binary classifier, giving predictions for a given data instance as belonging to one of the two classes. In this case it is survived and not survived. True prediction equals survival, false prediction equals non-survival. The predicted_probability variable contains probabilities for false in column 0 and true in column 1. Since probability of no survival (false) is the opposite of survival (1-survival_probability, (true)), we really just need to keep one of those two columns. Because if it is not true, it has to be false. Right?

Cutting the true/false predictions to only include the true prediction, or the probability of survival (column 1, all rows):

pred_probs = predicted_probabilities[:, 1]

Drawing some ROC curves

With this, getting the ROC curve from sklearn is as simple as:

[fpr, tpr, thr] = roc_curve(y_test, pred_probs)

The Kaggle notebook I used as a reference visualizes this, and also draws two fancy-looking dashed blue lines to illustrate a point on the ROC curve where 95% of the surviving passengers were identified. This point is identified by:

# index of the first threshold for which the TPR > 0.95
idx = np.min(np.where(tpr > 0.95))

So it is looking for the minimum index in the TPR list, where the TPR is above 95%. This results in a ROC curve drawing as:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr, tpr, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot([0,fpr[idx]], [tpr[idx],tpr[idx]], 'k--', color='blue')
plt.plot([fpr[idx],fpr[idx]], [0,tpr[idx]], 'k--', color='blue')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()

The resulting figure is:

roc0
FIGURE 2. The ROC curve I nicked from Kaggle.

How to read this? In the bottom left we have zero FPR and zero TPR. This means everything is predicted as false (no survivors). This would correspond to a threshold of 100% (1.0), meaning you just classify everything as false, because no-one gets over 100% probability to survive. In the top right corner both TPR and FPR are 1.0, meaning you classify everyone as a survivor and no-one as a non-survivor. So false positives are maximized as are true positives, since everyone is predicted to survive. This is the result of a threshold of 0%, as everyone gets a probability higher than 0% to survive. The blue lines indicate a point where over 95% of true positives are identified (TPR value), simultaneously leading to getting about 66% FPR. Of course, the probability threshold is not directly visible in this figure but has to be otherwise looked up, as we shall see in all the text I wrote below.

To see the accuracy and AUC calculated for the base (logistic regression) classifier, and the AUC calculated for the ROC in the figure above:

print(logreg.__class__.__name__+" accuracy is %2.3f" % accuracy_score(y_test, predictions))
print(logreg.__class__.__name__+" log_loss is %2.3f" % log_loss(y_test, pred_probs))
print(logreg.__class__.__name__+" auc is %2.3f" % auc(fpr, tpr))

>LogisticRegression accuracy is 0.832
>LogisticRegression log_loss is 0.419
>LogisticRegression auc is 0.877

And some more statistics from sklearn to compare against when trying to understand it all:

confusion_matrix(y_test, predictions)

>array([[156,  19],
>       [ 25,  62]])

precision_score(y_test, predictions)

>0.7654320987654321

accuracy_score(y_test, predictions)

>0.8320610687022901

In the confusion matrix, there are 156 correct non-survivor predictions (true negatives) and 19 actual non-survivors wrongly predicted as survivors (false positives). There are 62 correct survivor predictions (true positives) and 25 actual survivors wrongly predicted as non-survivors (false negatives). The precision of the classifier is given as 0.7654320987654321 and accuracy as 0.8320610687022901.
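
As a quick sanity check of those numbers (a small sketch using the same y_test and predictions as above):

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print(tp / (tp + fp))  #precision: 62 / 81 = 0.765...
print(tp / (tp + fn))  #true positive rate (recall): 62 / 87
print(fp / (fp + tn))  #false positive rate: 19 / 175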

To evaluate my ROC/AUC calculations, I first need to try to understand how the algorithms calculate the metrics, and what are the parameters being passed around. I assume the classification algorithms would use 0.5 (or 50%) as a default threshold to classify a prediction with a probability of 0.5 or more as 1 (predicted true/survives) and anything less than 0.5 as 0 (predicted non-survivor/false). So let’s try to verify that. First, to get the predictions with probability values >= 0.5 (true/survive):

#ss is my survivor-set
ss1 = np.where(pred_probs >= 0.5)
print(ss1)

>(array([  0,   1,   3,   4,   8,  13,  16,  26,  27,  30,  32,  42,  43,
>         46,  49,  50,  57,  59,  66,  67,  76,  77,  79,  83,  84,  85,
>         86,  92,  93,  98, 101, 109, 111, 113, 117, 121, 122, 125, 129,
>        130, 133, 137, 140, 143, 145, 147, 148, 149, 150, 152, 158, 170,
>        175, 176, 177, 180, 182, 183, 184, 186, 201, 202, 204, 205, 206,
>        211, 213, 215, 218, 227, 230, 234, 240, 241, 242, 246, 248, 249,
>        250, 252, 256]),)

len(ss1[0])

>81

The code above filters the predicted probabilities to get a list of values where the probability is higher than or equal to 0.5. At first look, I don't know what the numbers in the array are. However, my guess is that they are indices into the X_test array that was passed as the features to predict from. So at X_test indices 0,1,3,4,8,… are the data points predicted as true (survivors). And here we have 81 such predictions. That is the survivor predictions, how about non-survivor predictions?

Using an opposite filter:

ss2 = np.where(pred_probs < 0.5)
len(ss2[0])

>181

Overall, 81 predicted survivors, 181 predicted casualties. Since y_test here has known labels (to test against), we can check how many real survivors there are, and what the overall population is:

y_test.value_counts()

>0    175
>1     87

So the number of actual survivors in the test set is 87 vs 175 non-survivors. To see how many real survivors (true positives) there are in the predicted 81 survivors:

sum(y_test.values[i] for i in ss1[0])

>62

The above is just some fancy Python list comprehension or whatever that is for summing up all the numbers in the predicted list indices. Basically it says that out of the 81 predicted survivors, 62 were actual survivors. Matches the confusion matrix from above, so seems I got that correct. Same for non-survivors:

sum(1 for i in ss2[0] if y_test.values[i] == 0)

>156

So, 156 out of the 181 predicted non-survivors actually did not make it. Again, matches the confusion matrix.

Now, my assumption was that the classifier uses the threshold of 0.5 by default. How can I use these results to check if this is true? To do this, I try to match the sklearn accuracy score calculations using the above metrics. Total correct classifications from the above (true positives+true negatives) is 156+62. Total number of items to predict is equal to the test set size, so 262. The accuracy from this is:

(156+62)/262

>0.8320610687022901

This is a perfect match on the accuracy_score calculated above. So I conclude from this that I understood the default behaviour of the classifier correctly. Now to use that understanding to see if I got the ROC curve right. In FIGURE 2 above, I showed the sklearn generated ROC curve for this dataset and this classifier. Now I need to build my own from scratch to see if I get it right.

Thresholding 1-99% in 1% increments

To build a ROC curve myself using my own threshold calculations:

tpr2 = []
fpr2 = []
#do the calculations for range of 1-100 percent confidence, increments of 1
for threshold in range(1, 100, 1):
    ss = np.where(pred_probs >= (threshold/100))
    count_tp = 0
    count_fp = 0
    for i in ss[0]:
        if y_test.values[i] == 1:
            count_tp += 1
        else:
            count_fp += 1
    tpr2.append(count_tp/87.0)   #87 actual survivors in y_test
    fpr2.append(count_fp/175.0)  #175 actual non-survivors in y_test
    print("threshold="+str(threshold)+", count_tp="+str(count_tp)+", count_fp="+str(count_fp))

The above code takes as input the probabilities (pred_probs) given by the Logistic Regression classifier for survivors. It then tries the threshold values from 1 to 99% in 1% increments. This should produce the ROC curve points, as the ROC curve should describe the prediction TPR and FPR with different probability thresholds. The code gives a result of:

threshold=1, count_tp=87, count_fp=175
threshold=2, count_tp=87, count_fp=174
threshold=3, count_tp=87, count_fp=174
threshold=4, count_tp=87, count_fp=173
threshold=5, count_tp=87, count_fp=171
threshold=6, count_tp=87, count_fp=169
...
threshold=94, count_tp=3, count_fp=1
threshold=95, count_tp=1, count_fp=0
threshold=96, count_tp=0, count_fp=0
threshold=97, count_tp=0, count_fp=0
threshold=98, count_tp=0, count_fp=0
threshold=99, count_tp=0, count_fp=0

At the 1% threshold, everyone that the classifier gave a survival probability of 1% or higher is classified as a likely survivor. In this dataset, everyone gets at least a 1% survival probability, so everyone is classified as a likely survivor. This gives true positives for all 87 real survivors, but also false positives for all 175 non-survivors. At the 6% threshold, the FPR has gone down a bit, with only 169 non-survivors getting false-positive survivor predictions.

At the high end of the threshold range, the situation reverses. At a 94% threshold, only 3 true survivors get classified as likely survivors, meaning only 3 actual survivors scored a probability of 94% or more from the classifier. There is also one false positive at 94%, so one passenger who was given a really high survival probability did not make it. At 95% there is only one predicted survivor, a true positive. Above that, no-one scores 96% or more. Such high-confidence outliers would probably make interesting points to look into in more detail, but I won't go deeper in this post.
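If one did want to dig into those high-confidence predictions, the indices are easy to pull out. A small sketch, assuming X_test holds the test-set features:

high_conf = np.where(pred_probs >= 0.94)   #predictions with 94%+ survival probability
print(high_conf[0])                        #which test rows these are
print(y_test.values[high_conf[0]])         #their actual outcomes
#X_test.iloc[high_conf[0]] would show the corresponding passenger features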

Using the true positive rates and false positive rates from the code above (variables tpr2 and fpr2), we can make a ROC curve for these calculations as:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr2, tpr2, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr2, tpr2))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC test')
plt.legend(loc="lower right")
plt.show()

FIGURE 3. ROC curve for thresholds of 1-99% in 1% steps.

Comparing this with the ROC figure from sklearn (FIGURE 2), it is a near-perfect match. We can further visualize this by plotting the two on top of each other:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr2, tpr2, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr2, tpr2))
plt.plot(fpr, tpr, color='blue', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC test')
plt.legend(loc="lower right")
plt.show()

FIGURE 4. My 1-99% ROC vs sklearn ROC.

There are some very small differences: the sklearn version has more "stepped" lines, whereas the one I generated from the 1-99% thresholds draws the lines a bit straighter ("smoother"). Both end up with exactly the same AUC value (0.877). So I would call this ROC curve calculation solved, and claim that my own calculations match what sklearn does. Solved as in I finally seem to have understood what it exactly means.
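The match can also be checked numerically, without a plot, by computing both AUC values directly from the arrays above:

from sklearn.metrics import auc

#both should print the same value, about 0.877
print(auc(fpr, tpr))
print(auc(fpr2, tpr2))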

What is the difference?

Still, loose ends are not nice. So what is causing the small difference between the sklearn ROC lines and the ones from my code?

The FPR and TPR arrays/lists form the x and y coordinates of the graphs. So looking at those might help understand a bit:

fpr.shape

>(95,)

len(fpr2)

>99

The above shows that the sklearn FPR array has 95 elements, while the one I created has 99. Initially I thought that maybe increasing the detail in the FPR/TPR arrays I generate would help match the two. But it seems I already generate more points than the sklearn implementation has. Maybe looking at the actual values helps:

fpr

array([0.        , 0.        , 0.00571429, 0.00571429, 0.01142857,
       0.01142857, 0.01714286, 0.01714286, 0.02285714, 0.02285714,
       0.02857143, 0.02857143, 0.03428571, 0.03428571, 0.04      ,
       0.04      , 0.04571429, 0.04571429, 0.05142857, 0.05142857,
       0.05714286, 0.05714286, 0.06285714, 0.06285714, 0.07428571,
       0.07428571, 0.07428571, 0.08      , 0.08      , 0.09142857,
       0.09142857, 0.09714286, 0.09714286, 0.10857143, 0.10857143,
       0.12      , 0.12      , 0.14857143, 0.14857143, 0.16      ,
       0.16      , 0.17142857, 0.17142857, 0.17142857, 0.18857143,
       0.18857143, 0.21142857, 0.21142857, 0.22285714, 0.22857143,
       0.23428571, 0.23428571, 0.24      , 0.24      , 0.26285714,
       0.28571429, 0.32571429, 0.33714286, 0.34857143, 0.34857143,
       0.37714286, 0.37714286, 0.38285714, 0.38285714, 0.38857143,
       0.38857143, 0.4       , 0.4       , 0.43428571, 0.45714286,
       0.46857143, 0.62857143, 0.62857143, 0.63428571, 0.63428571,
       0.64      , 0.64      , 0.64571429, 0.65714286, 0.66285714,
       0.66285714, 0.68      , 0.69142857, 0.72      , 0.73142857,
       0.73714286, 0.74857143, 0.79428571, 0.82285714, 0.82285714,
       0.85142857, 0.86285714, 0.88      , 0.89142857, 1.        ])

np.array(fpr2)

array([1.        , 0.99428571, 0.99428571, 0.98857143, 0.97714286,
       0.96571429, 0.96571429, 0.95428571, 0.92      , 0.89714286,
       0.84571429, 0.78857143, 0.66285714, 0.63428571, 0.58857143,
       0.52      , 0.49142857, 0.48      , 0.40571429, 0.39428571,
       0.38285714, 0.37714286, 0.35428571, 0.33714286, 0.32      ,
       0.31428571, 0.29714286, 0.26285714, 0.25142857, 0.24571429,
       0.23428571, 0.22857143, 0.21142857, 0.21142857, 0.20571429,
       0.20571429, 0.2       , 0.19428571, 0.18857143, 0.18857143,
       0.17714286, 0.17714286, 0.17142857, 0.16      , 0.14857143,
       0.13714286, 0.13142857, 0.12      , 0.10857143, 0.10857143,
       0.10857143, 0.09714286, 0.09714286, 0.09714286, 0.08571429,
       0.08571429, 0.08      , 0.07428571, 0.07428571, 0.06857143,
       0.06285714, 0.05714286, 0.05714286, 0.05142857, 0.05142857,
       0.04571429, 0.04      , 0.04      , 0.04      , 0.04      ,
       0.03428571, 0.02857143, 0.02857143, 0.02857143, 0.02857143,
       0.02285714, 0.02285714, 0.02285714, 0.01714286, 0.01714286,
       0.01142857, 0.01142857, 0.01142857, 0.01142857, 0.01142857,
       0.01142857, 0.00571429, 0.00571429, 0.00571429, 0.00571429,
       0.00571429, 0.00571429, 0.00571429, 0.00571429, 0.        ,
       0.        , 0.        , 0.        , 0.        ])

In the above, I print two arrays/lists. The first one (fpr) is the sklearn array. The second one (fpr2) is the one I generated myself. The fpr2 list contains many consecutive duplicate values, whereas fpr has mostly unique values. My guess is that the sklearn (fpr, tpr) pairs contain only the unique points of the curve, whereas the (fpr2, tpr2) pairs from my code repeat the same point several times over.

What causes this? Looking at the sklearn roc_curve() method, it actually returns 3 values, and so far I have only used 2 of those. The return values go into the variables fpr, tpr, and thr. The thr one is not yet used, and is named thresholds in the sklearn docs. What are these thresholds?
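For reference, those three values presumably came from a roc_curve() call along these lines (a sketch; the exact variable names are my assumptions):

from sklearn.metrics import roc_curve

#returns the curve coordinates and the thresholds that produced them
#by default drop_intermediate=True, which drops thresholds that would not
#change the shape of the plotted curve
fpr, tpr, thr = roc_curve(y_test, pred_probs)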

thr

array([0.95088415, 0.94523744, 0.94433151, 0.88735906, 0.86820046,
       0.81286851, 0.80847662, 0.79241033, 0.78157711, 0.76377304,
       0.75724659, 0.72956179, 0.71977863, 0.70401182, 0.70160521,
       0.66863069, 0.66341944, 0.65174523, 0.65024897, 0.6342464 ,
       0.63001635, 0.61868955, 0.61479088, 0.60172207, 0.59225844,
       0.58177374, 0.58027421, 0.57885387, 0.56673186, 0.54986331,
       0.54937079, 0.54882292, 0.52469167, 0.51584338, 0.50426613,
       0.48014018, 0.47789982, 0.45232349, 0.44519259, 0.44203986,
       0.43892674, 0.43378224, 0.42655293, 0.42502826, 0.40283515,
       0.39572852, 0.34370988, 0.34049813, 0.32382773, 0.32099169,
       0.31275499, 0.31017684, 0.30953277, 0.30348576, 0.28008661,
       0.27736946, 0.24317994, 0.24314769, 0.23412169, 0.23370882,
       0.22219555, 0.22039886, 0.21542313, 0.21186501, 0.20940184,
       0.20734276, 0.19973878, 0.19716815, 0.18398295, 0.18369904,
       0.18369725, 0.14078944, 0.14070302, 0.14070218, 0.13874243,
       0.13799297, 0.13715783, 0.13399095, 0.13372175, 0.13345383,
       0.13057344, 0.1279474 , 0.1275107 , 0.1270535 , 0.12700496,
       0.12689617, 0.12680186, 0.11906236, 0.11834502, 0.11440003,
       0.10933941, 0.10860771, 0.10552423, 0.10468277, 0.0165529 ])

This array shows that the thresholds are quite different from mine, and there is no fixed step used to vary them. Unlike my attempt with uniform 1% changes, this list has both much bigger and much smaller jumps in it. Let's try my generator code with this same set of threshold values:

tpr3 = []
fpr3 = []
#do the calculations for the sklearn-provided thresholds
for threshold in thr:
    ss = np.where(pred_probs >= threshold)
    count_tp = 0
    count_fp = 0
    for i in ss[0]:
        if y_test.values[i] == 1:
            count_tp += 1
        else:
            count_fp += 1
    tpr3.append(count_tp/87.0)   #87 actual survivors in the test set
    fpr3.append(count_fp/175.0)  #175 actual non-survivors in the test set
    print("threshold="+str(threshold)+", count_tp="+str(count_tp)+", count_fp="+str(count_fp))

And with the TPR and FPR lists calculated for these thresholds, we can visualize them as well and compare against the sklearn coordinates:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr3, tpr3, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr3, tpr3))
plt.plot(fpr, tpr, color='blue', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC test')
plt.legend(loc="lower right")
plt.show()

Giving the figure:

FIGURE 5. Overlapping ROC curves with shared thresholds.

This time only one line is visible, since the two fully overlap. So I would conclude that my understanding of the ROC curve is now also validated. My guess is that sklearn prunes the threshold list in the background, keeping only the points where the plotted curve visibly changes (the roc_curve() documentation mentions a drop_intermediate parameter, on by default, that drops thresholds which would not show up on the curve). While in practice I would just use the ROC function that sklearn or whatever library I use (e.g., Spark) provides, the understanding I built here should help me make better use of their results.
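That pruning can be seen directly through the drop_intermediate parameter of roc_curve(); turning it off should give a longer threshold list. A quick sketch:

fpr_all, tpr_all, thr_all = roc_curve(y_test, pred_probs, drop_intermediate=False)
print(len(thr), len(thr_all))   #pruned vs unpruned number of threshold points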

Varying threshold vs accuracy

OK, just because this is an overly long post already, and I still cannot stop, one final piece of exploration. What happens to the accuracy as the threshold is varied? Since 0.5 seems to be the default threshold, is it always the best? In cases where we want to minimize false positives or maximize true positives, threshold optimization is perhaps most obvious. But if we just look at accuracy, how does it change? To see, I collected the accuracy for every threshold value suggested by sklearn roc_curve(), and also for the exact 0.5 threshold (which was not in the roc_curve list, but which I assume is the classifier default):

tpr3 = []
fpr3 = []
accuracies = []
accs_ths = []

#do the calculations for the sklearn thresholds
#(the exact 0.5 threshold was also added to the list separately, as mentioned above)
for threshold in thr:
    ss = np.where(pred_probs >= threshold)
    count_tp = sum(y_test.values[i] for i in ss[0])   #predicted survivors that actually survived
    count_fp = len(ss[0]) - count_tp                  #predicted survivors that did not
    my_tpr = count_tp/87.0
    my_fpr = count_fp/175.0

    ssf = np.where(pred_probs < threshold)
    count_tn = sum(1 for i in ssf[0] if y_test.values[i] == 0)   #true negatives

    tpr3.append(my_tpr)
    fpr3.append(my_fpr)
    acc = (count_tp+count_tn)/y_test.count()
    accuracies.append(acc)
    accs_ths.append((acc, threshold, count_tp, count_tn, count_tp+count_tn))
    print("threshold="+str(threshold)+", tp="+str(count_tp)+", fp="+str(count_fp), ", tn="+str(count_tn), ", acc="+str(acc))

When the accuracy is graphed against the ROC curve, this looks like:

FIGURE 6. ROC thresholds vs Accuracy.

Eyeballing this, at the very left of FIGURE 6 the accuracy equals predicting all non-survivors correctly (175/262 = 66.8%) but all survivors wrong. At the very right the accuracy equals predicting all survivors correctly but all non-survivors wrong (87/262 = 33.2%). The sweet spot is somewhere around 83% accuracy, with about 65% true positives and maybe 8% false positives. I am just eyeballing this from the graph, so don't take it too literally. The code that plots the accuracy on top of the ROC curve (FIGURE 6 above):

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr3, tpr3, color='blue', label='ROC curve (area = %0.3f)' % auc(fpr3, tpr3))
plt.plot(fpr3, accuracies, color='coral', label='Accuracy')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC test')
plt.legend(loc="lower right")
plt.show()

To get a sorted list of accuracies by threshold:

accs_ths.sort(reverse=True)
accs_ths

results in:

  accuracy,           threshold,          tp, tn,  total correct
 (0.8396946564885496, 0.5802742149381693, 58, 162, 220),
 (0.8396946564885496, 0.5667318576914661, 59, 161, 220),
 (0.8358778625954199, 0.5788538683111758, 58, 161, 219),
 (0.8358778625954199, 0.5493707899389861, 60, 159, 219),
 (0.8358778625954199, 0.5246916651561455, 61, 158, 219),
 (0.8320610687022901, 0.6342464019774986, 52, 166, 218),
 (0.8320610687022901, 0.6186895526474085, 53, 165, 218),
 (0.8320610687022901, 0.6017220679887019, 54, 164, 218),
 (0.8320610687022901, 0.5817737394188721, 56, 162, 218),
 (0.8320610687022901, 0.549863313475955, 59, 159, 218),
 (0.8320610687022901, 0.548822920002867, 60, 158, 218),
 (0.8320610687022901, 0.5042661255726298, 62, 156, 218),
 (0.8320610687022901, 0.5000000000000000, 62, 156, 218),
 (0.8282442748091603, 0.6300163468361882, 52, 165, 217),
 (0.8282442748091603, 0.6147908791071057, 53, 164, 217),
 (0.8282442748091603, 0.5158433833082536, 61, 156, 217),
 (0.8282442748091603, 0.4778998233729275, 63, 154, 217),
 (0.8244274809160306, 0.5922584447936917, 54, 162, 216),
 (0.8244274809160306, 0.4801401821434857, 62, 154, 216),
 ...

Here a number of thresholds actually give a higher score than the 0.5 I used as the reference for the classifier default. The highest scoring thresholds are 0.5802 (58.02%) and 0.5667 (56.67%). The 0.5 threshold gets 218 predictions correct, with an accuracy of 0.8320 (matching the accuracy_score() from the beginning of this post). Looking at the thresholds and accuracies more generally, the ones slightly above 0.5 seem to do a bit better here: thresholds over 0.5 but under 0.6 score best, though there is always an exception, of course (0.515 scores lower). Overall, the differences are small but interesting. Is it overfitting the test data when I tune the threshold and then evaluate against that same test data? If so, does the same concern apply to optimizing precision/recall using the ROC curve? Or is there some other reason why the classifier would use a 0.5 threshold that is not optimal here? My guess is it might be a minor artefact of over/underfitting. No idea, really.
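One way to address that overfitting worry would be to pick the threshold on a separate validation split, and only then score it once on the untouched test set. This is just a sketch of the idea, not something done above; train_test_split and the clf/X_train/y_train names are my assumptions:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

#hold out a validation set from the training data for threshold selection
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
clf.fit(X_tr, y_tr)
val_probs = clf.predict_proba(X_val)[:, 1]

#pick the threshold that maximizes accuracy on the validation set
candidates = np.linspace(0.01, 0.99, 99)
val_accs = [accuracy_score(y_val, (val_probs >= t).astype(int)) for t in candidates]
best_threshold = candidates[int(np.argmax(val_accs))]

#only then evaluate that single threshold on the held-out test set
test_probs = clf.predict_proba(X_test)[:, 1]
print(best_threshold, accuracy_score(y_test, (test_probs >= best_threshold).astype(int)))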

Area Under the Curve (AUC)

Before I forget: the title of the post was ROC/AUC, so what is the area under the curve (AUC)? The curve is the ROC curve, and the AUC is the area under that ROC curve. See FIGURE 7, where I used the epic Aseprite to paint the area under the FIGURE 5 ROC curve. Brilliant. The AUC value is this colored area as a fraction of the whole plot: the full box sums to 1 (TPR x FPR = 1 x 1 = 1), and in this case the part under the curve is the 0.877 marked in the figure legend.

FIGURE 7. Area Under the Curve.

I believe the rationale is that the bigger the colored area, or the bigger the AUC value, the better overall performance one could expect from the classifier. The bigger the area, the better the classifier separates true positives from false positives across the different threshold values.
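As a quick numeric sanity check, the AUC is just this area computed with the trapezoidal rule over the (FPR, TPR) points, which is also what sklearn's auc() does. A sketch using the lists generated above:

import numpy as np
from sklearn.metrics import auc

#trapezoidal rule over the curve points; abs() because fpr2 runs from 1 down to 0
manual_auc = abs(np.trapz(tpr2, x=fpr2))
print(manual_auc, auc(fpr2, tpr2))   #the two should match, about 0.877 here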

Uses for AUC/ROC

To me, AUC seems most useful for evaluating and comparing different machine learning algorithms, which is also what I recall seeing it used for. In such cases, the higher the AUC, the better the overall performance of the algorithm. You can then boast in your paper about having a higher overall performance metric than some other random choice.
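A typical comparison would look something like this, scoring two classifiers with roc_auc_score() on the same test data. The random forest here is just a hypothetical second choice for illustration, and X_train/y_train are assumed names:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

#train two different algorithms on the same training data
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

#compare them by AUC on the same test set
print("LR AUC:", roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1]))
print("RF AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))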

ROC I see as mostly useful for providing extra information and an overview of the TPR vs FPR trade-off at different threshold settings for a classifier. The choice and interpretation depend on the use case, of course. The usual example in this domain is a medical test, such as screening for cancer. There you want to maximize TPR so you miss as few actual cancer cases as possible, and then look for the spot on the ROC curve that balances that against the cost of the extra false positives. So you would want as high a TPR as you can afford, accept a higher FPR as the tradeoff, and the threshold would likely be set lower rather than higher.

It seems harder to find examples of optimizing for a low FPR, with the tradeoff being a lower TPR as well. Perhaps one could look into the Kaggle competitions and pick, for example, targeted advertising: for a lower FPR, you would set a higher threshold rather than a lower one. But your use case could be pretty much anything, I guess.
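Either direction can be read straight off the roc_curve() outputs, for example by picking the threshold that reaches a wanted TPR, or the one that keeps FPR under a wanted limit. A sketch using the fpr, tpr, thr arrays from above (the 0.95 and 0.05 targets are arbitrary examples):

import numpy as np

#cancer-test style: the first (highest) threshold that reaches at least 95% TPR
idx_tpr = int(np.argmax(tpr >= 0.95))
print("threshold for TPR >= 0.95:", thr[idx_tpr], "FPR there:", fpr[idx_tpr])

#targeted-advertising style: the last (lowest) threshold that keeps FPR at or below 5%
idx_fpr = int(np.where(fpr <= 0.05)[0][-1])
print("threshold for FPR <= 0.05:", thr[idx_fpr], "TPR there:", tpr[idx_fpr])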

Like I said, recently I have been looking into Spark, and some of its algorithms seem to only give out their evaluation as an AUC metric. Which is a bit odd, but I guess there is some clever reason for that. I have not looked too deeply into that aspect of Spark yet; probably I am just missing the right magical invocations to get all the other scores.

Some special cases are also discussed, for example, in the Fawcett paper. A classifier with a lower overall AUC can still reach a higher point in ROC space at some thresholds, so for a given pair of classifiers one can win at some thresholds and lose at others. Similarly, one classifier can have the higher AUC overall, and yet the other can still be the better choice for a specific use case. Sounds a bit theoretical, but interesting.

Why is it called Receiver Operating Characteristic (ROC)?

You can easily find information on ROC curve history referencing its origins in World War 2 and radar signal detection theory. The Wikipedia ROC article has some of this history, but it is quite short. As usual, Stackexchange gives a good reference. The 1953 article seems paywalled, and I do not have access. This short description describes it as measuring the ability of a radio receiver to produce quality readings, enabling the operator to distinguish between false positives and true positives.

Elsewhere I read it originated from the Pearl Harbor attack during WW2, where they tried to analyze why the radar operators failed to see the incoming attack aircraft. What do I know. The internet is full of one-liner descriptions on this topic, all circling around the same definitions but never going into much detail.

Conclusions

Well, the conclusion is that this was a way too long post on a simple term. And reading it back, it is not even that clearly written. But that is what I got. It helped me think about the topic, and clarify and verify what it really is. Good luck.