Testing Machine Learning Intensive Systems (or Self-Driving Cars) – A Look at the Uber Accident

Previously I looked at what it means to test machine learning systems, and how one might use machine learning in software testing. Most of the material I found on testing machine learning systems was academic in nature, and as such a bit lacking in practical views. Various documents on the Uber incident (fatally hitting a pedestrian/cyclist) have been published, and I had a look at those documents to find a few more insights into what it might mean to test overall systems that rely heavily on machine learning components. I call them machine-learning intensive systems. Because.

Accident Overview

There are several articles published on the accident, and the information released for it. I leave the further details and other views for those articles, while just trying to find some insights related to the testing (and development) related aspects here. However, a brief overview is always in order to set the context.

This accident was reported to have happened on 18th of March, 2018. It involved the Uber test car (a modified Volvo XC90) hitting a person walking with a bicycle. The person died as a result of the impact. This was on a specific test route that Uber used for their self-driving car experiments. There was a vehicle operator (VO) in the Uber car, whose job was to oversee the autonomous car performance, mark any events of interest (e.g., road debris, accidents, interesting objects), label objects in the recorded data, and take over control of the vehicle in case of emergency or other need (system unable to handle some situation).

The records indicate there used to be two VOs per car, one focusing more on the driving, and one more on recording events. They also indicate that before the accident, the number of VOs had been reduced to just one. The roles were combined, and an update of the car operating console was designed so that a single VO could perform both roles. The lone VO was then expected to keep a constant eye on the road, monitor everything, label the data, and perform any other tasks as needed. Use of a mobile phone by the VO was prohibited during driving, but the documents for the accident indicate the VO had been eyeing the spot where their mobile device was located, inside a slot on the dashboard. The documents also indicate the VO had several video streaming applications installed, and records from the Hulu streaming service showed video streaming occurring on the VO account at the time of the accident.

The accident itself was a result of many combined factors, where the human VO seems to have put their attention elsewhere at just the wrong time, and the automation system failed to react properly to the pedestrian / cyclist. Some points on the automation system in relation to potential failures:

  • The system kept records of each moving object / actor and their movement history, using their previous movements and position as an aid to predict their future movements. The system was further designed to discard all previous movement (position) history information when it changed the classification of an object. So no history was available to predict movement of an object / actor, if its classification changed.
  • The classification of the pedestrian that was hit changed multiple times before the crash. As a result, the system repeatedly discarded all information related to it, severely inhibiting the system from predicting the pedestrian’s movement.
  • The system had an expectation to not classify anything outside a road crossing as a pedestrian. As such, before the crash, the system continuously changed the pedestrian classification between vehicle, bicycle, or other. This was the cause of losing the movement history. The system was not designed for the possibility of someone walking on the road outside a crossing area.
  • The system had safeguards in place to stop it from reacting too aggressively. A delay of 1 second was in place to delay braking when a likely issue was identified. This delayed automatic braking even at a point where a likely crash was identified (as in the accident). The reasoning was to avoid too aggressive reactions to false positives. I guess they expected the VO to react, and to log issues for improvement.
  • Even when the danger was identified and automated braking started, it was limited to reasonable force to avoid too much impact on the VO. If this maximum braking was calculated as insufficient to avoid impact, the system would brake even less and emit an audio signal for the VO to take over. So if maximum is not enough, slow down less(?). And the maximum emergency braking force was not set very high (as far as I understand..).
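The effect of dropping history on reclassification, as described in the first bullets above, can be illustrated with a toy tracker. This is only a sketch under my own assumptions: the `Track` class, the `keep_history_on_relabel` flag, and the observation values are all hypothetical and not taken from the Uber documents.

```python
from dataclasses import dataclass, field


@dataclass
class Track:
    """One tracked object: current label plus observed (time, x, y) positions."""
    label: str
    history: list = field(default_factory=list)

    def observe(self, t, x, y, label, keep_history_on_relabel):
        # keep_history_on_relabel=False mimics the described behaviour:
        # a changed label wipes all earlier positions.
        if label != self.label and not keep_history_on_relabel:
            self.history.clear()
        self.label = label
        self.history.append((t, x, y))

    def velocity(self):
        """Average velocity over the stored history, or None if unknown."""
        if len(self.history) < 2:
            return None
        (t0, x0, y0), (t1, x1, y1) = self.history[0], self.history[-1]
        return ((x1 - x0) / (t1 - t0), (y1 - y0) / (t1 - t0))


# Hypothetical observations: an object moving sideways at 1.5 m/s while
# the classifier keeps changing its mind about what the object is.
OBSERVATIONS = [
    (0.0, 0.0, 0.0, "vehicle"),
    (1.0, 1.5, 0.0, "other"),
    (2.0, 3.0, 0.0, "bicycle"),
]


def estimate_velocity(keep_history_on_relabel):
    track = Track(label="unknown")
    for t, x, y, label in OBSERVATIONS:
        track.observe(t, x, y, label, keep_history_on_relabel)
    return track.velocity()


velocity_with_reset = estimate_velocity(keep_history_on_relabel=False)
velocity_with_history = estimate_velocity(keep_history_on_relabel=True)
```

With the reset behaviour the tracker never holds more than one position, so it has no velocity estimate at all, while the variant that keeps the history recovers the sideways movement. A test that flips the label on every observation would be one way to catch this type of design assumption.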

Before the crash, the system thus took an overly long time to identify the danger due to bad assumptions (no pedestrian outside crossing). It lost pedestrian movement history due to dropping data on classification change. It waited for 1 second from finally identifying the danger to do anything, and then initiated a slowdown rather than emergency braking. And the VO seemed to be distracted from observing the situation. After the accident, Uber has moved to address all these issues.
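To get a feel for what a 1 second delay means, here is a back-of-the-envelope calculation using basic constant-deceleration physics. The speed and deceleration values are hypothetical numbers I picked for illustration, not figures from the accident documents.

```python
def stopping_distance(speed_mps, delay_s, deceleration_mps2):
    """Distance covered during the delay plus distance to brake to a standstill."""
    travelled_during_delay = speed_mps * delay_s
    braking_distance = speed_mps ** 2 / (2 * deceleration_mps2)
    return travelled_during_delay + braking_distance


SPEED = 17.0         # m/s, hypothetical (about 61 km/h)
DECELERATION = 7.0   # m/s^2, hypothetical hard braking

without_delay = stopping_distance(SPEED, 0.0, DECELERATION)
with_delay = stopping_distance(SPEED, 1.0, DECELERATION)
extra_distance = with_delay - without_delay
```

Under these assumptions the delay alone adds as many metres as the speed in m/s (here 17 m) before braking even starts, regardless of how hard the car brakes afterwards.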

There are several other documents on various aspects of the VO, the automation system, and the environment available on the National Transportation Safety Board website for those interested. Including nice illustrations of all aspects.

This was a look at the accident and its possible causes to give some context. Next a look at the system architecture to also give some context of potential testing approaches.

Uber System Architecture

Looking at testing in any domain, understanding the system architecture is important. A look.

Software Modules

The Uber document on the topic lists the following main software modules:

  • Perception: Collects data from different sensors around the car
  • Localization: Combines detailed map data with sensor data for accurate positioning
  • Prediction: Takes Perception output as input, predicts actions for actors and objects in the environment.
  • Routing and Navigation: Uses map data, vehicle status, operational activity to determine long term routes for a given goal.
  • Motion Planning: Generates shorter term motion plans to control the vehicle in the now. Based on Perception and Prediction inputs.
  • Vehicle Control: Executes the motion plan using vehicle communication interfaces.

Hardware

The same Uber document also describes the self-driving car hardware.

The current components at the time of writing the document:

  • Light Detection and Ranging (LIDAR): Measuring distance to actors and objects, 100m+ range.
  • Cameras: multiple cameras for different distances, covering 360 degrees around the vehicle. Both near- and far-range. To identify people and objects.
  • Radar: Object detection, ranging, relative velocity of objects. Forward-, backward-, and side-facing.
  • Global Positioning System (GPS): Coarse position to support vehicle localization (positioning it), vehicle command (to use location / position for control), map data collection, satellite measurements.
  • Self-Driving Computer: A liquid-cooled local computer in the car to run all the SW modules (Perception, Prediction, Motion Planning, …)
  • Telematics: Communication with backend systems, cellular operator redundancy, etc.

Planned components (not installed back then, but in future plans..):

  • Ultrasonic Sensors: Uses echolocation to range objects. Front, back, and sides.
  • Vehicle Interface Module: Seems to be an independent backup module to safely control and stop the vehicle in case of autonomous system faults.

Functionality

Now that we have established a list of the SW and HW components, a look at their functionality.

Mapping

The system is described as using very detailed maps, including:

  • Geometry of the road and curbs
  • Drivable surface boundaries and driveways
  • Lane boundaries, including paint lines of various types
  • Bike and bus lanes, parking regions, stop lines, crosswalks
  • Traffic control signals, light sets, and lane and conflict associations (whatever that is? :))
  • Railroad crossings and trolley or railcar tracks
  • Speed limits, constraint zones, restrictions, speed bumps
  • Traffic control signage

Combined with precise location information, the system uses these detailed maps to beforehand "predict" what type of environment lies ahead, even before the Perception module has observed it. This is used to prepare for the expected road changes, anticipate speed changes, and optimize for expected motion plans. For example, when anticipating a tight turn in the road.

Perception and Prediction

The main tasks of the Perception module are described as detecting the environment, actors and objects. It uses sensor data to continuously estimate the speed, position, orientation, and other variables of the objects and actors, as a basis to make better predictions and plans about their future movement, velocity, and position.

An example given is the turn signals of other cars, which are used to predict their actions. At the same time, all the other data is also recorded and used to predict other, alternative courses for the same car, in case it does not turn even though it is signalling a turn.

While the Perception module observes the environment (collects sensor data), the Prediction component uses this, and other available, data as a basis for predicting the movement of the other actors, and changes in the environment.

The observed environment can have different types of objects and actors in it. Some are classified as fixed structures, and are expected not to move: buildings, ground, vegetation. Others are classified as more dynamic actors, and expected to move: vehicles, pedestrians, cyclists, animals.

The Prediction module makes predictions on where each of these objects is likely to move in the next 10 seconds. The predictions include multiple properties for each object and actor, such as movement, velocity, future position, and intended goal. The intended goal (or intention) is mentioned in the document, but I did not find a clear description of how this would be used. In any case, it seems plausible that the system would assign "intents" to objects, such as pedestrian crossing a street, a car turning, overtaking another car, going straight, and so on. At least these would seem useful abstractions and input to the next processing module (Motion Planning).

The Prediction module makes predictions multiple times a second to keep an updated representation available. The predictions are provided as input to the Route and Motion Planning module, including the "certainty" of those predictions. This (un)certainty is another factor that the Motion Planning module can use as input to apply more caution to any control actions.
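The Uber documents do not describe the actual prediction models, so as a stand-in, the simplest possible predictor is constant-velocity extrapolation. The sketch below is my own assumption of what "predicted position at several horizons" could look like as data; the function name and the horizon values are hypothetical.

```python
def predict_positions(x, y, vx, vy, horizons=(1.0, 2.0, 5.0, 10.0)):
    """Constant-velocity extrapolation; a stand-in for a learned predictor."""
    return {h: (x + vx * h, y + vy * h) for h in horizons}


# Hypothetical actor 6 m to the side of our lane, moving towards it at 1.5 m/s.
predicted = predict_positions(x=0.0, y=-6.0, vx=0.0, vy=1.5)
```

Even this naive model needs a velocity estimate, which in turn needs movement history. That is why the history of a tracked object matters so much for anything downstream of Prediction.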

Route and Motion Planning

Motion Planning (as far as I understand) refers to short-term movements, translating to concrete control instructions for the car. Route planning on the other hand refers to long term planning on where to go, and gives goals for the Motion Planning to guide the car to the planned route.

Motion Planning combines information from the generated route (Route Planning), perceived objects and actors (Perception), and predicted movements (Prediction). Mapping data is used for the "rules of the road", as well as any active constraints. I guess this is also combined with sensor data for more up-to-date views of the local environment (the public docs are naturally not super-detailed on everything). Using these, it creates a motion plan for the vehicle. Data from the Perception and Prediction modules is also used as input, to define the anticipated movements of other objects and actors.

A spatial buffer is defined to be kept between the vehicle and other objects in the environment. My understanding is that this refers to keeping some amount of open space between the car and environmental elements. The size of this buffer varies with variables such as autonomous vehicle speed (and properties and labels of other objects and actors I assume). To preserve the required buffer, the system may take actions such as changing lanes, braking, or stopping and waiting for the situation to clear.

The system is also described as being able to identify and track occlusions in the environment. These would be environmental elements, such as buildings or other cars, blocking a view to certain other parts of the environment. These are constantly reasoned about, and the system becomes more conservative in its decisions when occlusions are observed. It aims to be able to avoid actors coming out of occlusions at reasonable speed.

Vehicle Control

The Vehicle Control module executes trajectories provided by the Motion Planning module. It controls the vehicle through communication interfaces. Example controls include steering, braking, turn signals, throttle, and switching gears.

It also tracks any limits set for the system (or environment?), and communicates back to the operation center as needed.

Data Collection and Test Scenarios

Since my point with this "article" was to look into what it might mean to test a machine learning intensive system, I find it important to also look at what type of data is used to train the machine learning systems, and how all that data is collected. And how these are used as part of test cases (in the Uber documents they seem to call them test scenarios). Of course, such complex systems use this type of data for many different purposes besides just the machine learning part, so they are generally interesting as well.

The Uber document describes data uses including system performance analysis, quality assurance, machine teaching and testing, simulated environment creation and validation, software development, human operator training and assessment, and map building and validation.

Data Collection

Summarizing the various parts related to data collection and synthesis from the Uber descriptions, at the heart of all this is the real-world training data collected by the VOs driving around, the car and automated sensors collecting detailed data, and the VOs tagging the data. This tagging also helps further identify new scenarios, objects, and actors. The sensor data is based on the sensors I listed above in the HW section.

Additionally, the system is listed as recording:

  • telemetry (maybe refers to metrics about network? or just generally to transferring data?)
  • control signals (commands for vehicle control?)
  • Controller Area Network (CAN) messages
  • system health, such as
    • hard drive speeds
    • internal network performance
    • computer temperatures

The larger datasets are recorded in onboard (car) storage. Smaller amounts of data are transmitted in near real-time using over-the-air (OTA) interfaces over cellular networks to the Uber control center. These use multiple cellular networks for cybersecurity and resiliency purposes. The OTA data includes insights on how the vehicles are performing, where they are, and their current state.

Scenario Development

In the documents (Uber and another from the RAND corporation), the operational environment of the autonomous vehicle is referred to as the operational design domain (ODD). Defining the ODD is quite central to the development (as well as testing) of the system and training the ML algorithms, as well as the controlling logic based on those. It defines the world in which the car operates, and all the actors and objects, and their relations.

The Uber document describes using something called scenarios as test cases. Well, it mostly does not mention the words "test case", but for practical purposes this seems to be similar. Of course, this is quite a bit more complex than a traditional software test case with simple inputs and outputs, requiring description of complex real-world environments as inputs, and boundaries and profiles of accepted behaviour as outputs, rather than specific data values. These complex real-world inputs and outputs also vary over time, unlike the static input values that are typical of traditional software tests. Thus, a time-series aspect is also relevant to the inputs and outputs.

Uber describes a unified schema being used to describe the scenarios and data. Besides the collected data and learned models, other data inputs are also used, such as operational policies. Various success criteria are defined for each scenario, such as speed, distance, and description of safe behaviour.
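The Uber documents do not show the actual schema, but a scenario with success criteria evaluated over a time series could be sketched roughly as follows. All class names, fields, and threshold values here are my own hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One time step of a recorded or simulated run (all fields hypothetical)."""
    t: float          # seconds since scenario start
    ego_speed: float  # m/s
    min_gap: float    # metres to the nearest actor or object


@dataclass
class Scenario:
    name: str
    max_speed: float     # success criterion: never exceed this speed
    min_safe_gap: float  # success criterion: never get closer than this

    def evaluate(self, samples):
        """Return a list of (time, reason) violations; empty means pass."""
        violations = []
        for s in samples:
            if s.ego_speed > self.max_speed:
                violations.append((s.t, "speed"))
            if s.min_gap < self.min_safe_gap:
                violations.append((s.t, "gap"))
        return violations


scenario = Scenario("pedestrian crossing outside crosswalk",
                    max_speed=14.0, min_safe_gap=2.0)
recorded_run = [
    Sample(0.0, 13.0, 30.0),
    Sample(1.0, 13.5, 17.0),
    Sample(2.0, 9.0, 6.0),
    Sample(3.0, 2.0, 1.5),
]
violations = scenario.evaluate(recorded_run)
```

The point of the sketch is that the "expected output" is a set of bounds checked at every time step, not a single value compared at the end. Here the run violates the gap criterion at the last time step only.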

When new actors, environmental elements, or other similar items are encountered, they are recorded and tagged for further training of the autonomous system. The resulting definitions and characterization of the ODD are then used as input to improve the test scenarios and create new ones. This includes improving the test simulations, and test tracks for coverage.

Events such as large deviations between consecutive planned trajectories are recorded and automatically tagged for investigation. Simulations are used to evaluate that they are fixed, and the new scenarios are added to ML training datasets, or as "hard test cases". This seems a bit similar to the Tesla "shadow mode" I discussed earlier, just a bit more limited.
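How such a deviation is measured is not described in the documents. One simple possibility, assuming both plans are sampled at the same time steps, is the largest point-wise distance between them. The threshold value and function names below are made up for illustration.

```python
import math


def max_deviation(previous_plan, current_plan):
    """Largest point-wise distance between two plans sampled at the same times."""
    return max(math.dist(p, q) for p, q in zip(previous_plan, current_plan))


def needs_investigation(previous_plan, current_plan, threshold_m=1.0):
    # threshold_m is a hypothetical tuning value, not from the Uber documents.
    return max_deviation(previous_plan, current_plan) > threshold_m


# Plans as lists of (x, y) points in metres, hypothetical values.
plan_a = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
plan_b = [(0.0, 0.0), (5.0, 0.2), (10.0, 0.3)]  # small adjustment
plan_c = [(0.0, 0.0), (5.0, 1.5), (10.0, 3.0)]  # sudden swerve
```

Comparing `plan_a` to `plan_b` stays under the threshold, while `plan_a` to `plan_c` would be tagged. Every tagged pair is then a candidate for a new regression scenario.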

Test Coverage

Besides a general overview of the scenario development, the Uber documents do not really discuss how they handle test coverage, or what types of tests they run. There are some minor references but nothing very concrete. It is more focused on describing the overall system, and some related processes. I tried to collect here some points that I figured seemed relevant.

A key difference to more traditional software systems seems to be how these types of systems do not have a clearly defined input or output space. The interaction interfaces (API/GUI) of traditional software systems naturally define some contract for what type of input is allowed, and expected. With these it is possible to apply the traditional techniques such as category partitioning, boundary analysis, etc. When your input space is the real world and everything that can happen in it, and output space is all the possible actions in relation to all the possible environmental configurations, it gets a bit more complex. In a similar comment, Uber describes their system as requiring more testing with different variations.

Potential Test Scenarios from Uber Docs

These are just points I collected that I thought would illustrate something related to test scenarios and test coverage.

Uber describes evaluating their system performance in different common and rare scenarios, using measurements such as traffic rule violations, and vehicle dynamic attributes. This means having very few crash and unsafe scenarios available, but a large number of safe scenarios. That is, when the scenarios are based on real-world use and data, there are commonly many more "safe" scenarios available than "un-safe" ones, due to the rarity of crashes, accidents and other problem cases vs normal operations.

With only this type of highly biased dataset available, I expect there is a need to synthesize more extensive test sets, or use other methods to test and develop such systems more extensively. The definition of safety also does not seem to be a binary decision; rather, there can be different scales of "safe", depending on the safety attribute. For example, the safety margin for how far the autonomous vehicle should keep from other vehicles is a continuous variable, not a binary value. Some variables might of course have binary representations, such as avoiding hitting a pedestrian, or running a red light. But even the pedestrian metric may have similar distance measures, impact measures, etc. So I guess it’s a bit more complicated than just safe or not safe.

Dataset augmentation and imbalanced datasets are common issues in developing and training ML models. However, those techniques are (to my understanding) based on a single clear goal such as classification of an object, not on complex output such as overall driving control and its relation to the real world. Thus, I would expect to use overall scenario augmentation type of approaches, more holistic than a simple classifier (which in its own might be part of the system).

Some properties I found in the Uber documents (as I discussed above), referring to potential examples of test requirements:

  • Movement of objects in relation to vehicle.

  • Inability of the system to classify a pedestrian correctly if not near a crosswalk.

  • Inability of the system to predict pedestrian path correctly when not classified as pedestrian.

  • Overly strict assumptions made, such as cyclist not moving across lanes.

  • Losing location history of tracked objects and actors if their classification changed

  • Uber defines test coverage requirements based on collected map data and tags.

  • Map data predicting that the upcoming environment would be of a specific type (e.g., left curve), but it has changed and observations differ

  • Another car signals turning left but other predictors do not predict that, and the other car may not actually turn left.

  • Certainty of classifications.

  • Occlusions in the environment.

Abstracting

Looking at the above examples, trying to abstract some more generic concepts that would serve as a potentially useful basis:

  • Listing of known objects / actors

  • Listing of labels for different types of objects / actors

  • Assumptions made about specific types of objects / actors

  • Properties of objects / actors

  • Interaction constraints of objects and actors

  • Probabilities of classifications for different objects / actors and labels

  • Functionality when faced with unknown objects / actors

The above list may be lacking in general details that would cover the different types of systems, or even the Uber example, but I find it provides an insight into how this is heavily about probabilities, uncertainty, and preparing for, and handling, that uncertainty.

For different types of systems, the actual objects, actors, labels and properties would likely change. To illustrate these a bit more concretely with the autonomous car example:

  • Objects / Actors, and their Properties / Labels

    • Our car
      • Speed, Position, Orientation,
      • Accelerating, Slowing down,
      • Intended goal (turn left, drive forward, change lane, stop, …)
      • Predicted location in 1s, 2s, 5s, …
      • Distance to all other actors / objects
      • Right of way
    • Other car, moving
      • Same as "Our car"
    • Other car, parked
      • Probability of leaving parking mode
    • Pedestrian, moving or stopped (parked)
      • Same as "other car"
      • Crossing street
      • On pedestrian path
    • Cyclist
      • Same as "other car"
      • Crossing street
      • On bicycle path
    • Other object
      • Moving or static
      • Same as others above
    • Traffic light
      • Current light (green, yellow, red)
      • On / off / blinking
    • Traffic sign
      • Type / Meaning
        • Set speed, stop, yield, no parking, …
        • Long term / local effect
    • Building
      • Size, shape, location
    • Occlusion
      • Predicted time of object / actor coming from occlusion
    • Unknown object, moving or parked
      • Much like the other car etc but maybe with unknown goals
  • Interaction constraints

    • Safety margin (distance to our car and other actors) before triggering some action
    • Actions triggered in different constraint states / boundaries

Something that seems important is also the ability to reason about previously unknown objects and actors to the extent possible. For example, a moving object that does not seem to fit any known category, but has known movement history, speed, and other variables. Perhaps there would be a more abstract category of a moving object, or some hierarchy of such categories. The same goes for any of these objects or actors changing their classifications and goals, and how their long-term history should be taken into account overall to make future predictions.

In a different "machine learning intensive" system (not autonomous cars), one might use a different set of properties, actors, objects, etc. But it seems some similar considerations could be useful.

Possible Test Strategies

Once the domain (the "ODD") is properly defined, as above, it seems many traditional testing techniques could be applied. In the Uber documents, they describe performing architecture analysis to identify all potential failure points. They divided faults into three levels: faults in the self driving system on its own, faults in relation to the environment (e.g., at intersections), and faults related to the operational design domain, such as unknown obstacles the system does not recognize (or misclassifies?). This could be another way to categorize a more specific system, or inspiration for other similar systems.

Another part of this type of system could be related to the human aspect. This is somewhat discussed also in the Uber docs, in relation to operational situations for the system: a distracted operator, and a fatigued operator. They have some functionality in place (especially after the accident) to monitor operator alertness via in-car dashcam and attached analysis. However, I will not go into these here.

Testing ML Components

For testing the ML components, I discussed various techniques in a previous blog post. This includes approaches such as metamorphic testing, adversarial testing, and testing with reference inputs. In autonomous cars, this might be visual classifiers (e.g., convolutional networks), or path prediction models (recurrent neural nets etc.), or something else.
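As a reminder of what metamorphic testing means in practice: instead of checking the output against a known correct label, you check a relation between outputs for related inputs. A minimal sketch, where the `classify` function is a toy stand-in I made up (a real test would call the trained model), and the relation is "mirroring the image must not change the label":

```python
def classify(image):
    """Toy stand-in for a trained model: labels a binary image by blob shape."""
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    if not rows:
        return "nothing"
    height = rows[-1] - rows[0] + 1
    width = cols[-1] - cols[0] + 1
    return "pedestrian" if height > width else "vehicle"


def mirror(image):
    """Flip the image horizontally."""
    return [list(reversed(row)) for row in image]


def metamorphic_check(image):
    # Metamorphic relation: a mirrored scene should get the same label.
    return classify(image) == classify(mirror(image))


tall = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
wide = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
```

The benefit is that no labelled data is needed for the check itself, so the same relation can be run over any amount of recorded sensor data.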

Testing ML Intensive Systems

As for the set of properties I listed above, it seems once these have been defined, using traditional testing techniques should be quite useful:

  • combinatorial testing: combine different objects / actors, with different properties, labels, etc. observe the system behaviour in relation to the set constraints (e.g., safety limits).
  • boundary analysis: apply to the combinations and constraints from the previous bullet. for example, probabilities at different values. might require some work to define interesting sets of probability boundaries, or ways to explore the (combined) probability spaces. but not that different in the end from more traditional testing.
  • model-based testing: use the above type of variables to express the system state, use a test generator to build extensive test sets that can be used to cover combinations, but also transitions between states and their combinations over time.
  • fault-injection testing: the system likely uses data from multiple different data sources, including numerous different types of sensors. different types of faults in these may have different types of impact on the ML classifier outputs, overall system state, etc. fault-injection testing in all these elements can help surface such cases. think Boeing 737 MAX from recent history, where a single sensor failure contributed to multiple crashes with hundreds of lives lost.
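The first two bullets above (combinatorial testing and boundary analysis) can be sketched together with a plain cartesian product. The actor types, locations, and the 0.5 confidence threshold below are hypothetical values I picked based on the abstractions listed earlier, not something from the Uber documents.

```python
from itertools import product

ACTORS = ["pedestrian", "cyclist", "vehicle", "unknown"]
LOCATIONS = ["crosswalk", "mid-block", "bicycle lane"]
# Boundary values around a hypothetical 0.5 classification threshold.
CONFIDENCES = [0.0, 0.49, 0.5, 0.51, 1.0]


def generate_cases():
    """Yield every combination of actor, location, and confidence."""
    for actor, location, confidence in product(ACTORS, LOCATIONS, CONFIDENCES):
        yield {"actor": actor, "location": location, "confidence": confidence}


cases = list(generate_cases())
```

Each generated case would then be turned into a simulator scenario and checked against the set constraints. The full product grows quickly as variables are added, which is where pairwise (or other t-way) combinatorial tools help to keep the test set manageable.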

The real trick may be in combining these into actual, complete, test scenarios for unit tests, integration tests, simulators, test tracks, and real-world tests.

Regarding the last bullet above (fault-injection testing), the Uber documents discuss this from the angle of fault-injection training – injecting faults into the system and seeing how the vehicle operator reacts to them. Training them how they should react. This sounds similar to fault-injection testing, and I would expect that they would have also applied the same scenarios more broadly. However, I could not find mention of this.
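On the testing side, a fault-injection test for sensor data could look something like the sketch below. The fusion logic, sensor names, fault types, and distances are all toy assumptions of mine; the idea is only to show the shape of such a test: take a nominal input, inject one fault at a time, and check that the safety-relevant decision still holds.

```python
def fuse_distance(readings):
    """Toy fusion: most cautious (smallest) valid distance, None if no data."""
    valid = [d for d in readings.values() if d is not None]
    return min(valid) if valid else None


def should_brake(readings, safe_distance=20.0):
    distance = fuse_distance(readings)
    # With no usable data the toy system fails safe and brakes.
    return distance is None or distance < safe_distance


def inject_fault(readings, sensor, fault):
    """Return a copy of the readings with one sensor made faulty."""
    faulty = dict(readings)
    if fault == "dropout":
        faulty[sensor] = None
    elif fault == "stuck_far":
        faulty[sensor] = 200.0  # sensor frozen on an old, distant reading
    else:
        raise ValueError(f"unknown fault: {fault}")
    return faulty


# Hypothetical nominal readings in metres: an obstacle about 15 m ahead.
nominal = {"lidar": 15.0, "radar": 16.0, "camera": 14.0}
results = {
    (sensor, fault): should_brake(inject_fault(nominal, sensor, fault))
    for sensor in nominal
    for fault in ("dropout", "stuck_far")
}
all_dropped = {sensor: None for sensor in nominal}
```

In this toy setup every single-sensor fault still results in braking, because the remaining sensors agree on a close obstacle. A design that trusted one sensor alone would fail the same checks.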

Regarding general failures, and when they happen in real use, the same fault models can also be used to prepare for and mitigate actual operational faults. The Uber docs also discuss this viewpoint, as the system having a set of identified fault conditions and mitigations when these happen. These are identified by redundant systems and overall monitoring across the system. Example faults:

  • Primary compute power failure
  • Loss of primary compute or motion planning timeout
  • Sensor data delay
  • Door open during driving

General Safety Procedures

Volvo Safety Features

Besides the Uber self-driving technology, the documents show Volvo cars having safety features of their own, an Advanced Driver Assistance System (ADAS), including an automated emergency braking system named "City Safety". It contains a forward collision warning system, alerting the driver about imminent collision and automatically applying brakes when it observes a potentially dangerous situation. This also includes pedestrian, cyclist, and large animal detection components. However, these were turned off during autonomous driving mode, and only active in manual mode. Simulation tests conducted by the Volvo Group showed how the ADAS features would have been able to avoid the collision (17 times out of 20) or significantly reduce collision speed and impact (remaining 3 times). In post-crash changes, the ADAS system has been activated at all times (along with many other fixes to all the issues discussed here).

Information Sharing and Other Domains

The documents on reviews and investigations after the accident include comparisons to safety cultures in many other (safety-critical) domains: Nuclear Power, Transportation (Rail), Aviation, Oil and Gas, and Maritime. While some are quite specific to the domains, and related to higher level process and cultural aspects, there seem to be many quite interesting points one could build on also for the autonomous driving domain. Or other similar ones. Safety has many higher level shared aspects across domains. Regarding my look for testing related aspects, in many cases replacing "safety" with "QA" would also seem to provide useful insights.

One practical example is how (at least) avionics and transportation (rail) domains have processes in place to collect, analyze, and share information on unsafe conditions and events observed. This would seem like a useful way to identify also relevant test scenarios for testing products in the autonomous driving domain. Given how much effort is required for extensive collection of such data, how expensive and dangerous it can be, the benefits seem quite obvious to everyone.

Related to this, Uber discusses shared metrics for evaluating progress of their development. These include disengagements and self-driving miles travelled. While they have used these to signal progress both internally and externally, they also note that such metrics can easily lead to "gaming the system" at the expense of safety or a working system. For example, in becoming overly conservative in avoiding disengagements, or in using inconsistent definitions of the metrics across developers / systems.

Uber discusses the need for work on creating more broadly usable safety performance metrics with academic and industry partners. They list how these metrics should be:

  • Specific to different development stages (development, testing, deployment)
  • Specific to different operational design domains, scenarios and capabilities
  • Have comparable metrics for human drivers
  • Applied in validation environments and scenarios for autonomous cars with other autonomous cars from different companies

The Uber safety approach document also refers to more general work towards an automotive safety framework by the RAND corporation. This includes topics such as building a shared taxonomy to form a basis for discussion and sharing across vendors. It also discusses safety metrics, their use across vendors, and the possible issues in use and possible gaming of such metrics. And many other related aspects of a cross-vendor safety program. Interesting. Seems like lots of work to do there as well.

Conclusions

This was an overly long look at the documents from the Uber accident. I was thinking of just looking at the testing aspect briefly, but I guess it is hard to discuss it properly without setting the whole background and overall context. Overall, the summary is not that complicated. I just get carried away with writing too many details.

However, I found writing this down helped me reason better about the difference between more traditional software intensive systems and these types of new machine-learning intensive systems. I would summarize it as the need to consider everything in terms of probabilities, the unknown elements in the input and output, constraints over everything, complexity of identifying all the objects and actors, and their possible intents, and all the relations between all possibilities. With probabilities (or uncertainty). But once the domain analysis is done well, and the inputs and outputs are understood, I find the traditional testing techniques such as combinatorial testing, model-based testing, category partitioning, boundary analysis, and fault-injection testing would give a good basis. But it might take a bit broader insight to be able to apply them efficiently.

As for the Uber approach, it is interesting. I previously discussed the Tesla approach of collecting data from fleets of deployed consumer vehicles. And features such as the Tesla shadow mode, continuously running in the background as the human driver drives, always evaluating whether each autonomous decision the system would have made would have been similar to what action the human took, or how it differs from that taken by the actual human driver. Not specifically trained VO’s as in the Uber case, but usual consumer drivers (so Tesla customers at work helping to improve the product).

The Tesla approach seems much more scalable in general. It might also generalize better as opposed to Uber aiming for very specific routes and building super detailed maps of just those areas. Creating and maintaining such super-detailed maps seems like a challenging task. Perhaps if the companies have very good automated tools to take care of it, it can be easier to manage and scale. I don’t know if Tesla does some similar mapping with the help of their consumer fleet, but would be interesting to see similar documents and compare.

As for other types of machine learning (intensive) systems, there are many variations, such as those using IoT sensors and data to provide a service. Those are maybe not as open-worlded in all possible input spaces. However, it would seem to me that many of the considerations and approaches I discussed here could be applied. Probabilities, (un-)certainties, domain characterizations, relations, etc. Remains interesting to see, perhaps I will find a chance to try someday.. 🙂

Robot Framework by Examples

Introduction

Robot Framework (RF) is a popular keyword driven test framework (at least in Finland it seems to be..). Recently had to look into it again for some potential work related opportunities. Have to say open source is great but the docs could use improvements..

I made a few examples for the next time I come looking:

Installing

To install RF itself, in Python pip does the job. Installing RF itself, along with Selenium keywords, and Selenium Webdriver for those keywords:

pip3 install robotframework
pip3 install selenium
pip3 install robotframework-seleniumlibrary

Using Selenium WebDriver as an example here, a Selenium driver for the selected browser is needed. For Chrome, one can be downloaded from the Chrome website itself. Similarly for other browsers on their respective sites. The installed driver needs to be on the search path for the operating system. On macOS, this is as simple as adding it to the path. Assuming the driver is in the current directory:

PATH=.:$PATH

So just the dot, which works as long as the driver file is in the working directory when running the tests.
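A quick way to check that the driver really is found before running any tests is sketched below. It uses only the Python standard library. The executable name `chromedriver` is an assumption based on the Chrome example, and the helper name `find_driver` is made up for this illustration.

```python
import os
import shutil


def find_driver(name="chromedriver", extra_dirs=(".",)):
    """Return the full path of the driver executable, or None if not found.

    extra_dirs are searched before PATH, mirroring PATH=.:$PATH.
    """
    search_path = os.pathsep.join(list(extra_dirs) + [os.environ.get("PATH", "")])
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    location = find_driver()
    if location is None:
        print("chromedriver not found; add its directory to PATH")
    else:
        print("using driver at", location)
```

Running this from the directory the tests are started in shows whether the dot entry on the path is enough.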

In PyCharm, the PATH can also be similarly added to run configuration environment variables.

General RF Script Structure

RF script elements are separated by minimum of 2 space indentation. Both indenting test steps under a test, and also to separate keywords and parameters. There is also the pipe separated format which might look a bit fancier, if you like. Sections are identified by three stars *** and a pre-defined name for the section.

The following examples illustrate.

Examples

Built-in Keywords / Logging to console

The built-in keywords are available without needing to import a specific library. Rather they are part of the built-in library. Simple example of logging some statement to console:

The .robot script (hello.robot in this case):

*** Test Cases ***
Say Hello
    Log To Console    Hello Donkey
    No Operation
    Comment           Hello ${bob}

The built-in keyword "Log To Console" writes the given parameter to the console. A hello world equivalent. To run the test, we can either write code to invoke the RF runner from Python or use RF command line tools. Python runner example:

from robot.running import TestSuiteBuilder
from robot.api import ResultWriter

#https://robot-framework.readthedocs.io/en/3.0/autodoc/robot.running.html
suite = TestSuiteBuilder().build("./hello.robot")
result = suite.run(output="test_output.xml")
#ResultWriter(result).write_results(report='report.html', log="log.html")
ResultWriter("test_output.xml").write_results(report='report.html', log="log.html")

The "hello.robot" in above is the name of the test script file listed above also.

The strangest thing (for me) here is the writing of the log file. The docs suggest to use the first approach I commented out above. The ResultWriter with the results object as a parameter. This generates the report.html and the log.html.

The problem is, the log.html is lacking all the prints, keywords, and test execution logs. Later on the same docs state that to get the actual logs, you have to pass in the name of the XML file that was created by the suite.run() method. This is the uncommented approach in the above code. Since the results object is also generated from this call, why does it not give the proper log? Oh dear. I don’t understand.

Commandline runner example:

robot hello.robot

This seems to automatically generate an appropriate log file (including execution and keyword trace). There are also a number of command line options available, for all the properties I discuss next using the Python API. Maybe the general / preferred approach? But somehow I always end up needing to do my own executors to customize and integrate with everything, so..

Finally on logging, Robot Framework actually captures the whole stdout and stderr, so statements like print() get written to the RF log and not to actual console. I found this to be quite annoying and resulting in overly verbose logs with all the RF boilerplate/overhead. There is a StackOverflow answer on how to circumvent this though, from the RF author himself. I guess I could likely write my own keyword to use that if needed to get more log customization, but seems a bit complicated.
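One common trick for getting around captured output in Python is to write to `sys.__stdout__`, which keeps pointing at the original stream even when `sys.stdout` has been replaced. The sketch below assumes this is good enough for RF as well. It is not a copy of the linked StackOverflow answer, and the function name `print_to_real_console` is made up here.

```python
import sys


def print_to_real_console(message, stream=None):
    """Write message to the original stdout, bypassing a replaced sys.stdout."""
    # sys.__stdout__ holds the stdout object Python started with, so it is
    # unaffected when a framework swaps sys.stdout to capture output.
    target = stream if stream is not None else sys.__stdout__
    target.write(str(message) + "\n")
    target.flush()


if __name__ == "__main__":
    print_to_real_console("this goes to the actual console")
```

Saved in a hypothetical console_lib.py and imported with the Library setting, the function should be usable as the keyword Print To Real Console. RF also has its own `robot.api.logger.console()` function for the same purpose, which is probably the cleaner choice inside keyword libraries.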

Tags and Critical Tests

RF tags are something that can be used to filter and group tests. One use is to define some tests as "critical". If a critical test fails, the suite is considered failed.

Example of non-critical test filtering. First, defining two tests:

*** Test Cases ***
Say Hello Critical
    [Tags]            crit
    Log To Console    Hello Critical Donkey
    No Operation
    Comment           Hello ${bob}

Say Hello Non-Critical
    [Tags]            non-crit
    Log To Console    Hello Nice Donkey
    No Operation
    Comment           Hello ${bob}

Running them, while filtering with wildcard:

from robot.running import TestSuiteBuilder
from robot.api import ResultWriter

#https://robot-framework.readthedocs.io/en/3.0/autodoc/robot.running.html
suite = TestSuiteBuilder().build("./noncritical.robot")
result = suite.run(output="test_output.xml", noncritical="*crit")
ResultWriter("test_output.xml").write_results(report='report.html', log="log.html")

The above classifies all tests that have tags matching the pattern "*crit" as non-critical (the * is a simple wildcard here, not a regexp). In this case, it includes both the tags "crit" and "non-crit", which would likely be a bit wrong. So the report for this actually shows 2 non-critical tests.

The same execution with a non-existent non-critical tag:

from robot.running import TestSuiteBuilder
from robot.api import ResultWriter

#https://robot-framework.readthedocs.io/en/3.0/autodoc/robot.running.html
suite = TestSuiteBuilder().build("./noncritical.robot")
#this tag does not exist in the given suite, so no non-critical tests should be listed in report
result = suite.run(noncritical="non")
ResultWriter(result).write_results(report='report.html', log="log.html")

This runs all tests as critical, since no test has a tag of "non". To finally fix it, the filter should be exactly "non-crit". This would not match "crit" but would match exactly "non-crit".
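The three filters discussed above can be compared side by side with a small sketch. It uses Python's `fnmatch` as a stand-in for RF's own tag matcher, which is an assumption: the RF matcher is also case-insensitive, but the wildcard behaviour should be the same for these tags.

```python
from fnmatch import fnmatchcase

tags = ["crit", "non-crit"]


def matching_tags(pattern, tags):
    """Return the tags that a glob-style pattern would select."""
    return [tag for tag in tags if fnmatchcase(tag, pattern)]


for pattern in ["*crit", "non", "non-crit"]:
    print(pattern, "->", matching_tags(pattern, tags))
```

Only the exact "non-crit" pattern selects one tag without also picking up the other.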

Filtering / Selecting Tests

There are also the options include and exclude. To include or exclude (surprise) tests with matching tags from execution.

A couple of tests with two different tags (as before):

*** Test Cases ***
Say Hello Critical
    [Tags]            crit
    Log To Console    Hello Critical Donkey
    No Operation
    Comment           Hello ${bob}

Say Hello Non-Critical
    [Tags]            non-crit
    Log To Console    Hello Nice Donkey
    No Operation
    Comment           Hello ${bob}

Run tests, include with wildcard:

from robot.running import TestSuiteBuilder
from robot.api import ResultWriter
from io import StringIO

#https://robot-framework.readthedocs.io/en/3.0/autodoc/robot.running.html
suite = TestSuiteBuilder().build("./include.robot")
stdout = StringIO()
result = suite.run(include="*crit", stdout=stdout)
ResultWriter(result).write_results(report='report.html', log="log.html")
output = stdout.getvalue()
print(output)

This includes both of the two tests defined above, since the tags match. If the filter was "non", nothing would match, and an error is produced since there are no tests to run.

Creating new Keywords from Existing Keywords

Besides using somebody else’s keywords, custom keywords can be composed from existing keywords. Example test file:

*** Settings ***
Resource    simple_keywords.robot

*** Test Cases ***
Run A Google Search
    Search for      chrome    emoji wars
    Sleep           10s
    Close All Browsers

The included (by the Resource keyword above) file simple_keywords.robot:

*** Settings ***
Library  SeleniumLibrary

*** Keywords ***
Search for
    [Arguments]    ${browser_type}    ${search_string}
    Open browser    http://google.com/   ${browser_type}
    Press Keys      name:q    ${search_string}+ENTER

So the keyword is defined above in a separate file, with arguments defined using the [Arguments] notation. Followed by the argument names. Which are then referenced in following keywords, Open Browser and Press Keys, imported from SeleniumLibrary. Simple enough.

Selenium Basics on RF

Due to popularity of Selenium Webdriver and testing of web applications, there is a specific RF library with keywords built for it. This was installed way up in Installing section.

Basic example:

*** Settings ***
Library  SeleniumLibrary

*** Test Cases ***
Run A Google Search
    Open browser    http://google.com/   Chrome
    Press Keys      name:q    emoji wars+ENTER
    Sleep           10s
    Close All Browsers

Run it as always before:

from robot.running import TestSuiteBuilder
import robot

#https://robot-framework.readthedocs.io/en/3.0/autodoc/robot.running.html
suite = TestSuiteBuilder().build("./google_search.robot")
result = suite.run()

This should open up the Chrome browser, load Google in it, do a basic search, and close the browser windows. Assuming it finds the Chrome driver also listed in the Installing section.

Creating New Keywords in Python

Besides building keywords as composites of existing ones, building new ones with Python code is an option.

Example test file:

*** Settings ***
Library         google_search_lib.py    chrome

*** Test Cases ***
Run A Google Search
    Search for      emoji wars
    Sleep           10s
    Close

The above references google_search_lib.py, where the implementation is:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class google_search_lib(object):
    driver = None

    @classmethod
    def get_driver(cls, browser):
        if cls.driver is not None:
            return cls.driver
        if (browser.lower()) == "chrome":
            # Selenium 3 style driver path; newer Selenium 4 releases expect a Service object instead
            cls.driver = webdriver.Chrome("../chromedriver")
        return cls.driver

    def __init__(self, browser):
        print("creating..")
        driver = google_search_lib.get_driver(browser)
        self.driver = driver
        self.wait = WebDriverWait(driver, 10)

    def search_for(self, term):
        print("open")
        self.driver.get("http://google.com/")
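        # find_element(By.NAME, ...) works in Selenium 3 and 4; find_element_by_name was removed in Selenium 4
        search_box = self.driver.find_element(By.NAME, "q")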
        search_box.send_keys(term)
        search_box.send_keys(Keys.RETURN)

    def close(self):
        self.driver.quit()

Defining the library import name is a bit tricky. If the module and the class have the same name (as with google_search_lib here), giving just that one name is enough.

Again, running it as before:

from robot.running import TestSuiteBuilder

suite = TestSuiteBuilder().build("./google_search.robot")
result = suite.run()

If you think about this for a moment, there is some strange magic here. Why is the classmethod there? How is state managed within tests / suites? I borrowed the initial code for this example from this fine tutorial. It does not discuss the use of this decorator, but it seems to me that it is used to share the driver object during test execution.

Mapping Python Functions to Keywords

The mapping works simply by taking the function name and replacing underscores with spaces. So in the above google_search_lib.py example, the Search For keyword maps to the search_for() function. The Close keyword maps to the close() function. Much complex, eh?
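The naming rule can be illustrated with a few lines of Python. This is only a sketch of the idea (ignore case, spaces and underscores when comparing names), not RF's actual lookup code. The names `normalize`, `find_function` and `DemoLibrary` are made up for the example.

```python
def normalize(name):
    """Lower-case a name and drop spaces and underscores."""
    return name.lower().replace(" ", "").replace("_", "")


def find_function(keyword, library):
    """Return the public method of library whose name matches keyword, else None."""
    wanted = normalize(keyword)
    for attr in dir(library):
        if attr.startswith("_"):
            continue  # private names are not treated as keywords
        candidate = getattr(library, attr)
        if callable(candidate) and normalize(attr) == wanted:
            return candidate
    return None


class DemoLibrary:
    def search_for(self, term):
        return "searching for " + term

    def close(self):
        return "closed"


lib = DemoLibrary()
print(find_function("Search For", lib)("emoji wars"))
print(find_function("Close", lib)())
```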

Test Setup and Teardown

Test setup and teardown are some basic functionality. This is supported in RF by specific keywords in the Settings section.

Example test file:

*** Settings ***
Library         google_search_lib.py    chrome
Test Setup      Log To Console    Starting a test...
Test Teardown   Close

*** Test Cases ***
Run A Google Search
    Search for      emoji wars
    Sleep           10s

The referenced google_search_lib.py file is the same as above. This includes defining the close function / keyword used in Test Teardown.

Run it as usual:

from robot.running import TestSuiteBuilder

suite = TestSuiteBuilder().build("./google_search.robot")
result = suite.run()

You can define only a single keyword for each of setup and teardown. The RF docs suggest writing your own custom keyword, composing multiple actions as needed.

How the library class is instantiated also depends on how the scope of the library is defined. It seems to get a bit tricky to manage the resources, since sometimes the instances are different in the setup, teardown, tests, or in all tests. I think this is one of the reasons for using the classmethod decorator in the tutorial example I cited.
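The RF user guide describes a class attribute, `ROBOT_LIBRARY_SCOPE`, for controlling how often a library class is instantiated. A minimal sketch of using it is shown below. The class name `shared_state_lib` and its keyword are made up for illustration, and whether this would remove the need for the classmethod in the Selenium example above is not verified here.

```python
class shared_state_lib(object):
    # Ask Robot Framework to create one instance for the whole test run.
    # The other documented values are "TEST SUITE" and "TEST CASE".
    ROBOT_LIBRARY_SCOPE = "GLOBAL"

    def __init__(self):
        self.calls = 0

    def count_call(self):
        """Usable as the keyword Count Call; the counter lives on the instance."""
        self.calls += 1
        return self.calls
```

With a per-test scope each test would get a fresh instance and the counter would start from zero again, which is the kind of difference that makes shared resources like a browser driver tricky.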

There would be much more such as variables in tests. And RF also supports the BDD (Gherkin) syntax in addition to the keywords I showed here. But the underlying framework is quite the same in both cases.

Anyway, that’s all I am writing on today. I find RF is quite straightforward once you get the idea, and not too complex to use even with the docs not being so straightforward. Overall, a very simple concept, and I guess one that the author(s) have managed to build a reasonable community around. Which I guess is what makes it useful and potentially successful.

I personally prefer writing software over putting keywords after one another, but for writing tests I guess this is one useful method. And maybe there is an art in itself to writing good, suitably abstracted, reusable yet concrete keywords?

That’s all, folks,…

Machine Learning in Software Testing

Early 2019 Edition

Introduction

Software testing has not really changed all that much in the past decades. Machine learning on the other hand is a very rapidly evolving technology being adopted all over the place. So what can it bring to software testing?

Back in 2018 (so about a year ago from now) I did a review of machine learning (ML) (and deep learning (DL)) applications in Network Analysis and Software Testing. After spending some more time learning and trying ML/DL in practice, this is an update on the ML for testing part, reflecting on my own learnings and some developments over this past year. Another interesting part would be testing ML systems. I will get to that in another post.

In last year’s review, I focused on several topics. A recent academic study (Durelli2019) in this area also lists a number of topics. This includes topics such as "learning test oracles", which basically translates to learning a model of a system behaviour based on some observations or other data about the software behaviour. Last year I included this under the name of specification mining. In practice, I have found such learned behavioral models to be of limited use, and have not seen general uptake anywhere in practice. In this review I focus on fewer topics I find more convincing for practical use.

I illustrate these techniques with this nice pic I made:

Topics

In this pic, the "Magic ML Oracle" is just a ML model, or a system using such a model. It learns from a set of inputs during the training phase. In the figure above this could be some bug reports linked to components (file handling, user interface, network, …). In the prediction phase it runs as a classifier, predicting something such as which component a reported issue should be assigned to, how fault-prone an analysis is (e.g., how to focus testing), or how tests and specs are linked (in case of missing links).

The topics I cover mainly relate to using machine learning to analyze various test related artefacts as in the figure above. One example of this is the bug report classifier I built previously. Since most of these ML techniques are quite general, just applied to software testing, ideas from broader ML applications could be useful here as well.

Specifically, software testing is not necessarily that different from other software engineering activities. For example, Microsoft performed an extensive study (Kim2017) on their data scientists and their work in software engineering teams. This work includes bug and performance analysis and prioritization, as well as customer feedback analysis, and various other quality (assurance) related topics.

As an example of concrete ML application to broader SW engineering, (Gu2018) maps natural language queries to source code to enable code search. To train a DL model for this, one recurrent neural network (RNN) based model is built for the code description (from source comments), and another one for the source code. The output of these two is a numerical feature vector. Cosine similarity is a measure used to compare how far apart two such vectors are, and here it is used as the training loss function. This is a nice trick to train a model to map source code constructs to natural language "constructs", enabling mapping short queries to new code in similar ways. It is also nicely described in the morning paper. I see no reason why such queries and mappings would not work for test related searches and code/documents as well. In fact, many of the techniques I list in following sections use quite similar approaches.
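Cosine similarity itself is simple to compute. The sketch below uses plain Python lists with made-up numbers standing in for the feature vectors produced by the two models. It shows only the similarity measure, not the training setup of (Gu2018).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Made-up feature vectors standing in for the two model outputs
code_vector = [0.2, 0.9, 0.1]
query_vector = [0.25, 0.8, 0.0]
unrelated_vector = [0.9, 0.0, 0.4]

print(cosine_similarity(code_vector, query_vector))
print(cosine_similarity(code_vector, unrelated_vector))
```

A value close to 1 means the vectors point in nearly the same direction, so the first pair scores clearly higher than the second.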

Like I said, I am focusing on a smaller set of more practical topics than last year in this still-overly-long post. The overall idea of how to apply these types of techniques in testing in general should not be too different though. This time, I categorize topics into test prioritization, bug report localization, defect prediction, and traceability analysis. One section to go over each, one to rule over them all.

Test Prioritization

As (software) organizations and projects grow over time, their codebase tends to grow with them. This would lead to also having more tests to cover that codebase (hopefully…). Running all possible test cases all the time in such a scenario is not always possible or cost-efficient, as it starts to take more and more time and resources. The approach to selecting a subset of the tests to execute has different names: test prioritization, test suite optimization, test minimization, …

The goal is to cover as much of the fault-prone areas as possible with fewer tests, such as in this completely made up image of mine to illustrate the topic:

Prioritized coverage

Consider the coverage % in the above to reflect, for example, covering changes since the last time tests were run. The aim is to check that the changes did not break anything, as fast and efficiently as possible.

An industrial test prioritization system used at Google (in 2017) is described in (Memon2017). This does not directly discuss using machine learning (although mentions it as future plan for the data). However, I find it interesting for general data-analysis of testing related data as a basis for test prioritization. It also works to provide a basis for a set of features for ML algorithms, as understanding and tuning your data is really the basis for all ML applications.

The goal in this (Memon2017) case is two-fold: better utilizing test resources (focus on potentially failing tests) and provide feedback to the developers about their commits. The aim is not 100% accurate predictions but rather focusing automated test execution and providing the developer with feedback such as "this commit is 95% more likely to cause breakage due to the code being touched by 5 developers in the past 10 days, and being written in Java". The developers can use this feedback to seek further assurance in additional reviews, more testing, static analysis, and so on.

Some of the interesting features/findings described in (Memon2017):

  • Only about 1% of the tests ever failed during their lifetime.
  • Thus about 99% speedup would be possible if right tests could be identified.
  • Use of dependency distance as a feature: What other component depends on the changed component, and through how many other components
  • Test targets further away from the change are much less likely to fail. So dependency distance seems like a useful prediction (feature) metric. They used a threshold of 10 for their codebase, which might vary by project but the idea likely holds.
  • Files/targets modified more often are more likely to cause breakages.
  • File type affects likelihood of breakage.
  • User/tool id affects likelihood of breakage.
  • Number of developers having worked on single file in a short time affects likelihood of breakage. More developers means higher likelihood to break.
  • The number of test targets affected by a change varies greatly, maybe requiring different treatment.
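The dependency distance feature in the list above can be sketched as a breadth-first search over a reverse dependency graph. The graph, the component names and the `_test` suffix convention below are all made up for illustration. The sketch shows the general idea and is not the implementation described in (Memon2017).

```python
from collections import deque


def dependency_distances(reverse_deps, changed):
    """Breadth-first search from the changed component.

    reverse_deps maps a component to the components that directly depend on it.
    Returns {component: number of dependency hops from the change}.
    """
    distances = {changed: 0}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in reverse_deps.get(current, []):
            if dependent not in distances:
                distances[dependent] = distances[current] + 1
                queue.append(dependent)
    return distances


def select_tests(distances, max_distance):
    """Keep only the test targets within max_distance hops of the change."""
    return sorted(name for name, d in distances.items()
                  if name.endswith("_test") and d <= max_distance)


# Hypothetical dependency graph in which "core" was changed
reverse_deps = {
    "core": ["net", "core_test"],
    "net": ["ui", "net_test"],
    "ui": ["ui_test"],
}
distances = dependency_distances(reverse_deps, "core")
print(distances)
print(select_tests(distances, max_distance=2))
```

The distance can be used either as a hard cut-off, as here, or as one numeric feature among others for a model.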

A similar set of features is presented in (Bhagwan2018):

  • Developer experience: Developer time in the organization and project
  • Code ownership: More developers changing files/components cause more bugs
  • Code hotspots: Specific parts of code that cause issues when changed
  • Commit complexity: Number of changes, changed files, review comments in a single commit. More equals more bugs.

A test prioritization approach taken at Salesforce.com is described in (Busjaeger2016). They use five types of features:

  • code coverage
  • text path similarity
  • text content similarity
  • test failure history
  • test age

In (Busjaeger2016), the similarity scores are based on TF-IDF scores and their cosine similarity calculation. TF-IDF simply weights frequency of words in a document against the frequency of the same word in all other documents, to identify most specific terms for document types. The features are fed into a support-vector model to rank tests to execute first. In their (Busjaeger2016) experience, about 3% of overall tests is required to reach about 75% coverage. From the predicted tests, about every 5th is found to be causing a failure.
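The TF-IDF weighting and cosine similarity scoring described above can be sketched in plain Python. The smoothed IDF formula used here is one common variant and an assumption, not necessarily the one used in the study. The change description and test contents are made up.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one {term: weight} dict per document (each a list of tokens)."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            # term frequency times a smoothed inverse document frequency
            term: (count / len(doc)) * math.log((1 + n_docs) / (1 + doc_freq[term]))
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


# Made-up change description and test contents
change = "fix login session timeout handling".split()
tests = {
    "test_login_session": "login session timeout is renewed".split(),
    "test_report_export": "export report as pdf file".split(),
}
names = list(tests)
vectors = tf_idf_vectors([change] + [tests[name] for name in names])
scores = {name: cosine(vectors[0], vectors[i + 1]) for i, name in enumerate(names)}
ranked = sorted(names, key=scores.get, reverse=True)
print(ranked)
```

In a real setup the score would be one feature among the others listed, not the ranking by itself.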

I find the above provide good examples of data analysis, and basis for defining ML features.

In the (Durelli2019) review, several studies are listed under test prioritization, but these mostly do not strike me as very realistic examples of ML applications. However, one interesting approach is (Spieker2017), which investigates using reinforcement learning for test prioritization. It uses only three features: execution time (duration), last execution (whatever that means..), and failure history. These seem a rather simple set of features to build a complex model, and it seems likely to me that a simple model would also work here. The results in (Spieker2017) are presented as good but not investigated in depth so hard to say from just that. However, I did find the approach to present some interesting ideas in relation to this:

  • Continuous integration systems constantly execute the test suites so you will have a lot of constantly updating data about test suites, execution, results available
  • Continuously updating the model over time based on the last N test runs
  • Using a higher exploration rate over full suite to bootstrap the model, lowering over time when it has learned but not setting to zero
  • Using "test case" as model state, and assigning it a priority as an action
  • Listing real "open-data" industry-based datasets to evaluate prioritization ML models on

I would be interested to see how well a simple model, such as Naive Bayes, weighting the previous pass/fails and some pattern over their probability would work. But from the paper it is hard to tell. In any case, the points above would be interesting to explore further.
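As a rough idea of what such a simple baseline could look like, the sketch below scores each test by a recency-weighted failure rate with add-one smoothing. It is neither Naive Bayes proper nor anything from (Spieker2017). The decay factor and the example histories are made up.

```python
def failure_score(history, decay=0.8):
    """Recency-weighted failure rate with add-one smoothing.

    history is a list of booleans, oldest first; True means the test failed.
    """
    weighted_failures = 0.0
    total_weight = 0.0
    weight = 1.0
    for failed in reversed(history):  # the newest run gets weight 1.0
        total_weight += weight
        if failed:
            weighted_failures += weight
        weight *= decay
    # add-one smoothing keeps tests without history away from 0 and 1
    return (weighted_failures + 1.0) / (total_weight + 2.0)


def prioritize(histories):
    """Return test names, most failure-prone first."""
    return sorted(histories, key=lambda name: failure_score(histories[name]),
                  reverse=True)


histories = {
    "test_stable": [False, False, False, False],
    "test_flaky_recently": [False, False, True, True],
    "test_broke_long_ago": [True, False, False, False],
    "test_new": [],
}
print(prioritize(histories))
```

A test with no history gets a score of 0.5, so new tests are run fairly early, which loosely resembles the exploration idea in the list above.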

A Thought (maybe Two)

I assume ML has been applied to test prioritization, just not so much documented. For example, I expect Google would have taken their studies further and used ML for this as discussed in their report (Memon2017). Test prioritization seems like a suitably complex problem, with lots of easily accessible data, and with a clear payoff in sight, to apply ML. The more tests you have, the more you need to execute, the more data you get, the more potential benefit.

In this as in many advanced research topics, I guess the "killer app" might come by integrating all this into a test system / product as a black-box. This would enable everyone to make use of it without requiring to learn all the "ML in test" details outside their core business. Same I guess applies to the other topics I cover in the following sections.

Bug Report Localization

Bug report localization (in this case anyway) involves taking a bug report and finding the component or other part of the software that the report is most likely to concern. Various approaches aim to automate this process by using machine learning algorithms. My previous example is one example of building one.

I made a pretty picture to illustrate this:

Localization Oracle

Typically a bug report is expressed in natural language (at least partially, with code snippets embedded). These are fed to the machine learning classifier (magic oracle in the pic above), which assigns it to 1-N potential components. Component granularity and other details may vary but this is the general idea.

For this, code structural elements used as features include (Tufano2018):

  • sequences of abstract syntax tree (AST) nodes
  • sequences of control-flow graph (CFG) nodes
  • bytecode representations. This seems interesting in mapping the code to fewer shared elements (opcodes)

Other features include (Lam2017):

  • camel-case splitting source code (n-grams would seem a natural fit too)
  • time since a file was previously changed when fixing a bug
  • how many bugs overall have been fixed in a file
  • similarity between a bug report and previous bug reports (and what were they assigned to)

Besides using such specific code structures as inputs, also specific pre-processing steps are taken. These include (Tufano2018, Li2018):

  • replacing constant values with their types,
  • splitting camel-case,
  • removing low-level detailed abstract syntax tree (AST) nodes,
  • filtering out methods less than 10 lines long,
  • regular expressions to remove code format characters, and to identify code snippets embedded into the bug report.

An industrial case study on bug localization from Ericsson is presented in (Johnson2016). Topic models built with Latent Dirichlet Allocation (LDA) are learned from the set of bug reports. These are used to assign topic weights to bug reports based on the bug report text. The assigned weights are compared to the learned topic distribution for components, and the higher the match of distributions in the report vs learned component model, the higher the probability to assign the bug report to that component.

Vector Space Model (VSM) was used as a baseline comparison in many cases I found. This is based on TF-IDF scores (vectors) calculated for a document. Similarity between a bug report and source code files in VSM is calculated as a cosine similarity between their TF-IDF vectors. Revised Vector Space Model (rVSM) (Zhou2012) is a refinement of VSM that weights larger documents more, reasoning that bugs are more often found in larger source files. (Zhou2012) also adds weighting from similarity with previous bug reports.

Building on rVSM, (Lam2017) uses an auto-encoder neural network on TF-IDF weighted document terms to map different terms with similar meaning together for more accurate bug localization. Similarly, the "DeepSum" work (Li2018) uses an auto-encoder to summarize bug reports, and to compare their TF-IDF distance with cosine similarity. To me this use of auto-encoders seems like trying to re-invent word-embeddings for no obvious gain, but probably I miss something. After auto-encoding, (Lam2017) combines a set of features using a deep neural network (multi-layer perceptron (MLP) it seems) for final probability evaluation. In any case, word-embedding style mapping of words together in a smaller dimension is found in these works as others.

A Thought or Two

I am a bit surprised not to see much work in applying RNN type networks such as LSTM and GRU into these topics, since they are a great fit for processing textual documents. In my experience they are also quite powerful compared to traditional machine learning methods.

I think this type of bug report localization has practical relevance mainly for big companies with large product teams and customer bases, and complex processes (support levels, etc). This is in domains like telcos, from which the only clear industry report I listed here is from Ericsson (Johnson2016). Something I have found limiting these types of applications in practice is also the need for cross-domain vision to combine these topics and expertise. People seem often quite narrowly focused on specific areas. Black-box integration with common tools might help, again.

Defect Prediction

Software defect prediction refers to predicting which parts of the software are most likely to contain faults. Sometimes this is also referred to as fault proneness analysis. The aim is to provide additional information to help focus testing efforts. This is actually very similar to the bug report localization I discussed above, but with the goal of predicting where currently unknown bugs might appear (vs localizing existing issue reports).

An overall review of this area is presented in (Malhotra2015), showing an extensive use of traditional ML approaches (random forest, decision trees, support vector machines, etc) and traditional source code metrics (lines of code, cyclomatic complexity, etc.) as features. These studies show reasonably good accuracies, ranging from 75% to 93%.

However, another broad review on these approaches and their effectiveness is presented in (Zhou2018). It shows how simply using larger module size to predict higher fault proneness would give equal or better accuracy in many cases. This matches my experience from many contexts: keeping it simple is often very effective. But on the other hand, finding that simplicity can be the real challenge, and you can learn a lot by trying different approaches.

More recently, deep learning based approaches have also been applied to this domain. Deep Belief Nets (DBN) are applied in (Wang2018) to generate features from source code AST, and combined with more traditional source code metrics. The presentation on DBNs in (Wang2018) is a bit unclear to me, but it seems very similar to an MLP. The output of this layer is then termed (as far as I understand) a "semantic feature vector". I looked a bit into the difference of DBN vs MLP, and found some practical discussion at a Keras issue. Make what you will of it. Do let me know if you figure it out better than I did (what is the difference in using an MLP style fully connected dense layer here vs DBN).

An earlier version of the (Wang2018) work is refined and further explored using convolutional neural networks (CNNs) in (Li2017). In this case, a word2vec word-embedding layer is used as the first layer, and trained on the source and AST vocabulary. This is fed into a 1-dimensional CNN, which is one of the popular deep learning network types for text processing. After passing through this part of the network, the output feature vector is merged with a set of the more traditional source metrics (lines of code, etc). The merged vector is then passed through the final network layers, ending in a single-node output layer that gives the probability prediction.

Illustration of this network:

Metrics based model

To address class imbalance (more "clean" than "buggy" files), (Li2017) uses duplication of the minority class instances. They also compare to traditional metrics as well as the DBN from (Wang2018) and DBN+, which combines the traditional features with the DBN "semantic" features. As usual for research papers, they report getting better results with the CNN+ version. Everyone seems to do that, yet perfection seems never to be achieved, or even nearly. Strange.
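Duplicating minority class instances is simple enough to sketch in a few lines. The snippet below is only my illustration of the general idea, not the code from (Li2017); the function name, the file names, and the choice to duplicate at random until the classes are equal in size are all my assumptions.

```python
import random

def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly picked samples of the smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    # every class is grown to the size of the largest one
    target = max(len(items) for items in by_label.values())
    out_samples, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# hypothetical file names, 4 clean files vs 1 buggy file
files = ["a.java", "b.java", "c.java", "d.java", "e.java"]
labels = ["clean", "clean", "clean", "clean", "buggy"]
balanced_files, balanced_labels = oversample_minority(files, labels)
print(balanced_labels.count("clean"), balanced_labels.count("buggy"))
```

If you try something like this, apply it to the training split only. Duplicates that leak into the test split make the results look better than they are.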

A Thought

The evolution in defect prediction seems to be from traditional classifiers with traditional "hand-crafted" (source metrics) features to deep-learning generated and AST-based features in combination with traditional metrics. Again, I did not see any RNN based deep-learning classifier applications, although I would expect they should be quite well suited for this type of analysis. Perhaps next time.

Traceability Analysis

Despite everyone being all Agile now, heavier processes such as requirements traceability can still be needed. Especially for complex enough systems, and ones with heavy regulatory- or standards-based compliance requirements. Medical, telco, automotive, … In the real world, such traces may not always be documented, and sometimes it is of interest to find them.

A line of work exploring the use of deep learning for automating the generation of traceability links between software artefacts is in (Guo2017, Rath2018). These are from the same major software engineering conference (ICSE) over two following years, with some author overlap. So there is some traceability in this work too, heh-heh (such joke, much non-academic). The first one (Guo2017) aims to link requirements to design and test artefacts in the train control domain. The second one aims to link code submissions to issues/bug reports on Github.

Requirements documents

Using recurrent neural networks (RNN) to link requirements documents to other documents is investigated in (Guo2017). I covered this work to some extent already last year, but let's see if I can add something with what I learned since.

Use cases for this as mentioned in (Guo2017):

  • Finding new, missing (undocumented) links between artefacts.
  • Train on a set of existing data for existing projects, apply to find links within a new project. This seems like a form of transfer learning, and is not explored in the paper. It focuses on the first bullet.

I find the approach used in (Guo2017) interesting, linking together two recurrent neural network (RNN) layers from parallel input branches for natural language processing (NLP):

Requirements linking NN

There are two identical input branches (top of figure above). One for the requirements documents, and one for the target document for which the link is assessed. Let’s pretend the target is a test document to stay relevant. A pair of documents is fed to different input branches of the network, and the network outputs a probability of these two documents being linked.

In ML you typically try different model configurations and hyperparameters to find what works best. In (Guo2017) they tried different types of layers and parameters. The figure above shows what they found best for their task. See (Guo2017) for the experiment details for other parameter values. Here, a bi-directional gated recurrent unit (bi-GRU) layer is used to process each document into a feature vector.

When the requirements document and the target document have been transformed by this to a vector representation, they are fed into a pointwise multiplication layer (mul) and to a vector subtraction layer (sub). In Keras these would seem to correspond to the Multiply and Subtract merge layers. These merge layers are intended to describe the vector difference direction (mul) and distance (sub) across dimensions. A dense layer with sigmoid activation is used to integrate these two merges, and the final output is given by a 2-neuron softmax layer (linked/not linked probability).
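To make the data flow after the two bi-GRU branches a bit more concrete, here is a plain-Python sketch of the merge part only. It is my own illustration under several assumptions: the two input vectors stand in for the bi-GRU outputs, the weights are random instead of trained, the vector and layer sizes are made up, and all the names are mine rather than from (Guo2017).

```python
import math
import random

def dense(vector, weights, biases):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, biases)]

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def softmax(values):
    shifted = [math.exp(v - max(values)) for v in values]
    total = sum(shifted)
    return [s / total for s in shifted]

def link_probability(req_vec, target_vec, hidden_w, hidden_b, out_w, out_b):
    mul = [r * t for r, t in zip(req_vec, target_vec)]  # pointwise multiplication merge
    sub = [r - t for r, t in zip(req_vec, target_vec)]  # pointwise subtraction merge
    merged = mul + sub                                   # both merges side by side
    hidden = sigmoid(dense(merged, hidden_w, hidden_b))  # dense layer integrating the merges
    return softmax(dense(hidden, out_w, out_b))          # 2 outputs that sum to 1

# made-up sizes and random (untrained) weights, just to run the data through
rng = random.Random(42)
dim, hidden_size = 4, 3
req_vec = [rng.uniform(-1, 1) for _ in range(dim)]
target_vec = [rng.uniform(-1, 1) for _ in range(dim)]
hidden_w = [[rng.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(hidden_size)]
hidden_b = [0.0] * hidden_size
out_w = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(2)]
out_b = [0.0, 0.0]

probs = link_probability(req_vec, target_vec, hidden_w, hidden_b, out_w, out_b)
print(probs)
```

With random weights the two output numbers mean nothing, of course. The point is only the shape of the computation: two document vectors in, two merges, one dense layer, and a two-way softmax out.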

For word-embeddings they try both a domain specific (train-control systems in this case) embedding with 50-dimensions, and a 300-dimensional one combining the domain-specific data with a Wikipedia text dump. They found the domain specific one works better, speculating it to be due to domain-specific use of words.

Since this prediction can produce many different possibilities in a ranked order, simple accuracy of the top choice is not "accurate" itself as an evaluation metric. For evaluating the results, (Guo2017) uses mean average precision (MAP) as a metric. The MAP achieved in (Guo2017) is reported up to 83.4%. The numbers seem relatively good, although I would need to play with the results to really understand everything in relation to this metric.
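For my own understanding, this is how I would compute MAP using the common textbook definition. Whether (Guo2017) computes it in exactly this way is an assumption on my part, and the document ids in the example are made up.

```python
def average_precision(ranked_items, relevant_items):
    """Average of precision@rank over the ranks where a relevant item is found."""
    relevant = set(relevant_items)
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # relevant items that are never retrieved contribute zero
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """queries: list of (ranked candidate list, list of truly linked items)."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in queries]
    return sum(scores) / len(scores)

# two made-up requirements, each with a ranked list of candidate test documents
queries = [
    (["t1", "t2", "t3", "t4"], ["t1", "t3"]),  # hits at ranks 1 and 3: (1/1 + 2/3) / 2
    (["t5", "t6", "t7"], ["t6"]),              # single hit at rank 2: 1/2
]
map_score = mean_average_precision(queries)
print(round(map_score, 4))
```

So a correct link at rank 1 scores better than the same link at rank 3, which is what plain top-1 accuracy cannot express.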

An interesting point from (Guo2017) is a way to address class imbalances. The set of requirements and other documents that have valid links is a small fraction of the overall set. So the imbalance between the true and false labels is big. They address this by selecting an equal set of true and false labels for an epoch, and switching the set of false label items at the start of each epoch. So all the training data is processed, while a balance is held in each epoch. Nice.
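A minimal sketch of how I read this per-epoch balancing is below. The function and variable names are mine, and walking through a shuffled pool of negatives in slices is just one way to "switch" the false items, not necessarily how (Guo2017) does it. It assumes there are at least as many negatives as positives.

```python
import random

def balanced_epochs(positives, negatives, n_epochs, seed=0):
    """Yield one training set per epoch: all positives plus an equally sized, changing slice of negatives."""
    rng = random.Random(seed)
    pool = list(negatives)
    rng.shuffle(pool)
    k = len(positives)
    start = 0
    for _ in range(n_epochs):
        if start + k > len(pool):
            # ran out of unseen negatives, reshuffle and start over
            rng.shuffle(pool)
            start = 0
        chosen = pool[start:start + k]
        start += k
        epoch = [(p, 1) for p in positives] + [(n, 0) for n in chosen]
        rng.shuffle(epoch)
        yield epoch

# made-up ids: 3 linked pairs, 9 unlinked pairs
positives = ["link1", "link2", "link3"]
negatives = ["nolink%d" % i for i in range(9)]
epochs = list(balanced_epochs(positives, negatives, n_epochs=3))
for epoch in epochs:
    print(sorted(item for item, label in epoch if label == 0))
```

In this toy case each epoch sees 3 true and 3 false pairs, and after 3 epochs every false pair has been used once.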

Github Issues

Traceability for linking code commits to bug tracker issues and improvement tickets ("bugs" and "improvement" in project Jira) is presented in (Rath2018). The studied projects are 6 open-source projects written in Java. Unlike the previous study on requirements linking, this study does not use deep-learning based approaches but rather manual feature engineering and more traditional ML classifiers (decision trees, naive Bayes, random forest).

This is about mapping issue reports to commits that fix those issues:

Github Issue Linking

Besides more traditional features, (Rath2018) also makes use of time related aspects as extra filtering features. A training set is built by finding commit messages that reference affected issue IDs. The features used include:

  • Timestamp of commit. Has to be later than creation timestamp for potential issue it could be linked to. Has to be inside given timeframe since issue was marked resolved.
  • Closest commit before analyzed commit, and its linked issues.
  • Closest commit after analyzed commit, and its linked issues.
  • Committer id
  • Reporter id
  • Textual similarity of commit source/message and issue text. TF-IDF weighted word- and ngram-counts.
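The time related filtering in the first bullet could be sketched roughly as follows. This is my reading of the idea, not code from (Rath2018): the function name is mine, the 7 day window is a made-up value, and treating the window as a distance in either direction from the resolution time is my interpretation.

```python
from datetime import datetime, timedelta

def is_candidate_issue(commit_time, issue_created, issue_resolved, window=timedelta(days=7)):
    """Return True if the issue is a plausible link target for the commit, judged by time only."""
    # a commit cannot fix an issue that did not exist yet
    if commit_time < issue_created:
        return False
    # for resolved issues, the commit has to be close enough to the resolution time
    if issue_resolved is not None and abs(commit_time - issue_resolved) > window:
        return False
    return True

created = datetime(2018, 1, 1)
resolved = datetime(2018, 1, 10)
print(is_candidate_issue(datetime(2018, 1, 9), created, resolved))    # close to resolution
print(is_candidate_issue(datetime(2017, 12, 31), created, resolved))  # before issue creation
print(is_candidate_issue(datetime(2018, 3, 1), created, resolved))    # far outside the window
```

A filter like this cuts the number of commit-issue pairs the classifier has to score, before any of the text similarity features come into play.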

The study in (Rath2018) looks at two different types of analysis for the accuracy of the ML classifier trained. In the first case they try to "reconstruct known links", and in the second case "construct unknown links". They further consider two different scenarios: recommending links for commits, and fully automated link generation. For assistance, their goal is to have the correct link tag in the top 3 suggestions. The automated tagging scenario requires the first predicted tag to be correct.

Not surprisingly, the top 3 approach has better results as it gives the classifier more freedom and leeway. Their results are reported with up to 95%+ recall but with a precision of around 30%. This seems to be in line with what I saw when I tried to build my issue categorization classifier. The first result may not always be correct but many good ones are in the top (and with too many possibilities, even the "correct" one might be a bit ambiguous).
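One way to see why high recall and low precision go together here is simple arithmetic. The sketch below assumes each commit has exactly one correct link and the classifier always gives exactly three suggestions. Those are my simplifying assumptions, and I have not checked that (Rath2018) computes the numbers this way.

```python
def top_k_precision_recall(n_commits, n_hits, k=3):
    """Precision and recall when every commit gets k suggestions and has one true link."""
    suggested = n_commits * k          # total number of suggested links
    precision = n_hits / suggested     # share of suggestions that are correct
    recall = n_hits / n_commits        # share of true links that were suggested
    return precision, recall

# made-up numbers: 100 commits, the true link is in the top 3 for 95 of them
precision, recall = top_k_precision_recall(n_commits=100, n_hits=95, k=3)
print("precision %.3f, recall %.3f" % (precision, recall))
```

Under these assumptions precision can never exceed one third, since at most one of the three suggestions per commit can be right.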

The second use case of constructing previously unknown links sounds to me like it should be very similar in implementation to the first one, but it appears not. The main difference comes from there being large numbers of commits that do not actually address a specific Jira issue or ticket. The canonical example given is a refactoring commit. The obvious (in hindsight) result seems to state you are more likely to find a link if one is known to exist (case 1) vs finding one if it might not exist at all (case 2) :).

A Thought or Two

The point of the requirements linking approach finding the domain-specific word-embeddings better is interesting. In my previous LSTM bug predictor, I found domain specific training helps in a similar way, although in that case also combining with the pre-trained word-embeddings worked nicely as well. Of course, I used a large pre-trained GloVe embedding for that and did not train on Wikipedia myself. And used GloVe vs Word2Vec but I would not expect a big difference.

However, the domain-specific embeddings performance sounds similar to ELMo, BERT, and other recent developments in context-sensitive embeddings. By training only on the domain-specific corpus, you likely get more context-sensitive embeddings for the domain. Maybe the train-control domain in (Guo2017) has more specific vocabulary, or some other attributes that make the smaller domain-specific embedding alone better? Or maybe the type of embedding and its training data makes the difference? No idea. Here’s hoping ELMo style contextual embeddings are made easy to add to Keras models soon, so I can more broadly experiment with those as well. In my obvious summary, I guess it is always better to try different options for different data and models.

Parting Notes

I tried to cover some different aspects of ML applications in software testing. The ones I covered seem to have quite a lot in common. In some sense, they are all mapping documents together. The set of features are also quite common, "traditional" source code metrics along with NLP features. Many specific metrics have also been developed as I listed above, such as modification and modifier (commit author) counts. Deep learning approaches are used to some extent, but it still seems to be making its way in this domain.

Besides what I covered, there are of course other approaches to apply ML to SW testing. I covered some last year, and (Durelli2019) covers much more from an academic perspective. But I found the ones I covered here to be a rather representative set of the ones I consider closest to practical today. If you have further ideas, happy to hear.

In general, I have not seen much of ML applied in meaningful ways to software testing. One approach I have used in past is to use ML as a tool for learning about a test network and its services (Kanstren2017). I am not sure if that really qualifies for a ML application to software testing, since it investigated properties of the test network itself and its services, not the process of testing. Perhaps the generalization of that is in "using machine learning with testing technologies". This would be different from applying ML to testing itself, as well as different from testing ML applications. Have to think about that.

Next I guess I will see if/when I have some time to look at the testing ML applications part. With all the hype on self-driving cars and everything else, that should be interesting.

See, I made this nice but too small text picture of the three facets of ML and SW Testing I listed above:

Test vs ML facets

References

R. Baghwan et al., "Orca: Differential Bug Localization in Large-Scale Services", 13th USENIX Symposium on Operating Systems Design and Implementation, 2018.

B. Busjaeger, T. Xie, "Learning for Test Prioritization: An Industrial Case Study", 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE), 2016.

N. DiGiuseppe, J.A. Jones, "Concept-Based Failure Clustering", ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE), 2012.

V. H. S. Durelli et al., "Machine Learning Applied to Software Testing: A Systematic Mapping Study", IEEE Transactions on Reliability, 2019.

X. Gu, H. Zhang, S. Kim, "Deep code search", 40th International Conference on Software Engineering (ICSE), 2018.

J. Guo, J. Cheng, J. Cleland-Huang, "Semantically Enhanced Software Traceability Using Deep Learning Techniques", 39th IEEE/ACM International Conference on Software Engineering (ICSE), 2017.

L. Johnson, et al., "Automatic Localization of Bugs to Faulty Components in Large Scale Software Systems Using Bayesian Classification", IEEE International Conference on Software Quality, Reliability and Security (QRS), 2017.

T. Kanstren, "Experiences in Testing and Analysing Data Intensive Systems", IEEE International Conference on Software Quality, Reliability and Security (QRS, industry track), 2017.

M. Kim, et al., "Data Scientists in Software Teams: State of the Art and Challenges", IEEE Transactions on Software Engineering, vol. 44, no. 11, 2018.

A. N. Lam, A. T. Nguyen, H. A. Nguyen, T. N. Nguyen, "Bug Localization with Combination of Deep Learning and Information Retrieval", IEEE International Conference on Program Comprehension, 2017.

J. Li, P. He, J. Zhu, M. R. Lyu, "Software Defect Prediction via Convolutional Neural Network", IEEE International Conference on Software Quality, Reliability and Security (QRS), 2017.

X. Li et al., "Unsupervised deep bug report summarization", 26th Conference on Program Comprehension (ICPC), 2018.

R. Malhotra, "A systematic review of machine learning techniques for software fault prediction", Applied Soft Computing, vol. 27, 2015.

A. Memon et al., "Taming Google-scale continuous testing", 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), 2017.

M. Rath, J. Rendall, J.L.C Guo, J. Cleland-Huang, P. Mäder, "Traceability in the wild: Automatically Augmenting Incomplete Trace Links", 40th IEEE/ACM International Conference on Software Engineering (ICSE), 2018.

M. Tufano et al., "Deep learning similarities from different representations of source code", 15th International Conference on Mining Software Repositories (MSR), 2018.

S. Wang, T. Liu, J. Nam, L. Tan, "Deep Semantic Feature Learning for Software Defect Prediction", IEEE Transactions on Software Engineering, 2018.

J. Zhou, H. Zhang, D. Lo, "Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports", International Conference on Software Engineering (ICSE), 2012.

Y. Zhou et al., "How Far We Have Progressed in the Journey? An Examination of Cross-Project Defect Prediction", ACM Transactions on Software Engineering and Methodology, vol. 27, no. 1, 2018.

Predicting issue categories on Github

Practical examples of applying machine learning seem to be a bit difficult to find. So I tried to create one for a presentation I was doing on testing and data analytics. I made a review of works in the area, and just chose one to illustrate. This one tries to predict a target category to assign for an issue report. I used ARM mBed OS as a test target since it has issues available on Github and there were some people who work with it attending the presentation.

This demo “service” I created works by first training a predictive model based on a set of previous issue reports. I downloaded the reports from the issue repository. The amount of data available there was so small, I just downloaded the issues manually using the Github API that let me download the data for 100 issues at once. Automating the download should be pretty easy if needed. The amount of data is small, and there are a large number of categories to predict, so not the best for results, but serves as an example to illustrate the concept.

And no, there is no deep learning involved here, so not quite that trendy. I don’t think it is all that necessary for this purpose or this size of data. It could work better, of course. If you do try, post the code so we can play as well.

The Github issues API allows me to download the issues in batches. For example, to download page 12 of closed issues, with 100 issues per page, the URL to request is https://api.github.com/repos/ARMmbed/mbed-os/issues?state=closed&page=12&per_page=100. The API seems to cut it down to 100 even if using bigger values than 100. Or I just didn’t quite use it right, whichever. The API docs describe the parameters quite clearly. I downloaded open and closed issues separately, even if I did not use the separation in any meaningful way in the end.
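I did the download by hand, but automating it could look something like the sketch below. This is not the code I used. The function names and the max_pages limit are made up for illustration, and it only uses the state, page and per_page parameters from the URL above.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com/repos/ARMmbed/mbed-os/issues"

def issues_url(state, page, per_page=100):
    """Build the URL for one page of issues in the given state ("open" or "closed")."""
    query = urllib.parse.urlencode({"state": state, "page": page, "per_page": per_page})
    return API_ROOT + "?" + query

def download_issues(state, max_pages=50):
    """Fetch pages one by one until an empty page comes back (max_pages is an arbitrary safety limit)."""
    issues = []
    for page in range(1, max_pages + 1):
        with urllib.request.urlopen(issues_url(state, page)) as response:
            batch = json.load(response)
        if not batch:
            break
        issues.extend(batch)
    return issues

# only builds the URL, no network access needed for this line
print(issues_url("closed", 12))
```

Calling download_issues("closed") would then do the actual requests. Note that unauthenticated requests to the Github API are rate limited, so for bigger repositories some pauses or an access token would likely be needed.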

The code here is all in Python. The final classifier/prediction services code is available on my Github repository.

First build a set of stopwords to do some cleaning on the issue descriptions:

from string import punctuation
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
stop_words = stop_words.union(set(punctuation))
stop_words.update(["''", "--", "**"])

The above code uses the common NLTK stopwords, a set of punctuation symbols, and a few commonly occurring symbol combinations I found in the data. Since later on I clean it up with another regular expression, probably just the NLTK stopwords would suffice here as well.

To preprocess the issue descriptions before applying machine learning algorithms:

import re
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

def preprocess_report(body_o):
	#clean issue body text. leave only alphabetical and numerical characters and some specials such as +.,:/\
	#raw strings keep the backslashes in the patterns unambiguous
	body = re.sub(r'[^A-Za-z0-9 /\\_+.,:\n]+', '', body_o)
	# replace URL separators with space so the parts of the url become separate words
	body = re.sub(r'[/\\]', ' ', body)
	# finally lemmatize all words for the analysis
	lemmatizer = WordNetLemmatizer()
	# text tokens are basis for the features
	text_tokens = [lemmatizer.lemmatize(word) for word in word_tokenize(body.lower()) if word not in stop_words]
	return text_tokens

The above code is intended to remove all but standard alphanumeric characters from the text, remove stop words, and tokenize the remaining text into separate words. It also splits URLs into parts as separate words. The lemmatization changes known words into their base forms (e.g., “car” and “cars” become “car”). This just makes it easier for the machine learning algorithm to match words together. Another option is stemming, but lemmatization produces more human-friendly words so I use that.

I stored the downloaded issues as JSON files (as Github API gives) in the data directory. To read all these filenames for processing:

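import glob

#names of files containing closed and open issues (at time of download)
closed_files = glob.glob("data/*-closed-*")
#assumption: open issue files are named with "-open-" the same way. reusing the "-closed-" pattern here would load every closed issue twice
open_files = glob.glob("data/*-open-*")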

To process those files, I need to pick only the issues with an assigned “component” value. This value is the training target label. The algorithm is trained to predict this “component” value from the issue description, so without this label, the piece of data is not useful for training.

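import json

#collected across all processed files: token lists, their target labels, and a running count
all_text_tokens = []
component_labels = []
total = 0

def process_files(files):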
	'''
	process the given set of files by collecting issue body text and labels.
	also cleans and lemmatizes the body text

	:param files: names of files to process
	:return: nothing
	'''
	global total

	for filename in files:
		with open(filename, encoding="utf-8") as json_data:
			print(filename)
			all_issues = json.load(json_data)
			for issue in all_issues:
				labels = issue["labels"]
				for label in labels:
					if label["name"].startswith("component:"):
						name = label["name"]
						body_o = issue["body"]
						text_tokens = preprocess_report(body_o)
						all_text_tokens.append((text_tokens))
						#component_labels are prediction targets
						component_labels.append(name)
						total += 1

process_files(closed_files)
process_files(open_files)
print("total: ", total)

There is a limited number of such labeled data items, as many of the downloaded issues do not have this label assigned. The print at the end of the above code shows the total number of items with the “component” label given, and the number in this dataset is 1078.

Besides removing stop-words and otherwise cleaning up the documents for NLP, combining words sometimes makes sense. Pairs, triplets, and so on are sometimes meaningful. A typical example is the words “new” and “york” in a document, versus “new york”. This would be an example of a bi-gram, combining two words into “new_york”. To do this, I use the gensim package:

import gensim

#https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
# Build the bigram and trigram models
bigram = gensim.models.Phrases(all_text_tokens, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[all_text_tokens], threshold=100)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

#just to see it works
print(trigram_mod[bigram_mod[all_text_tokens[4]]])

#transform identified word pairs and triplets into bigrams and trigrams
text_tokens = [trigram_mod[bigram_mod[text]] for text in all_text_tokens]

#build whole documents from text tokens. some algorithms work on documents not tokens.
texts = [" ".join(tokens) for tokens in text_tokens]

The above code uses thresholds and minimum co-occurrence counts to avoid combining every possible word with every other possible word. So only word-pairs and triplets that commonly are found to occur together are used (replaced) in the document.

Use the Python data processing library Pandas to turn it into suitable format for the machine learning algorithms:

from pandas import DataFrame

df = DataFrame()

df["texts"] = texts
df["text_tokens"] = text_tokens
df["component"] = component_labels

print(df.head(2))

First to have a look at the data:

#how many issues are there in our data for all the target labels, assigned component counts
value_counts = df["component"].value_counts()
#print how many times each component/target label exists in the training data
print(value_counts)
#remove all targets for which we have less than 10 training samples.
#K-fold validation with 5 splits requires min 5 to have 1 in each split. This makes it 2, which is still tiny but at least it sorta runs
indices = df["component"].isin(value_counts[value_counts > 9].index)
#"indices" is a boolean mask with True for rows whose component has at least 10 samples. df.loc with the mask keeps only those rows
df = df.loc[indices, :]

The above code actually already does a bit more. It also filters the dataset to remove the rows with component values that have fewer than 10 items. So this is the unfiltered list:

component: tools              162
component: hal                128
component: export             124
component: networking         118
component: drivers            110
component: rtos                88
component: filesystem          80
component: tls                 78
component: docs                60
component: usb                 54
component: ble                 38
component: events              14
component: cmsis               10
component: stdlib               4
component: rpc                  4
component: uvisor               2
component: greentea-client      2
component: compiler             2

And after filtering, the last five rows will have been removed. So in the end, the dataset will not have any rows with labels “stdlib”, “rpc”, “uvisor”, “greentea-client”, or “compiler”. This is because I will later use stratified 5-fold cross-validation and this requires a minimum of 5 items of each. Filtering with a minimum of 10 instances for a label, it is at least possible to have 2 of the least common “component” in each fold.

In a more realistic case, much more data would be needed to cover all categories, and I would also look at possibly combining some of the different categories. And rebuilding the model every now and then, depending on how much effort it is, how much new data comes in, etc.

To use the “component” values as target labels for machine learning, they need to be numerical (integers). This does the transformation:

from sklearn.preprocessing import LabelEncoder

# encode class values as integers
encoder = LabelEncoder()
encoded_label = encoder.fit_transform(df.component)

Just to see how the mapping of integer id’s to labels after label encoding looks:

import numpy as np

unique, counts = np.unique(encoded_label, return_counts=True)
print(unique) #the set of unique encoded labels
print(counts) #the number of instances for each label

The result (first line = id, second line = number of items):

[ 0  1  2  3  4  5  6  7  8  9 10 11 12]
[ 38  10  60 110  14 124  80 128 118  88  78 162  54]

Mapping the labels to integers:

#which textual label/component name matches which integer label
le_name_mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
print(le_name_mapping)

#which integer matches which textual label/component name
le_id_mapping = dict(zip(encoder.transform(encoder.classes_), encoder.classes_))
print(le_id_mapping)

So the first is to print “label: id” pairs, and the second to print “id: label” pairs. The first one looks like this:

'component: ble': 0, 
'component: cmsis': 1, 
'component: docs': 2, 
'component: drivers': 3, 
'component: events': 4, 
'component: export': 5, 
'component: filesystem': 6, 
'component: hal': 7, 
'component: networking': 8, 
'component: rtos': 9, 
'component: tls': 10, 
'component: tools': 11, 
'component: usb': 12

Now, to turn the text into suitable input for a machine learning algorithm, I transform the documents into their TF-IDF representation. Well, if you go all deep learning with LSTM and the like, this may not be necessary. But don’t take my word for it, I am still trying to figure some of that out.

TF-IDF stands for term frequency (TF) – inverse document frequency (IDF). For example, if the word “bob” appears often in a document, it has a high term frequency for that document. Generally, one might consider such a word to describe that document well (or the concepts in the document). However, if the same word also appears commonly in all the documents (in the “corpus”), it is not really specific to that document, and not very representative of that document vs all other documents in the corpus. So IDF is used to modify the TF so that words that appear often in a document but less often in others in the corpus get a higher weight. And if the word appears often across many documents, it gets a lower weight. This is TF-IDF.

Traditional machine learning approaches also require a more fixed size set of input features. Since documents are of varying length, this can be a bit of an issue. Well, I believe some deep learning models also require this (e.g., CNN), while others less so (e.g., sequential models such as LSTM). Digressing. TF-IDF also (as far as I understand) results in a fixed length feature vector for all documents. Or read this on Stack Overflow and make your own mind up.

Anyway, to my understanding, the feature space (set of all features) after TF-IDF processing becomes the set of all unique words across all documents. Each of these is given a TF-IDF score for each document. For the words that do not exist in a document, the score is 0. And most documents don’t have all words in them, so this results in a very “sparse matrix”, where the zeroes are not really stored. That’s how you can actually process some reasonable sized set of documents in memory.
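To make the above a bit more concrete, here is a tiny hand-rolled TF-IDF using the basic textbook formula (term frequency times the log of inverse document frequency). This is only an illustration with made-up documents. The scikit-learn vectorizer used below applies smoothing and normalization on top, so its numbers will differ.

```python
import math

def tfidf_vectors(documents):
    """documents: list of token lists. Returns the vocabulary and one fixed-length vector per document."""
    vocabulary = sorted({word for doc in documents for word in doc})
    n_docs = len(documents)
    # in how many documents does each word appear
    doc_freq = {w: sum(1 for doc in documents if w in doc) for w in vocabulary}
    vectors = []
    for doc in documents:
        vec = []
        for w in vocabulary:
            tf = doc.count(w) / len(doc)          # how common is the word in this document
            idf = math.log(n_docs / doc_freq[w])  # how rare is the word across the corpus
            vec.append(tf * idf)
        vectors.append(vec)
    return vocabulary, vectors

docs = [
    ["bob", "likes", "bob", "the", "build"],
    ["the", "build", "fails"],
    ["the", "driver", "crashes"],
]
vocabulary, vectors = tfidf_vectors(docs)
print(vocabulary)
for vec in vectors:
    print([round(v, 3) for v in vec])
```

Every document gets a vector as long as the vocabulary, with zeros for the words it does not contain. The word “the” appears in all documents so it scores zero everywhere, while “bob” scores highest in the first document.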

So, in any case, to convert the documents to TF-IDF representation:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5)

#transform all documents into TFIDF vectors.
#TF-IDF vectors are all same length, same word at same index, value is its TFIDF for that word in that document
#the input features are the cleaned document texts from the filtered dataframe
features = df["texts"]
features_transformed = vectorizer.fit_transform(features)

Above code fits the vectorizer to the corpus and then transforms all the documents to their TF-IDF representations. To my understanding (from SO), the fit part counts the word occurrences in the corpus, and the transform part uses these overall counts to transform each document into TF-IDF.

It is possible also to print out all the words the TF-IDF found in the corpus:

#the TFIDF feature names is a long list of all unique words found
#get_feature_names_out() replaces get_feature_names() in newer scikit-learn versions, and returns a numpy array
feature_names = vectorizer.get_feature_names_out()
print(feature_names)
print(len(feature_names))

Now to train a classifier to predict the component based on a given document:

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

kfold = StratifiedKFold(n_splits=5) #5-fold cross validation

#the classifier to use, the parameters are selected based on a set i tried before
clf = RandomForestClassifier(n_estimators=50, min_samples_leaf=1, min_samples_split=5)

results = cross_val_score(clf, features_transformed, encoded_label, cv=kfold)

print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

#fit the classifier on the TFIDF transformed word features, train it to predict the component
clf.fit(features_transformed, encoded_label)
probabilities = clf.predict_proba(features_transformed[0])
print(probabilities)

In the above I am using a RandomForest classifier, with a set of parameters previously tuned. I am also using 5-fold cross validation, meaning the data is split into 5 different parts. The parts are “stratified”, meaning each fold has about the same percentage of each target label as the original set. This is why I removed the labels with fewer than 10 instances in the beginning, to have at least 2 for each class. Which is still super-tiny but that’s what this example is about.

The last part of the code above also runs a prediction on one of the transformed documents just to try it out.

Now, to run predictions on previously unseen documents:

import requests

def predict_component(issue):
	'''
	use this to get a set of predictions for a given issue.

	:param issue: issue id from github.
	:return: list of tuples (component name, probability)
	'''
	#first load text for the given issue from github
	url = "https://api.github.com/repos/ARMmbed/mbed-os/issues/" + str(issue)
	r = requests.get(url)
	url_json = json.loads(r.content)
	print(url_json)
	#process the loaded issue data to format matching what the classifier is trained on
	issue_tokens = preprocess_report(url_json["body"])
	issue_tokens = trigram_mod[bigram_mod[issue_tokens]]
	issue_text = " ".join(issue_tokens)
	features_transformed = vectorizer.transform([issue_text])
	#and predict the probability of each component type
	probabilities = clf.predict_proba(features_transformed)
	result = []
	for idx in range(probabilities.shape[1]):
		name = le_id_mapping[idx]
		prob = (probabilities[0, idx]*100)
		prob_str = "%.2f%%" % prob
		print(name, ":", prob_str)
		result.append((name, prob_str))
	return result

This code takes as a parameter an issue number for the ARM mBed Github repo. It downloads the issue data and preprocesses it the same way as the training data (clean, tokenize, lemmatize, TF-IDF). The result is then used as a set of features to predict the component, based on the model trained earlier.
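The list returned above is in class-id order, so a caller might want only the few most likely components. A minimal sketch of that ranking step, assuming one row of probabilities and an id-to-name mapping shaped like “le_id_mapping”; the function name and the example values are hypothetical:

```python
def top_components(probabilities, id_to_name, n=3):
    """Return the n most likely (component name, probability) pairs.

    probabilities: list of floats, one per class id (one row of predict_proba)
    id_to_name: dict mapping class id -> component name
    """
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return [(id_to_name[idx], prob) for idx, prob in ranked[:n]]

# hypothetical values, standing in for one row of clf.predict_proba(...)
example_probs = [0.05, 0.60, 0.10, 0.25]
example_names = {0: "ble", 1: "wifi", 2: "tools", 3: "rtos"}
for name, prob in top_components(example_probs, example_names, n=2):
    print("%s: %.2f%%" % (name, prob * 100))
```

Keeping the probabilities as numbers until the last moment, instead of formatting them as strings right away, makes this kind of sorting easier.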

The “predict_component” method/function can then be called from elsewhere. In my case, I made a simple web page to call it. As noted in the beginning of this post, you can find that webserver code, as well as all the code above on my Github repository.

That’s pretty much it. Not very complicated to put some Python lines one after another, but knowing which lines and in which order is perhaps what takes the time to learn. Having someone else around to do it for you if you are a domain expert (e.g., testing, software engineering or similar in this case) is handy, but it can also be useful to have some idea of what happens, or how the algorithms in general work.

Something I left out in all the above was the code to try out different classifiers and their parameters. So I will just put it below for reference.

First a few helper methods:

def top_tfidf_feats(row, features, top_n=25):
	''' Get top n tfidf values in row and return them with their corresponding feature names.'''
	topn_ids = np.argsort(row)[::-1][:top_n]
	top_feats = [(features[i], row[i]) for i in topn_ids]
	df = pd.DataFrame(top_feats)
	df.columns = ['feature', 'tfidf']
	return df

#this prints it for the first document in the set
arr = features_test_transformed[0].toarray()
top_tfidf_feats(arr[0], feature_names)

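def show_most_informative_features(vectorizer, clf, n=20):
	#note: newer scikit-learn versions renamed this to get_feature_names_out()
	feature_names = vectorizer.get_feature_names()
	#weights of the first class; needs a classifier that exposes coef_
	coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
	#pair up the n lowest-weighted features with the n highest-weighted ones
	top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
	for (coef_1, fn_1), (coef_2, fn_2) in top:
		print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))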

In the above code, “top_tfidf_feats” lists the words with the highest TF-IDF scores for a document. So in a sense, it lists the words that TF-IDF has determined to most uniquely represent that document.
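To illustrate why the top TF-IDF words are the “most unique” ones, here is a minimal sketch using the textbook formula tf * log(N / df) in plain Python. Scikit-learn’s TfidfVectorizer uses a smoothed and normalized variant, so its numbers will differ; the three toy documents are made up, and the sketch assumes no document is empty:

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Compute textbook TF-IDF (tf * log(N / df)) for tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # count each term once per document it appears in
        doc_freq.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

def top_terms(doc_scores, top_n=3):
    """Return the top_n (term, score) pairs with the highest score."""
    return sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)[:top_n]

# hypothetical, already tokenized bug report texts
docs = [
    "wifi connection drops after reset".split(),
    "ble connection fails after pairing".split(),
    "wifi driver crash on reset".split(),
]
scores = tfidf_scores(docs)
print(top_terms(scores[0]))
```

For the first document, “drops” comes out on top because it appears in no other document, while words shared with other documents get a lower score.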

The “show_most_informative_features” function prints the top features that a given classifier has determined to be most descriptive/informative for distinguishing target labels. This only works for certain classifiers that have such simple coefficients (feature weights), such as multinomial naive Bayes (MultinomialNB below).

Here is the code to actually try out the classifiers then:

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(features_train_transformed, labels_train)

from sklearn.metrics import accuracy_score

y_pred = clf.predict(features_test_transformed)
y_true = labels_test
acc_score = accuracy_score(y_true, y_pred)
print("MNB accuracy:"+str(acc_score))

show_most_informative_features(vectorizer, clf)

#try it out on a single document
probabilities = clf.predict_proba(features_test_transformed[0])
print(probabilities)

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

#set of parameters to try
estimators = [10, 20, 30, 40, 50]
min_splits = [5, 10, 20, 30, 40, 50]
min_leafs = [1, 2, 5, 10, 20, 30]

kfold = StratifiedKFold(n_splits=5) #5-fold cross validation

best_acc = 0.0
best_rf = None
for estimator in estimators:
	for min_split in min_splits:
		for min_leaf in min_leafs:
			print("estimators=", estimator, "min_split=", min_split, " min_leaf=", min_leaf)

			clf = RandomForestClassifier(n_estimators=estimator, min_samples_leaf=min_leaf, min_samples_split=min_split)

			results = cross_val_score(clf, features_transformed, encoded_label, cv=kfold)

			print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

			if results.mean() > best_acc:
				best_acc = results.mean()
				best_rf = clf
				print("found better:", best_acc, ", ", best_rf)

print("best classifier:")
print(best_rf)

best_acc = 0
best_rf = None
for estimator in estimators:
	for min_split in min_splits:
		for min_leaf in min_leafs:
			print("estimators=", estimator, "min_split=", min_split, " min_leaf=", min_leaf)

			clf = RandomForestClassifier(n_estimators=estimator, min_samples_leaf=min_leaf, min_samples_split=min_split)
			clf.fit(features_train_transformed, labels_train)

			pred = clf.predict(features_test_transformed)

			accuracy = accuracy_score(labels_test, pred)

			print(accuracy)

			if accuracy > best_acc:
				best_acc = accuracy
				best_rf = clf
				print("found better:", best_acc, ", ", best_rf)
	

In the code above, I use loops to run through the parameters. Scikit-learn also has GridSearchCV for this, as well as RandomizedSearchCV (for cases where trying all combos is expensive). But I prefer the ability to control the loops, print out whatever I like and all that.
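If the three nested loops get unwieldy, one middle ground is to flatten them with itertools.product while still keeping full control over what is printed and stored. A minimal sketch; the “evaluate” function is a hypothetical stand-in for training and scoring a classifier, with a toy score so the example runs on its own:

```python
from itertools import product

def evaluate(n_estimators, min_split, min_leaf):
    """Hypothetical stand-in for training and scoring a classifier."""
    # toy score: peaks at n_estimators=30, min_split=10, min_leaf=2
    return 1.0 / (1 + abs(n_estimators - 30) + abs(min_split - 10) + abs(min_leaf - 2))

estimators = [10, 20, 30, 40, 50]
min_splits = [5, 10, 20, 30, 40, 50]
min_leafs = [1, 2, 5, 10, 20, 30]

best_score, best_params = 0.0, None
# product() yields every (estimator, min_split, min_leaf) combination once
for params in product(estimators, min_splits, min_leafs):
    score = evaluate(*params)
    if score > best_score:
        best_score, best_params = score, params
print("best:", best_params, best_score)
```

Storing the winning parameter tuple rather than the classifier object also makes it easy to re-create and re-fit the best model afterwards.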

The above code also shows two ways I tried to train/evaluate the RandomForest parameters. The first is with k-fold, the latter with a single train-test split. I picked MultinomialNB and RandomForest because some internet searching gave me the impression they might work reasonably well for unbalanced class sets such as this one. Of course the final idea is always to try and see what works. This worked quite fine for me. Or so it seems; machine learning seems to be always about iterating, learning and updating as you go. More data could change this all, or maybe finding some mistake, or having more domain or analytics knowledge, finding mismatching results, or anything really.

What the unbalanced refers to is the number of instances of different components in this dataset: some “components” have many bug reports, while others have far fewer. For many learning algorithms this seems to be an issue. Some searches indicated RandomForest should be fairly robust for this type of data, so this is also one reason I used it.
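A quick way to see how unbalanced a label set is, is to count the reports per component and compare the largest and smallest class. A minimal sketch with a made-up label list; the threshold of 10 mirrors the filtering mentioned earlier, and the function name is hypothetical:

```python
from collections import Counter

def label_distribution(labels, min_count=10):
    """Count reports per component and split them into kept and dropped labels."""
    counts = Counter(labels)
    kept = {label: n for label, n in counts.items() if n >= min_count}
    dropped = {label: n for label, n in counts.items() if n < min_count}
    return kept, dropped

# hypothetical label list, one entry per bug report
labels = ["wifi"] * 120 + ["ble"] * 35 + ["tools"] * 12 + ["docs"] * 4
kept, dropped = label_distribution(labels)
largest, smallest = max(kept.values()), min(kept.values())
print("kept:", kept)
print("dropped:", dropped)
print("imbalance ratio: %.1f" % (largest / smallest))
```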

Running the above code to experiment with the parameters also produced some slightly concerning results. The accuracy for the classifier ranged from 30% to 95% with smallish parameter changes. I would guess that also speaks for the small dataset causing potential issues. Also, re-running the same code would give different classifications for new (unseen) instances. Which is what you might expect when I am not setting the randomization seed. But then I would also expect the accuracy to vary somewhat, which it didn’t. So just don’t take this as more than an example of how you might apply ML for some SW testing related tasks. Take it to highlight the need to always learn more, try more things, and get a decent amount of data, evolve models constantly, etc. And post some comments on all the things you think are wrong in this post/code so we can verify the approach of learning and updating all the time :).

In any case, I hope the example is useful for giving an idea of one way how machine learning could be applied in software testing related aspects. Now write me some nice LSTM or whatever is the latest trend in deep learning models, figure out any issues in my code, or whatever, and post some comments. Cheers.

Harharhar said the Santa when capturing browser test metrics with Webdriver and LittleMobProxy

In the modern age of big data and all that, it is trendy to capture as much data as we can. So this is an attempt at capturing data on a set of web browsing tests I run on Selenium WebDriver. This is with Java using Selenium WebDriver and LittleMobProxy.

What I do here is configure an instance of LittleMobProxy to capture the traffic to/from a webserver in our tests. The captured data is written to a HAR file (HTTP archive file), the HAR file is parsed and the contents are used for whatever you like. In this case I dump them to InfluxDB and show some graphs on the generated sessions using Grafana.

This can be useful to see how much bandwidth your website uses, how many requests end up being generated, what elements are slowest to load, how your server caching configurations affect everything, and so on. I have used it to provide data for overall network performance analysis by simulating a set of browsers and capturing the data on their sessions.

First up, start the LittleMobProxy instance, create a WebDriver instance, and configure the WebDriver instance to use the LittleMobProxy instance:

    // start the proxy
    BrowserMobProxy proxy = new BrowserMobProxyServer();
    proxy.start(0);
    // get the Selenium proxy object
    Proxy seleniumProxy = ClientUtil.createSeleniumProxy(proxy);

    // configure it as a desired capability
    DesiredCapabilities capabilities = new DesiredCapabilities();
    capabilities.setCapability(CapabilityType.PROXY, seleniumProxy);

    driver = new ChromeDriver(capabilities);
    proxy.newHar("my_website_data");

Then we run some scripts on our website. The following is just a simple example as I do not wish to put here bots for browsing common websites. I have prototyped some on browsing various videos on YouTube and on browsing different news portals. Generally this might be against the terms of service on public websites, so either use your own service you are testing (with a test service instance) or download a Wikipedia dump or something similar to run your tests on. Example code:

  public List<WebElement> listArticles() {
    List<WebElement> elements = driver.findElements(By.className("news"));
    List<WebElement> invisibles = new ArrayList<>();
    for (WebElement element : elements) {
      if (!element.isDisplayed()) {
        System.out.println("Not displayed:"+element);
        invisibles.add(element);
      }
    }
    elements.removeAll(invisibles);
    List<WebElement> articles = new ArrayList<>();
    List<String> descs = new ArrayList<>();
    for (WebElement element : elements) {
      List<WebElement> links = element.findElements(By.tagName("a"));
      for (WebElement link : links) {
        //we only take long links as this website has some "features" in the content portal causing pointless short links.
        //This also removes the "share" buttons for facebook, twitter, etc. which we do not want to hit
        //A better alternative might be to avoid links leading out of the domain 
        //(if you can figure them out..)
        //this is likely true for the ads as well..
        if (link.getText().length() > 20) {
          descs.add(link.getText());
          articles.add(link);
        }
      }
    }
    return articles;
  }

  public void openRandomArticle() throws Exception {
    List<WebElement> articles = listArticles();
    //sometimes our randomized user might hit a seemingly dead end on the article tree, 
    //in which case we just go to the news portal main page ("base" here)
    if (articles.size() == 0) {
      driver.get(base);
      Thread.sleep(2000);
      articles = listArticles();
    }
    //this is a random choice of the previously filtered article list
    WebElement link = TestUtils.oneOf(articles);
    Actions actions = new Actions(driver);
    actions.moveToElement(link);
    actions.click();
    actions.perform();

    Har har = proxy.getHar();
    har.writeTo(new File("my_website.har"));
    //if we just want to print it, we can do this..
    WebTestUtils.printHar(har);
    //or to drop stats in a database, do something like this
    WebTestUtils.influxHar(har);
  }

The code to access the data in the HAR file:

  public static void printHar(Har har) {
    HarLog log = har.getLog();
    List<HarPage> pages = log.getPages();
    for (HarPage page : pages) {
      String id = page.getId();
      String title = page.getTitle();
      Date started = page.getStartedDateTime();
      System.out.println("page: id=" + id + ", title=" + title + ", started=" + started);
    }
    List<HarEntry> entries = log.getEntries();
    for (HarEntry entry : entries) {
      String pageref = entry.getPageref();
      long time = entry.getTime();

      HarRequest request = entry.getRequest();
      long requestBodySize = request.getBodySize();
      long requestHeadersSize = request.getHeadersSize();
      String url = request.getUrl();

      HarResponse response = entry.getResponse();
      long responseBodySize = response.getBodySize();
      long responseHeadersSize = response.getHeadersSize();
      int status = response.getStatus();

      System.out.println("entry: pageref=" + pageref + ", time=" + time + ", reqBS=" + requestBodySize + ", reqHS=" + requestHeadersSize +
              ", resBS=" + responseBodySize + ", resHS=" + responseHeadersSize + ", status=" + status + ", url="+url);
    }
  }

So we can use this in many ways. Above I have just printed out some of the basic stats. Some example information on what a HAR file contains can be found on the internet, e.g. https://confluence.atlassian.com/display/KB/Generating+HAR+files+and+Analysing+Web+Requests. Further below I show some simple data from browsing a local news site, visualized in Grafana using InfluxDB as a backend.
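Since a HAR file is plain JSON, the stats can also be pulled out of the written file with a few lines of script, outside the Java tests. A minimal Python sketch that groups entries by page and computes request counts and load time stats; the function name and the tiny hand-written HAR fragment are made up for illustration, and only the “log”, “entries”, “pageref” and “time” fields of the HAR format are used:

```python
import json
from collections import defaultdict

def summarize_har(har):
    """Group HAR entries by page and report request count and load time stats."""
    times_by_page = defaultdict(list)
    for entry in har["log"]["entries"]:
        # "pageref" is optional in HAR, so fall back to a placeholder
        times_by_page[entry.get("pageref", "unknown")].append(entry["time"])
    summary = {}
    for page, times in times_by_page.items():
        summary[page] = {
            "requests": len(times),
            "min_ms": min(times),
            "max_ms": max(times),
            "avg_ms": sum(times) / len(times),
        }
    return summary

# tiny hand-written HAR fragment; a real file would be read from disk
# and passed through json.load() instead
har = json.loads("""
{"log": {"entries": [
  {"pageref": "page_1", "time": 120, "request": {"url": "http://localhost/"}},
  {"pageref": "page_1", "time": 40, "request": {"url": "http://localhost/style.css"}},
  {"pageref": "page_2", "time": 80, "request": {"url": "http://localhost/news/1"}}
]}}
""")
summary = summarize_har(har)
for page, stats in summary.items():
    print(page, stats)
```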

Here is some example code to write some of the HAR stats to InfluxDB:

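  //index of the first HAR entry not yet stored; the HAR keeps accumulating entries across calls
  private static int index = 0;

  public static void influxHar(Har har) {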
    HarLog harLog = har.getLog();
    List<HarPage> pages = harLog.getPages();
    for (HarPage page : pages) {
      String id = page.getId();
      String title = page.getTitle();
      Date started = page.getStartedDateTime();
      System.out.println("page: id=" + id + ", title=" + title + ", started=" + started);
    }
    List<HarEntry> entries = harLog.getEntries();
    long now = System.currentTimeMillis();
    int counter = 0;
    for (int i = index ; i < entries.size() ; i++) {
      HarEntry entry = entries.get(i);
      String pageref = entry.getPageref();
      long loadTime = entry.getTime();

      HarRequest request = entry.getRequest();
      if (request == null) {
        log.debug("Null request, skipping HAR entry");
        continue;
      }
      HarResponse response = entry.getResponse();
      if (response == null) {
        log.debug("Null response, skipping HAR entry");
        continue;
      }

      Map<String, Long> data = new HashMap<>();
      data.put("loadtime", loadTime);
      data.put("req_head", request.getHeadersSize());
      data.put("req_body", request.getBodySize());
      data.put("resp_head", response.getHeadersSize());
      data.put("resp_body", response.getBodySize());
      InFlux.store("browser_stat", now, data);
      counter++;
    }
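    //remember how far we got, counting skipped entries too, so the next call only handles new entries
    index = entries.size();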
    System.out.println("count:"+counter);
  }

And the code to write the data into InfluxDB:

  private static InfluxDB db;

  static {
    if (Config.INFLUX_ENABLED) {
      db = InfluxDBFactory.connect(Config.INFLUX_URL, Config.INFLUX_USER, Config.INFLUX_PW);
      db.enableBatch(2000, 1, TimeUnit.SECONDS);
      db.createDatabase(Config.INFLUX_DB);
    }
  }

  public static void store(String name, long time, Map<String, Long> data) {
    if (!Config.INFLUX_ENABLED) return;
    Point.Builder builder = Point.measurement(name)
            .time(time, TimeUnit.MILLISECONDS)
            .tag("tom", name);
    for (String key : data.keySet()) {
      builder.field(key, data.get(key));
    }
    Point point = builder.build();
    log.debug("storing:"+point);
    //you should have enabled batch mode (as shown above) on the driver or this will bottleneck
    db.write(Config.INFLUX_DB, "default", point);
  }
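To get a feel for what ends up going over the wire, here is a minimal Python sketch that formats one such point in the InfluxDB line protocol, which as far as I understand is what the client library sends. It is only an illustration under assumptions: the function name is hypothetical, only integer fields are handled, and no escaping of spaces, commas or equals signs is done:

```python
def to_line_protocol(measurement, tags, fields, timestamp_ms):
    """Format one point as InfluxDB line protocol (integer fields only).

    Assumes names and tag values contain no spaces, commas or equals signs,
    so no escaping is done here.
    """
    tag_part = "".join(",%s=%s" % (k, v) for k, v in sorted(tags.items()))
    # integer field values carry an "i" suffix in line protocol
    field_part = ",".join("%s=%di" % (k, v) for k, v in sorted(fields.items()))
    return "%s%s %s %d" % (measurement, tag_part, field_part, timestamp_ms)

# hypothetical values matching the measurement and tag used above
line = to_line_protocol(
    "browser_stat",
    {"tom": "browser_stat"},
    {"loadtime": 120, "resp_body": 5120},
    1500000000000,
)
print(line)
```

Note that the timestamp here is in milliseconds, so a write using this line would need to state millisecond precision, since the line protocol assumes nanoseconds by default.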

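And here is some example data visualized with Grafana for some metrics I collected this way:

[Image: Grafana dashboard with two charts. Upper: minimum, maximum and average load times. Lower: number of elements loaded per click.]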

In the lower line chart, the number of elements loaded on click is shown. This refers to how many HTTP requests are generated per WebDriver click on the website. The upper line/plot chart shows the minimum, maximum and average load times for different requests/responses. That is, how much time it took for the server to send back the HTTP responses for the clicks.

This shows how the first few page loads generated a high number of HTTP requests/responses. After this, the amount goes down and stays quite steady at a lower level. I assume this is due to the browser having cached much of the static content and not needing to request it all the time. Occasionally our simulated random browser user enters a slightly less explored path on the webpage, causing a small short-term spike.

This nicely shows how modern websites end up generating surprisingly large numbers of requests. Also this shows how some requests are pretty slow to respond, so this might be a useful point to investigate for optimizing overall response time.

That’s it. Not too complicated but I find it rather interesting. It also does not require too many modifications to the existing WebDriver tests, just taking the proxy component into use, parsing the HAR file and writing to the database.