Remote Execution in PyCharm

Editing and Running Python Code on a Remote Server in PyCharm

Recently I was looking at an option to run some code on a remote server, while editing it locally. This time on AWS, but in general the ability to do so on any remote server would be nice. I found that PyCharm has this nice option to use a Python SSH interpreter. Give it some SSH credentials, point it to the Python interpreter on the remote machine, and you should be ready to go. Nice pic about it:


Sounds cool, and actually works really well. Even supports debugging. A related issue I ran into for pipenv also mentions profiling, pip package management, etc. Great. No, I haven’t tried all the advanced stuff yet, but at least the basics worked great.

Basic Remote Use

I made this simple program to test this feature:

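print("hello world")
with open("bob.txt", "w") as bob:
    # the written content is arbitrary; the point is that the file gets created
    bob.write("hello")
    print("oops")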

The point is to print text to the console and create a file. I am looking to see that running this remotely will show me the prints locally, and create the file remotely. This would confirm to me that the execution happens remotely, while I edit, control execution, and see the results locally.

Running this locally prints "hello world" followed by "oops", and a file named "bob.txt" appears. Great.
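A slightly more telling variant of the test, as a minimal sketch: it reports the host name, working directory, and interpreter path, so the console output itself shows where the code ran. The function name where_am_i is my own invention, nothing PyCharm provides.

```python
import os
import socket
import sys


def where_am_i():
    """Collect a few facts that differ between the local and the remote machine."""
    return {
        "host": socket.gethostname(),
        "cwd": os.getcwd(),
        "python": sys.executable,
    }


if __name__ == "__main__":
    for key, value in where_am_i().items():
        print(f"{key}: {value}")
```

Run locally and then with the remote interpreter, the three values should differ if the execution really moved to the server.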

To try remotely, I need to set up a remote Python interpreter in PyCharm. This can be done via project preferences:

Add interpreter

Or by clicking the interpreter in the status bar:

Statusbar interpreter

On a local configuration this shows the Python interpreter (or pipenv etc.) on my computer. In remote configuration it asks for many options such as remote server IP and credentials. All the run/debugging traffic between local and remote machines is then automatically transferred over SSH tunnels by PyCharm. To start, select SSH interpreter as type when adding new interpreter:

SSH interpreter

Just enter the remote IP/URL address, and username. Click next to enter also password/keyfile. PyCharm will try to connect and see this all works. On the final page of the remote interpreter dialog, it asks for the interpreter path:

Remote Python config

This is referring to the python executable on the remote machine. A simple which python3 does the trick. This works to run the code using the system python on the remote machine.

To run this remote configuration, I just press the run button as usual in PyCharm. With this, PyCharm uploads my project files to the remote server over SSH, starts the interpreter there for the given configuration, and transports back to my local host the console output of the execution. For me it looks exactly the same as running it locally. This is the output of running the above configuration:

ssh://ec2-user@ -u /tmp/pycharm_project_411/
hello world

The first line shows some useful information. It shows that it is using the SSH interpreter with the given IP and username, with the configured Python path. It also shows the directory where it has uploaded my project files. In this case it is "/tmp/pycharm_project_411". This is the path defined in Project Interpreter settings in the Path Mappings part, as illustrated higher above in image (with too many red arrows) in this post. OK, the attached image further above has a different number due to playing with different projects but anyway. To see the files and output:

[ec2-user@ip-172-31-3-125 ~]$ cd /tmp/pycharm_project_411/
[ec2-user@ip-172-31-3-125 pycharm_project_411]$ ls

This is the file listing from the remote server. PyCharm has uploaded my script file, since this was the only file I had in my project (under project root as configured for sync in path mappings). There is a separate tab in PyCharm to see these uploads:

Remote synch

After syncing the files, PyCharm has executed the configuration on the remote host, which was defined to run that file. And this execution has created the file "bob.txt" as it should (on the remote host). The output files go in this remote target directory, as it is the working directory for the running Python program.

Another direction to synchronize is from the remote host to local. Since PyCharm provides intelligent coding assistance and navigation on local system, it needs to know and install the libraries used by the executed code. For this reason, it installs all the packages installed in the remote host Python environment. Something to keep in mind. I suppose it must install some type of a local virtual environment for this. Haven’t needed to look deeper on that yet.
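To see what there is to synchronize, the packages visible to an interpreter can be listed from Python itself. A minimal sketch, assuming Python 3.8+ for importlib.metadata; the helper name installed_packages is mine. Running it with the remote interpreter shows what PyCharm has to know about locally.

```python
from importlib import metadata


def installed_packages():
    """Return a sorted list of (name, version) pairs for the running interpreter."""
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with broken metadata
            packages.add((name, dist.version))
    return sorted(packages, key=lambda pkg: pkg[0].lower())


if __name__ == "__main__":
    for name, version in installed_packages():
        print(f"{name}=={version}")
```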

Using a Remote Pipenv

The above discusses the usage of standard Python run configuration and interpreter. Something I have found useful for Python environments is pipenv.

So can we also do a remote execution of a remote pipenv configuration? The issue I linked earlier contains solutions and discussion on this. Basically, the answer is, yes we can. Just have to find the pipenv files on the remote host and configure the right one as the remote interpreter.

For more complex environments, such as those set up with pipenv, a bit more is required. The issue I linked before had some actual instructions on how to do this:

Remote pipenv config

I made a directory "t" on the remote host, and initialized pipenv there. Installed a few dependencies. So:

  • mkdir t
  • cd t
  • pipenv install pandas

And there we have the basic pipenv setup on the remote host. To find the pipenv dir on remote host (t is the dir where pipenv was created above):

[ec2-user@ip-172-31-3-125 t]$ pipenv --venv

To see what it contains:

[ec2-user@ip-172-31-3-125 t]$ ls /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c
bin  include  lib  lib64  src
[ec2-user@ip-172-31-3-125 t]$ ls /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin
activate       activate.ps1      chardetect        pip     python     python-config
activate.csh  easy_install      pip3    python3    wheel  activate.xsh      easy_install-3.7  pip3.7  python3.7

To get python interpreter name:

[ec2-user@ip-172-31-3-125 t]$ pipenv --py

This is just a link to python3:

[ec2-user@ip-172-31-3-125 t]$ ls -l /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin/python
lrwxrwxrwx 1 ec2-user ec2-user 7 Nov  7 20:55 /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin/python -> python3

Use that to configure this pipenv as remote executor, as shown above already:

Remote pipenv config


I haven’t used this feature on a large scale yet, but it seems very useful. The issue I keep linking discusses one option of using it to run data processing on a large desktop system from a laptop. I also find it interesting for just running experiments in parallel on a separate machine, or for using cloud infrastructure while developing.

The issue also has some discussion with potential pipenv management from PyCharm coming in 2020.1 or 2020.2 version. Just speculation, of course. But until then one can set up the virtualenv using pipenv on remote host and just use the interpreter path above to set up the SSH Interpreter. This works to run the code inside the pipenv environment.
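A quick way to confirm the run really happens inside the pipenv environment is to ask the interpreter itself. A minimal sketch using only the standard library; the function name in_virtualenv is my own.

```python
import sys


def in_virtualenv():
    """Return True if the interpreter runs inside a virtual environment."""
    # venv and recent virtualenv versions set base_prefix; old virtualenv set real_prefix.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base != sys.prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("prefix:", sys.prefix)
    print("virtualenv:", in_virtualenv())
```

With the SSH interpreter pointed at the pipenv python, I would expect the interpreter path to be under the virtualenvs directory listed above, and the last line to say True.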

Some issues I ran into included PyCharm apparently only keeping a single state mapping in memory for remote and local file diffs. PyCharm synchronizes files very well, and identifies changes to upload new files. But if I change the remote host address, it seems to still think it has the same delta. Not a big issue, but something to keep in mind as always.

That’s all.

Robot Framework by Examples


Robot Framework (RF) is a popular keyword driven test framework (at least in Finland it seems to be..). Recently had to look into it again for some potential work related opportunities. Have to say open source is great but the docs could use improvements..

I made a few examples for the next time I come looking:


Installing

RF itself installs with pip. The following installs RF, Selenium WebDriver, and the Selenium keyword library for RF:

pip3 install robotframework
pip3 install selenium
pip3 install robotframework-seleniumlibrary

Using Selenium WebDriver as an example here, a Selenium driver for the selected browser is needed. For Chrome, one can be downloaded from the Chrome website itself. Similarly for other browsers on their respective sites. The installed driver needs to be on the search path for the operating system. On macOS, this is as simple as adding it to the path. Assuming the driver is in the current directory:

export PATH=$PATH:.

So just the dot, which works as long as the driver file is in the working directory when running the tests.

In PyCharm, the PATH can also be similarly added to run configuration environment variables.
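Whether the driver is actually found can be checked from Python before starting any tests. A small sketch using the standard library; the helper name find_driver and its default arguments are my assumptions, with "." added to mimic the PATH setting discussed here.

```python
import os
import shutil


def find_driver(name="chromedriver", extra_dirs=(".",)):
    """Return the full path of the driver executable, or None if it is not found."""
    # Search the regular PATH first, then any extra directories.
    search_path = os.pathsep.join([os.environ.get("PATH", ""), *extra_dirs])
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    print(find_driver() or "chromedriver not found on the search path")
```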

General RF Script Structure

RF script elements are separated by a minimum of two spaces. This applies both to indenting test steps under a test, and to separating keywords from their parameters. There is also the pipe separated format which might look a bit fancier, if you like. Sections are identified by three stars *** and a pre-defined name for the section.

The following examples illustrate.


Built-in Keywords / Logging to console

The built-in keywords are available without needing to import a specific library. Rather they are part of the built-in library. Simple example of logging some statement to console:

The .robot script (hello.robot in this case):

*** Test Cases ***
Say Hello
    Log To Console    Hello Donkey
    No Operation
    Comment           Hello ${bob}

The built-in keyword "Log To Console" writes the given parameter to the console. A hello world equivalent. To run the test, we can either write code to invoke the RF runner from Python or use RF command line tools. Python runner example:

from robot.running import TestSuiteBuilder
from robot.api import ResultWriter

suite = TestSuiteBuilder().build("./hello.robot")
result = suite.run(output="test_output.xml")
#ResultWriter(result).write_results(report='report.html', log="log.html")
ResultWriter("test_output.xml").write_results(report='report.html', log="log.html")

The "hello.robot" in above is the name of the test script file listed above also.

The strangest thing (for me) here is the writing of the log file. The docs suggest to use the first approach I commented out above. The ResultWriter with the results object as a parameter. This generates the report.html and the log.html.

The problem is, the log.html is lacking all the prints, keywords, and test execution logs. Later on the same docs state that to get the actual logs, you have to pass in the name of the XML file that was created by the run() method. This is the uncommented approach in the above code. Since the results object is also generated from this call, why does it not give the proper log? Oh dear. I don’t understand.

Commandline runner example:

robot hello.robot

This seems to automatically generate an appropriate log file (including execution and keyword trace). There are also a number of command line options available, for all the properties I discuss next using the Python API. Maybe the general / preferred approach? But somehow I always end up needing to do my own executors to customize and integrate with everything, so..

Finally on logging, Robot Framework actually captures the whole stdout and stderr, so statements like print() get written to the RF log and not to actual console. I found this to be quite annoying and resulting in overly verbose logs with all the RF boilerplate/overhead. There is a StackOverflow answer on how to circumvent this though, from the RF author himself. I guess I could likely write my own keyword to use that if needed to get more log customization, but seems a bit complicated.
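As I understand it, the trick is to write to the interpreter's original stdout instead of the one RF has replaced. A minimal sketch of such a custom keyword as a plain Python function; the name log_to_real_console and the optional stream parameter are my own, not part of RF.

```python
import sys


def log_to_real_console(message, stream=None):
    """Write a line past any stdout capturing, straight to the original stream."""
    # sys.__stdout__ is the stream the interpreter started with, before any redirection.
    target = stream if stream is not None else sys.__stdout__
    target.write(f"{message}\n")
    target.flush()


if __name__ == "__main__":
    log_to_real_console("Hello Donkey")
```

Put in a library file and imported with the Library setting, this should be usable as a Log To Real Console keyword. The stream parameter is there mostly to make the function easy to try out without RF.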

Tags and Critical Tests

RF tags are something that can be used to filter and group tests. One use is to define some tests as "critical". If a critical test fails, the suite is considered failed.

Example of non-critical test filtering. First, defining two tests:

*** Test Cases ***
Say Hello Critical
    [Tags]            crit
    Log To Console    Hello Critical Donkey
    No Operation
    Comment           Hello ${bob}

Say Hello Non-Critical
    [Tags]            non-crit
    Log To Console    Hello Nice Donkey
    No Operation
    Comment           Hello ${bob}

Running them, while filtering with wildcard:

from robot.running import TestSuiteBuilder
from robot.api import ResultWriter

suite = TestSuiteBuilder().build("./noncritical.robot")
# noncritical is an RF 3.x option; the criticality concept was removed in RF 4.0
result = suite.run(output="test_output.xml", noncritical="*crit")
ResultWriter("test_output.xml").write_results(report='report.html', log="log.html")

The above classifies all tests that have tags matching the pattern "*crit" as non-critical. In this case, it includes both the tags "crit" and "non-crit", which is likely not what was intended. So the report for this actually shows 2 non-critical tests.

The same execution with a non-existent non-critical tag:

from robot.running import TestSuiteBuilder
from robot.api import ResultWriter

suite = TestSuiteBuilder().build("./noncritical.robot")
#this tag does not exist in the given suite, so no non-critical tests should be listed in report
result = suite.run(noncritical="non")
ResultWriter(result).write_results(report='report.html', log="log.html")

This runs all tests as critical, since no test has a tag of "non". To finally fix it, the filter should be exactly "non-crit". This would not match "crit" but would match exactly "non-crit".
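The three filters can be compared side by side with a small stand-in for the tag matching. This sketch uses fnmatch from the standard library as an approximation of simple wildcard patterns; it is not RF's actual matcher, and the helper name matching_tags is mine.

```python
from fnmatch import fnmatchcase


def matching_tags(pattern, tags):
    """Return the tags that a simple wildcard pattern selects (case-insensitive)."""
    return [tag for tag in tags if fnmatchcase(tag.lower(), pattern.lower())]


tags = ["crit", "non-crit"]
for pattern in ("*crit", "non", "non-crit"):
    print(pattern, "->", matching_tags(pattern, tags))
# *crit -> ['crit', 'non-crit']
# non -> []
# non-crit -> ['non-crit']
```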

Filtering / Selecting Tests

There are also the options include and exclude, to include or exclude (surprise) tests with matching tags from execution.

A couple of tests with two different tags (as before):

*** Test Cases ***
Say Hello Critical
    [Tags]            crit
    Log To Console    Hello Critical Donkey
    No Operation
    Comment           Hello ${bob}

Say Hello Non-Critical
    [Tags]            non-crit
    Log To Console    Hello Nice Donkey
    No Operation
    Comment           Hello ${bob}

Run tests, include with wildcard:

from robot.running import TestSuiteBuilder
from robot.api import ResultWriter
from io import StringIO

suite = TestSuiteBuilder().build("./include.robot")
stdout = StringIO()
result = suite.run(include="*crit", stdout=stdout)
ResultWriter(result).write_results(report='report.html', log="log.html")
output = stdout.getvalue()

This includes both of the two tests defined above, since the tags match. If the filter was "non", nothing would match, and an error is produced because there are no tests to run.

Creating new Keywords from Existing Keywords

Besides using somebody else's keywords, custom keywords can be composed from existing keywords. Example test file:

*** Settings ***
Resource    simple_keywords.robot

*** Test Cases ***
Run A Google Search
    Search for      chrome    emoji wars
    Sleep           10s
    Close All Browsers

The included (by the Resource keyword above) file simple_keywords.robot:

*** Settings ***
Library  SeleniumLibrary

*** Keywords ***
Search for
    [Arguments]    ${browser_type}    ${search_string}
    Open Browser    https://www.google.com    ${browser_type}
    Press Keys      name:q    ${search_string}+ENTER

So the keyword is defined above in a separate file, with arguments defined using the [Arguments] notation. Followed by the argument names. Which are then referenced in following keywords, Open Browser and Press Keys, imported from SeleniumLibrary. Simple enough.

Selenium Basics on RF

Due to popularity of Selenium Webdriver and testing of web applications, there is a specific RF library with keywords built for it. This was installed way up in Installing section.

Basic example:

*** Settings ***
Library  SeleniumLibrary

*** Test Cases ***
Run A Google Search
    Open Browser    https://www.google.com    Chrome
    Press Keys      name:q    emoji wars+ENTER
    Sleep           10s
    Close All Browsers

Run it as always before:

from robot.running import TestSuiteBuilder
import robot

suite = TestSuiteBuilder().build("./google_search.robot")
result = suite.run()

This should open up the Chrome browser, load Google on it, do a basic search, and close the browser windows. Assuming it finds the Chrome driver also listed in the Installing section.

Creating New Keywords in Python

Besides building keywords as composites of existing ones, building new ones with Python code is an option.

Example test file:

*** Settings ***
Library    google_search_lib.py    chrome

*** Test Cases ***
Run A Google Search
    Search for      emoji wars
    Sleep           10s

The above references google_search_lib.py, where the implementation is:

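from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class google_search_lib(object):
    # class-level driver, shared by every instance of the library
    driver = None

    @classmethod
    def get_driver(cls, browser):
        if cls.driver is not None:
            return cls.driver
        if (browser.lower()) == "chrome":
            # Selenium 3 style; newer Selenium 4 releases take a Service object instead of a path
            cls.driver = webdriver.Chrome("../chromedriver")
        return cls.driver

    def __init__(self, browser):
        driver = google_search_lib.get_driver(browser)
        self.driver = driver
        self.wait = WebDriverWait(driver, 10)

    def search_for(self, term):
        # assumed completion: load Google, wait for the search box, type the term, press Enter
        self.driver.get("https://www.google.com")
        search_box = self.wait.until(EC.presence_of_element_located((By.NAME, "q")))
        search_box.send_keys(term)
        search_box.send_keys(Keys.RETURN)

    def close(self):
        # assumed completion: quit the browser and forget the shared driver
        self.driver.quit()
        google_search_lib.driver = None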
Defining the library import names is a bit tricky. If it is the same in both cases (module + class) just one is needed.

Again, running it as before:

from robot.running import TestSuiteBuilder

suite = TestSuiteBuilder().build("./google_search.robot")
result = suite.run()

If you think about this for a moment, there is some strange magic here. Why is the classmethod there? How is state managed within tests / suites? I borrowed the initial code for this example from this fine tutorial. It does not discuss the use of this decorator, but it seems to me that it is used to share the driver object during test execution.
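My reading of the sharing part can be shown without Selenium at all. In this sketch a plain dict stands in for the WebDriver, and the class name SharedResourceLib is made up for illustration. The class-level attribute is created once and then handed to every instance.

```python
class SharedResourceLib:
    # Class-level slot: one value shared by all instances.
    resource = None

    @classmethod
    def get_resource(cls, kind):
        if cls.resource is None:
            cls.resource = {"kind": kind}  # placeholder for e.g. a WebDriver
        return cls.resource

    def __init__(self, kind):
        self.resource = SharedResourceLib.get_resource(kind)


first = SharedResourceLib("chrome")
second = SharedResourceLib("chrome")
print(first.resource is second.resource)  # True
```

So even if RF creates the library object more than once, each instance would end up with the same underlying driver.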

Mapping Python Functions to Keywords

The mapping works simply by taking the function name and replacing underscores with spaces. So in the above example, the Search For keyword maps to the search_for() function, and the Close keyword maps to the close() function. Much complex, eh?
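A rough sketch of that mapping in Python, assuming the matching ignores case, spaces, and underscores. This is my approximation of the idea, not RF's real implementation, and normalize, find_keyword, and DemoLibrary are made-up names.

```python
def normalize(name):
    """Reduce a keyword or function name to a comparable form."""
    return name.replace(" ", "").replace("_", "").lower()


def find_keyword(library, keyword_name):
    """Return the public method of `library` matching the keyword name, or None."""
    wanted = normalize(keyword_name)
    for attr in dir(library):
        if attr.startswith("_"):
            continue  # private names are not treated as keywords here
        candidate = getattr(library, attr)
        if callable(candidate) and normalize(attr) == wanted:
            return candidate
    return None


class DemoLibrary:
    def search_for(self, term):
        return f"searching for {term}"

    def close(self):
        return "closed"


lib = DemoLibrary()
print(find_keyword(lib, "Search For")("emoji wars"))  # searching for emoji wars
print(find_keyword(lib, "Close")())  # closed
```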

Test Setup and Teardown

Test setup and teardown are some basic functionality. This is supported in RF by specific keywords in the Settings section.

Example test file:

*** Settings ***
Library    google_search_lib.py    chrome
Test Setup      Log To Console    Starting a test...
Test Teardown   Close

*** Test Cases ***
Run A Google Search
    Search for      emoji wars
    Sleep           10s

The referenced file is the same as above. This includes defining the close function / keyword used in Test Teardown.

Run it as usual:

from robot.running import TestSuiteBuilder

suite = TestSuiteBuilder().build("./google_search.robot")
result = suite.run()

You can define only a single keyword each for setup and teardown. If more is needed, the RF docs suggest writing your own custom keyword, composing multiple actions as needed.

How the library class is defined and created is also affected by how the scope of the library is defined. It seems to get a bit tricky to manage the resources, since sometimes the instances are different in the setup, teardown, tests, or in all tests. I think this is one of the reasons for using the classmethod decorator in the tutorial example I cited.
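As far as I can tell from the user guide, the scope is set with a class attribute named ROBOT_LIBRARY_SCOPE. A minimal sketch with a made-up CounterLibrary; the scope values in the comment are the ones I know of, so check the docs for your RF version.

```python
class CounterLibrary:
    # RF reads this attribute to decide how long one library instance lives:
    # "GLOBAL" = one instance for the whole run, "TEST SUITE" = one per suite,
    # "TEST CASE" = a new instance for every test (the default).
    ROBOT_LIBRARY_SCOPE = "GLOBAL"

    def __init__(self):
        self.count = 0

    def increment(self):
        """Usable as the keyword Increment; state survives as long as the instance does."""
        self.count += 1
        return self.count


lib = CounterLibrary()
lib.increment()
print(lib.increment())  # 2
```

With a global scope the instance state itself is shared across tests, which might remove the need for the class-level driver trick.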

There would be much more such as variables in tests. And RF also supports the BDD (Gherkin) syntax in addition to the keywords I showed here. But the underlying framework is quite the same in both cases.

Anyway, that’s all I am writing on today. I find RF is quite straightforward once you get the idea, and not too complex to use even with the docs not being so straightforward. Overall, a very simple concept, and I guess one that the author(s) have managed to build a reasonable community around. Which I guess is what makes it useful and potentially successful.

I personally prefer writing software over putting keywords after one another, but for writing tests I guess this is one useful method. And maybe there is an art in itself to writing good, suitably abstracted, reusable yet concrete keywords?

That’s all, folks,…

A Look into AWS Elastic Container Service


Recently, I got myself the AWS Certified Solutions Architect Associate certificate. To start getting more familiar with it, I did the excellent Cloud Guru class on the certificate preparation on Udemy. One part that was completely missing from that preparatory course was ECS. Yet questions related to it came up in the exam. Questions on ECS and Fargate, among others. I thought maybe Fargate is something from Star Trek. Enter the Q continuum? But no.

Later on, I also went through the Backspace preparatory course on Udemy, which briefly touches on ECS, but does not really give any in-depth understanding. Maybe the certificate does not require it, but I wanted to learn it to understand the practical options on working with AWS. So I went on to explore.. and here it is.

Elastic Container Service (ECS) is an AWS service for hosting and running Docker images and containers.

ECS Architecture

The following image illustrates the high-level architecture, components, and their relations in ECS (as I see it):

ECS High-Level Architecture

The main components in this:

  • Elastic Container Service (ECS): The overarching service name that is composed of the other (following) elements.
  • Elastic Container Registry (ECR): basically handles the role of private Docker Hub. Hosts Docker images (=templates for what a container runs).
  • Docker Hub: The general Docker Hub on the internet. You can of course use standard Docker images and templates on AWS ECS as well.
  • Docker/task runner: The hosts running the Docker containers. Fargate or EC2 runner.
  • Docker image builder: Docker images are built from specifications given in a Dockerfile. The images can then be run in a Docker container. So if you want to use your own images, you need to build them first, using either AWS EC2 instances, or your own computers. Upload the built images to ECR or Docker Hub. I call the machine used to do the build here "Docker Image Builder" even if it is not an official term.
  • Event Sources: Triggers to start some task running in an ECS Docker container. ELB, CloudWatch and S3 are just some examples here, I have not gone too deep into all the possibilities.
  • Elastic Load Balancer (ELB): To route incoming traffic to different container instances/tasks in your ECS configuration. So while ELB can "start tasks", it can also direct traffic to running tasks.
  • Scheduled tasks: Besides CloudWatch events, ECS tasks may be manually started or scheduled over time.

Above is of course a simplified description. But it should capture the high level idea.

Fargate: Serverless ECS

Fargate is the "serverless" ECS version. This just means the Docker containers are deployed on hosts fully managed by AWS. It reduces the maintenance overhead on the developer/AWS customer as the EC2 management for the containers is automated. The main difference being that there is no need to define the exact EC2 (host) instance types to run the container(s). This seems like a simply positive thing for me. Otherwise I would need to try to calculate my task resource definitions vs allocated containers, etc. So without Fargate, I need to manage the allocated vs required resources for the Docker containers manually. Seems complicated.

Elastic Container Registry / ECR

ECR is the AWS integrated, hosted, and managed container registry for Docker images. You build your images, upload them to ECR, and these are then available to ECS. Of course, you can also use Docker Hub or any other Docker registry (that you can connect to), but if you run your service on AWS and want to use private container images, ECR just makes sense.

When a new Docker container is needed to perform a task, the AWS ECS infrastructure can then pull the associated "container images" from this registry and deploy them in ECS host instances. The hosts being EC2 instances with the ECS-agent running. The EC2 instances managed either by you (EC2 ECS host type) or by AWS (Fargate).

Since hosting custom images with own code likely includes some IPR you don’t want to share with everyone, ECR is encrypted, as well as all communication with it. There are also ECS VPC Endpoints available to further secure the access and to reduce the communication latencies, removing public Internet roundtrips, with the ECR.

As for availability and reliability, I did not directly find good comments on this, except that the container images and ECR instances are region-specific. While AWS advertises ECR as reliable and scalable and all that, I guess this means they must simply be replicated within the region.

Besides being region-specific, there are also some limitations on the ECR service. But these are in the order of max 10000 repositories per region, each with max of 10000 images. And up to 20 docker pull type requests per second, bursting up to 200 per second. With some proper architecting, I don’t see myself going over those limits, pretty much ever. But I am not running Netflix on it, so maybe someone else has it bigger.

ECS Docker Hosting Components

The following image, inspired by a Medium post (thanks!), illustrates the actual Docker related components in ECS:

ECS Docker Components

  • Cluster: A group of ECS container instances (for EC2 mode), or a "logical grouping of tasks" (Fargate).
  • Container instance: An EC2 instance running the ECS-agent (a Go program, similar to Docker daemon/agent).
  • Service: This defines what your Docker tasks are supposed to do. It defines the configuration, such as the Task Definition to run the service, the number of task instances to create from the definition, and the scheduling policy. I see this as a service per task, but defining also how multiple instances of the tasks work together to implement a "service", and their related overall configuration.
  • Task Definition: Defines the docker image, resources (CPU, memory), instance type (micro, nano, macro, …), IAM roles, image boot command, …
  • Task Instance: An instantiation of a task definition. Like docker run on your own host, but for the ECS.
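To make the Task Definition a bit more concrete, here is a sketch of one as a Python dict, printed as the JSON you would register. The field names follow the task definition format as I understand it; all values (names, account id, region, image, role) are hypothetical placeholders.

```python
import json

# Hypothetical Fargate task definition: one container, small CPU/memory allocation.
task_definition = {
    "family": "hello-task",
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",
    "cpu": "256",      # CPU units for the whole task
    "memory": "512",   # MiB for the whole task
    "executionRoleArn": "arn:aws:iam::111122223333:role/ecsTaskExecutionRole",
    "containerDefinitions": [
        {
            "name": "hello",
            "image": "111122223333.dkr.ecr.eu-west-1.amazonaws.com/hello:latest",
            "command": ["python", "hello.py"],  # the image boot command
            "essential": True,
        }
    ],
}

print(json.dumps(task_definition, indent=2))
```

A Service would then reference this definition by its family name and say how many task instances to keep running.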

Elastic Load Balancer / ELB with ECS

The basic function of a load balancer is to spread the load for an ECS service across its multiple tasks running on different host instances. Similar to "traditional" EC2 scaling based on monitored ELB target health and status metrics, scaling on ECS can also be triggered. Simply based on ECS tasks vs pure EC2 instances in a traditional setting.

As noted higher above, an Elastic Load Balancer (ELB) can be used to manage the "dynamics" of the containers coming and going. Unlike in a traditional AWS load balancer setting, with ECS, I do not register the containers to the ELB as targets myself. Instead, the ECS system registers the deployed containers as targets to the ELB target group as the container instances are created. The following image illustrates the process:

ELB with ECS

The following points illustrate this process:

  • ELB performs healthchecks on the containers with a given configuration (e.g., HTTP request on a path). If the health check fails (e.g., HTTP server does not respond), it terminates the associated ECS task and starts another one (according to defined ECS scaling policy)
  • Additionally there are also ECS internal healthchecks for similar purposes, but configured directly on the (ECS) containers.
  • Metrics such as CloudWatch monitoring ECS service/task CPU loads can be used to trigger autoscaling, to deploy new tasks for a service (up-scaling) or remove excess tasks (down-scaling).
  • As requests come in, they are forwarded to the associated ECS tasks, and the set of tasks may be scaled according to the defined service scaling policy.
  • When a new task / container instance is spawned, it registers itself to the ELB target group. The ELB configuration is given in the service definition to enable this.
  • Additionally, there can be other tasks not associated to the ELB, such as scheduled tasks, constantly running tasks, tasks triggered by Cloudwatch events or other sources (e.g., your own code on AWS), …
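The register / health check / replace cycle from the points above can be sketched as a toy model. This is only my mental picture in code, nothing to do with the real AWS APIs, and the class ToyService is made up.

```python
import itertools


class ToyService:
    """Toy model of an ECS service keeping a desired number of tasks in a target group."""

    def __init__(self, desired_count):
        self.desired_count = desired_count
        self.targets = {}  # task id -> healthy flag
        self._ids = itertools.count(1)
        self.reconcile()

    def fail_health_check(self, task_id):
        self.targets[task_id] = False

    def reconcile(self):
        # Unhealthy tasks are stopped and dropped from the target group.
        self.targets = {tid: ok for tid, ok in self.targets.items() if ok}
        # New tasks are started and registered until the desired count is reached.
        while len(self.targets) < self.desired_count:
            self.targets[f"task-{next(self._ids)}"] = True


service = ToyService(desired_count=2)
service.fail_health_check("task-1")
service.reconcile()
print(sorted(service.targets))  # ['task-2', 'task-3']
```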

Few points that are still unclear for me:

  • An ELB can be set to either instance or IP target type. I experimented with simple configurations but had the instance type set. Yet the documentation states that with awsvpc network type I should use IP based ELB configuration. But it still seemed to work when I used instance-type. Perhaps I would see more effect with larger configurations..
  • How the ECS tasks, container instances, and ELBs actually relate to each other. Does the ELB actually monitor the tasks or the container instances? Does the ELB instance vs IP type impact this? Should it monitor tasks but I set it to monitor instances, and it worked simply because I was just running a single task on a single instance? No idea..

Security Groups

As with the other services, such as Lambda in my previous post, to be able to route the traffic from the ELB to the actual Docker containers running your code, the security groups need to be configured to allow this. This would look something like this:


Here, the ELB is allowed to accept connections from the internet, and to make connections to the security group for the containers. The security groups are:

  • SG1: Assigned to the ELB. Allows traffic in from the internet. Because it is a security group (not a network access control list), traffic is also allowed back out if allowed in.
  • SG2: Assigned to the ECS EC2 instances. Allows traffic in from SG1. And back out, as usual..
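The chaining of the two groups can be written down as data plus a small check. A toy sketch only: port 80 and the rule layout are my assumptions, and this is not how AWS represents security groups internally.

```python
# Inbound rules per security group; a source is either a CIDR block or another group.
SECURITY_GROUPS = {
    "SG1": {"ingress": [{"source": "0.0.0.0/0", "port": 80}]},  # ELB: open to the internet
    "SG2": {"ingress": [{"source": "SG1", "port": 80}]},        # ECS instances: only from SG1
}


def inbound_allowed(target_sg, source, port):
    """Check whether traffic from `source` may enter `target_sg` on `port`."""
    for rule in SECURITY_GROUPS[target_sg]["ingress"]:
        if rule["port"] == port and rule["source"] in ("0.0.0.0/0", source):
            return True
    return False


print(inbound_allowed("SG1", "internet", 80))  # True
print(inbound_allowed("SG2", "internet", 80))  # False
print(inbound_allowed("SG2", "SG1", 80))       # True
```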

Final Thoughts

I found ECS to be reasonably simple, and providing useful services to simplify management of Docker images and containers. However, Lambda functions seem a lot simpler still, and I would generally use those (as the trend seems to be..). Still, I guess there are plenty of use cases for ECS as well: for those with investments in, or otherwise preferences for, containers, and for longer running tasks, or tasks otherwise less suited for short invocation Lambdas.

As the AWS ECS-agent is just an open-source program written in Golang and hosted on Github, it seems to me that it should be possible to host ECS agents anywhere I like. Just as long as they could connect to the ECS services. How well that would work from outside the core AWS infrastructure, I am not sure. But why not? Have not tried it, but perhaps..

Looking at ECS and Docker in general, Lambda functions seem like a clear evolution path from this. Docker images are composed from Dockerfiles, which build the images from stacked layers, which are sort of build commands. The final layer being the built "product" of those layered commands. Each layer in a Dockerfile builds on top of the previous layer.

Lambda functions similarly have a feature called Lambda Layers, which can be used to provide a base stack for the Lambda function to execute on. They seem a bit different in defining sets of underlying libraries and how those stack on top of each other. But the core concept seems very similar to me. Finally, the core of a Lambda function is the function that executes when a triggering event occurs, similar to the docker run command in ECS. Much similar, such function.

The main difference for Lambda vs ECS perhaps seems to be in requiring even less infrastructure management from me when using Lambda (vs ECS). The use of Lambda was illustrated in my earlier post.

S3 Policies and Access Control

Learning about S3 policies.

AWS S3 bucket access can be controlled by S3 Bucket Policies and by IAM policies. If using combinations of both, there is an AWS blog post showing how the permissions get evaluated:

  • If there is an explicit deny in Bucket Policies or IAM Policies, access is denied. Even if another rule would allow it, a single deny trumps any allows that would also match.
  • If there is an explicit allow in Bucket Policies or IAM Policies, and no deny, access is allowed.
  • If there is no explicit rule matching the request/requested, the access is denied.
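The three rules above fit in a few lines of Python. A minimal sketch: it takes the effects of all statements that matched a request (from bucket and IAM policies together) and returns the decision. The function name evaluate and the result strings are mine.

```python
def evaluate(effects):
    """Combine the effects of all matching statements into one decision."""
    effects = list(effects)
    if "Deny" in effects:
        return "explicit deny"   # a single deny trumps any allows
    if "Allow" in effects:
        return "allow"
    return "implicit deny"       # nothing matched: denied by default


print(evaluate(["Allow", "Deny"]))  # explicit deny
print(evaluate(["Allow"]))          # allow
print(evaluate([]))                 # implicit deny
```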

An example JSON bucket policy is given as:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": ["arn:aws:iam::111122223333:user/Alice",
                "arn:aws:iam::111122223333:root"]
      },
      "Action": "s3:*",
      "Resource": ["arn:aws:s3:::my_bucket", "arn:aws:s3:::my_bucket/*"]
    }
  ]
}

Policy Elements

The elements of the policy language, as used in above example:

  • Version is the date when a version was defined:

    • "2012-10-17" is practically version 2 of the policy language.
    • "2008-10-17" is version 1, and the default if no version is defined.
    • The newer version allows use of policy variables and likely some other minor features.
    • AWS recommends to always use the latest version definition.
  • Statement is cunningly named, since (as far as I understand), there can only be one such element in the policy. But it contains "multiple statements", each defining some permission rule.

  • Effect: Deny or Allow

  • Principal: Who is affected by this statement.

  • NotPrincipal: Can be used to define who should NOT be affected by this rule. For example, apply to all but the root account by using this to exclude root.

  • Action: The set of operations (actions) that this affects. For example "s3:*" means the rule should affect (deny/allow) all S3 operations.

  • NotAction: Opposite of Action, so "s3:*" here would mean the rule applies to all actions except those related to S3.

  • Resource: The resources the rule should apply to. Here it is the given bucket and all files inside the bucket.

  • NotResource: The resources the rules should not apply to. Giving a specific bucket would mean the rule should apply to all buckets except that one. Or perhaps all resources (including S3 but anything else AWS considers a resource)..?
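
To make the element list a bit more concrete, here is a minimal Python sketch that assembles a policy from these elements and serializes it to JSON. The account ID, user name and bucket name come from the example above. The helper name `make_bucket_policy` and the second resource entry (`my_bucket/*`, standing for the files inside the bucket) are assumptions for illustration, not something taken from the AWS docs.

```python
import json


def make_bucket_policy(principal_arns, actions, resource_arns, effect="Allow"):
    """Build a bucket policy dict with a single statement (hypothetical helper)."""
    if effect not in ("Allow", "Deny"):
        raise ValueError("Effect must be 'Allow' or 'Deny'")
    return {
        "Version": "2012-10-17",  # the newer policy language version
        "Statement": [            # one Statement element, holding a list of rules
            {
                "Effect": effect,
                "Principal": {"AWS": principal_arns},
                "Action": actions,
                "Resource": resource_arns,
            }
        ],
    }


policy = make_bucket_policy(
    principal_arns=["arn:aws:iam::111122223333:user/Alice"],
    actions="s3:*",
    # Assumption: the bucket itself plus every object inside it.
    resource_arns=["arn:aws:s3:::my_bucket", "arn:aws:s3:::my_bucket/*"],
)
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Building the policy as a dict and dumping it with `json.dumps` at least guarantees that the brackets and commas are valid before pasting it into the console.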


The Principal can be defined as a single user, a wildcard, or a list of users. Examples:


      "Principal": {
        "AWS": ["arn:aws:iam::111122223333:user/Alice",

In the above, user/Alice simply refers to an IAM user named Alice. This is the user name given in the IAM console when creating/editing the user. Root refers to the account root user.

Single user:

    "Principal": { "AWS": "arn:aws:iam::123456789012:root" }

The Principal docs also refer to something with a shorter notation:

    "Principal": { "AWS": "123456789012" }

I can confirm that the above style gives no errors in defining the policy, which always seems to be immediately verified by AWS. But I could not figure out what this would be used for exactly. I mean, I expected this to be a rule affecting every user for the account. Did not seem to be so for my account. I posted a question about it, and will get back to it if I get an answer..

Example Policies

Here follow a few example policies for practical illustration. Using three different users:

  • root user,
  • user "bob" with S3 admin role, and
  • "randomuser" with no S3 access defined.

These use an example account with ID 123456789876, and a bucket with name "policy-testing-bucket":

First misconception

I was thinking this one was needed to give Bob and the root account access to the bucket and all files inside it. Including all the S3 operations:

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Principal": {
                "AWS": [
            "Action": "s3:*",
            "Resource": [

The above seemed to work in allowing bob access. But not really.

Only allow root by bucket policy

The following policy should stop Bob from accessing the bucket and files, since by default all access should be denied, and this one does not explicitly allow Bob. And yet Bob can access all files.. More on that follows. The policy in question, that does not explicitly allow Bob:

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Principal": {
                "AWS": [
            "Action": "s3:*",
            "Resource": [

Denying Bob

Since the above did not stop Bob from accessing the files, how about the following to explicitly deny him:

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Principal": {
                "AWS": [
            "Action": "s3:*",
            "Resource": [
            "Effect": "Deny",
            "Principal": {
                "AWS": [
            "Action": "s3:*",
            "Resource": [

The above works to stop Bob from accessing the bucket and files, as a Deny rule always trumps an Allow rule, regardless of their ordering.

IAM roles and their interplay with bucket policies

I realized the above is all because I put Bob in a "developer" group I had created, and this group has a general policy as:

  "Version": "2012-10-17",
  "Statement": [
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"

This allows anyone in this group (including Bob) to access any resources with any actions. Including resources in S3, and all actions in S3. So the explicit Allow rule higher above is not needed, Bob already has access via his IAM group policy. This is why his access works even if I remove him from the bucket policy. And if an explicit deny is defined, as also higher above, then the Deny trumps the IAM role Allow and blocks the access.
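
The interplay can be sketched in a few lines of Python. This is only a toy model of the evaluation rules described earlier (explicit deny wins, otherwise an explicit allow is needed, otherwise denied), not how AWS implements it. The function names, the object name `somefile.txt`, and the bucket policy resources (bucket plus all objects in it) are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

ACCOUNT = "123456789876"
BUCKET = "arn:aws:s3:::policy-testing-bucket"
ROOT = f"arn:aws:iam::{ACCOUNT}:root"
BOB = f"arn:aws:iam::{ACCOUNT}:user/bob"
RANDOM_USER = f"arn:aws:iam::{ACCOUNT}:user/arandomuser"

# IAM side: Bob inherits the "developer" group policy, RandomUser has nothing.
IAM_STATEMENTS = {
    BOB: [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
    RANDOM_USER: [],
}


def as_list(value):
    return value if isinstance(value, list) else [value]


def statement_matches(statement, principal, action, resource):
    # Bucket policy statements name their principals; IAM statements do not.
    if "Principal" in statement and principal not in as_list(statement["Principal"]["AWS"]):
        return False
    action_ok = any(fnmatchcase(action, a) for a in as_list(statement["Action"]))
    resource_ok = any(fnmatchcase(resource, r) for r in as_list(statement["Resource"]))
    return action_ok and resource_ok


def is_allowed(principal, action, resource, bucket_policy):
    statements = IAM_STATEMENTS.get(principal, []) + bucket_policy
    effects = {s["Effect"] for s in statements
               if statement_matches(s, principal, action, resource)}
    if "Deny" in effects:
        return False           # an explicit deny always wins
    return "Allow" in effects  # otherwise an explicit allow is required


def bucket_statement(effect, principals):
    # Assumed resources: the bucket itself and all objects in it.
    return {"Effect": effect, "Principal": {"AWS": principals},
            "Action": "s3:*", "Resource": [BUCKET, BUCKET + "/*"]}


only_root = [bucket_statement("Allow", [ROOT])]
root_and_deny_bob = only_root + [bucket_statement("Deny", [BOB])]
some_file = BUCKET + "/somefile.txt"  # hypothetical object

print(is_allowed(BOB, "s3:GetObject", some_file, only_root))          # allowed via the IAM group
print(is_allowed(BOB, "s3:GetObject", some_file, root_and_deny_bob))  # blocked by the explicit deny
print(is_allowed(RANDOM_USER, "s3:GetObject", some_file, only_root))  # nothing allows it
```

The point of merging the IAM statements and the bucket policy statements into one list is that, within a single account, an allow from either side is enough, while a deny from either side blocks the request.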

User with no IAM permissions

Allow listing bucket content and opening files

Since Bob was in the developer group and thus has access granted via his IAM role, I also tried with RandomUser, who has no IAM role, and should be directly impacted by the S3 policy. This policy gives RandomUser access to list bucket contents and open files, but not to see the bucket list, or the bucket in that list:

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789876:user/arandomuser"
            "Action": "s3:*",
            "Resource": [

Block filelist view, allow file access

This stops RandomUser from seeing the filelist for the bucket but allows to open specific files in it:

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789876:user/arandomuser"
            "Action": "s3:*",
            "Resource": [

Allow filelist, block file opening

This lets RandomUser see the bucket filelist but prevents opening files in it:

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789876:user/arandomuser"
            "Action": "s3:*",
            "Resource": [

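The three listings above show the same Action line, so the difference between them has to be in the Resource entries. A minimal sketch of how that typically plays out, assuming that listing is checked against the bucket ARN and opening a file against the object ARN (bucket name followed by `/` and the key). The resource lists below are an assumption about what such policies contain, and `chat.csv` is a made-up file name.

```python
from fnmatch import fnmatchcase

BUCKET = "arn:aws:s3:::policy-testing-bucket"


def target_arn(action, key=None):
    """ARN a request is checked against: the bucket for listing, the object for reading."""
    if action == "s3:ListBucket":
        return BUCKET
    if action == "s3:GetObject":
        return f"{BUCKET}/{key}"
    raise ValueError(f"action not covered by this sketch: {action}")


def allowed(resources, action, key=None):
    """Check an Allow statement with Action "s3:*" and the given Resource list."""
    arn = target_arn(action, key)
    return any(fnmatchcase(arn, pattern) for pattern in resources)


list_and_open = [BUCKET, BUCKET + "/*"]  # bucket and all objects
open_only = [BUCKET + "/*"]              # objects only
list_only = [BUCKET]                     # bucket only

for name, resources in [("list and open", list_and_open),
                        ("open only", open_only),
                        ("list only", list_only)]:
    print(name,
          allowed(resources, "s3:ListBucket"),
          allowed(resources, "s3:GetObject", "chat.csv"))
```
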
If not allowed by IAM or Bucket Policy, denied by default

If RandomUser is not allowed explicitly in bucket policies, he gets access denied:

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789876:root"
            "Action": "s3:*",
            "Resource": [

Adventures in AWS: Scraping with Lambdas

Amazon Web Services (AWS) are increasingly popular in the IT business and job market. To learn more about them, I took one of the AWS certification courses from A Cloud-Guru on Udemy for 10 euros. Much expensive, but investment. There are no certification exam locations where I live but maybe some day I have a chance to try it. In the meantime, learning is a possibility regardless. The course was very nice, including some practice labs. I could probably pass the test with just that, but doing a small experiment/exercise of my own tends to make it a bit more clear and gives some concrete experience.

So I built myself a small "service" using AWS. It collects chat logs from a public internet chat-service (scraping), collects them into a database, and would provide some service based on a machine-learning model trained on the collected data. I created a Discord test server and tried scraping that to see it works, and requested a Twitter development account to try that as well later. In this post I describe the initial data collection service part, which is plenty enough for one post. Will see about the machine learning and API service on it later. Maybe an interactive Twitter bot or something.


The high-level architecture and the different AWS services I used is shown in the following figure:

high-level architecture

The components in this figure are:

  • VPC: The main container for providing a "virtual private cloud" for me to do my stuff inside AWS.
  • AZ: A VPC is hosted in a region (e.g., eu-central, also known as Frankfurt in AWS), which has multiple Availability Zones (AZ). These are "distinct locations engineered to be isolated from failures in other AZs". Think fire burning a data center down or something.
  • Subnets split a VPC into separate parts for more fine-grained control. Typically one subnet is in one AZ for better resiliency and resource distribution.
  • Private vs public subnet: A public subnet has an internet gateway defined so you can give it public IP addresses, access the internet from within it, and allow incoming connections. A private subnet has none of that.
  • RDS: MariaDB in this case. RDS is the Relational Database Service, a relational database provided by AWS as a service.
  • S3 Endpoint: Provides direct link from the subnet to S3. Otherwise S3 access would be routed through internet. S3 is Simple Storage Service, AWS file object store.
  • Internet gateway: provides a route to the internet. Otherwise nothing in the subnet can access the internet outside VPC.
  • EC2 instance: Plain virtual machine. I used it to access the RDS with MariaDB command line tools, from inside the VPC.
  • Lambda Functions: AWS "serverless" compute components. You upload the code to AWS, which deploys, runs, and scales it based on given trigger events as needed.
  • Scraper Lambda: Does the actual scraping. Runs outside the VPC to be able to access the internet. Inserts the scraped data into S3 as a single file object once per day (or defined interval).
  • Timestamp Lambda: Reads the timestamps of latest scraped comments per server and chat channel, so Scraper Lambda knows what to scrape.
  • DB Insert Lambda: Reads the scraper results from S3, inserts them into the RDS.
  • S3 chat logs: S3 bucket to store the scraped chat logs. As CSV files (objects in S3 terms).

In the above architecture I have the Scraper Lambda outside the VPC, and the other two Lambdas inside the VPC. A Lambda inside the VPC can be configured to have access to the resources within the VPC, such as the RDS database. But if I want to access the Internet from an in-VPC Lambda, I need to add a NAT-Gateway. A Lambda outside the VPC, such as the Scraper Lambda here, has access to the Internet, so it needs no specific configuration for that. But being outside the VPC, it does not have access to the in-VPC RDS, so it needs to communicate with the in-VPC Lambda functions for that.

The dashed arrows simply show possible communications. The private subnets have no route to the internet but can communicate with the other subnets. This can be further constrained by various security configurations that I look at later.


Another option would be to use a NAT-GW (NAT-Gateway) to put the Scraper Lambda also inside the VPC, as illustrated by this architecture:

NAT-GW architecture

A NAT-GW provides access to the internet from within a private subnet by using Network Address Translation (NAT). So it routes traffic from private subnets/private network interfaces through the Internet Gateway (via a public subnet). It does not provide a way to access the private subnet from the outside, but that would not be required here. This is illustrated by the internet connection arrows in the figure, where the private subnets would pass through the NAT gateway, which would pass the traffic out through the internet gateway. But there is no arrow in the other direction, as there is no way to provide a connection interface to the private subnets from the internet in this.
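
The routing difference between the subnet types can be sketched with plain route tables. This is a simplified model: the most specific matching route wins, and a subnet without a default route (0.0.0.0/0) has no path to the internet. The VPC CIDR 10.0.0.0/16, the target names, and the function name `pick_route` are hypothetical values for illustration.

```python
import ipaddress


def pick_route(route_table, destination_ip):
    """Return the target of the most specific matching route, or None if nothing matches."""
    destination = ipaddress.ip_address(destination_ip)
    matching = []
    for cidr, target in route_table.items():
        network = ipaddress.ip_network(cidr)
        if destination in network:
            matching.append((network.prefixlen, target))
    if not matching:
        return None
    return max(matching)[1]  # longest prefix wins


# Hypothetical VPC with CIDR 10.0.0.0/16.
public_subnet_routes = {"10.0.0.0/16": "local", "0.0.0.0/0": "internet-gateway"}
private_subnet_routes = {"10.0.0.0/16": "local"}
private_with_nat_routes = {"10.0.0.0/16": "local", "0.0.0.0/0": "nat-gateway"}

internet_host = "203.0.113.10"  # documentation-range address standing in for "the internet"
print(pick_route(public_subnet_routes, "10.0.1.25"))           # stays inside the VPC
print(pick_route(public_subnet_routes, internet_host))         # out via the internet gateway
print(pick_route(private_subnet_routes, internet_host))        # no route at all
print(pick_route(private_with_nat_routes, internet_host))      # out via the NAT gateway
```

In this picture the NAT-GW itself sits in a public subnet, so its own traffic follows the public route table out through the internet gateway.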

A NAT-GW here could both complicate and simplify things. With NAT-GW, I could combine the Timestamp Lambda with the Scraper Lambda, and just have the Scraper Lambda read the timestamps direct from the RDS itself. This is illustrated in the architecture diagram above.

In detail, there are two ways a Lambda can be invoked by another Lambda. Asynchronous and synchronous. In synchronous, the calling Lambda waits for the results of the call before proceeding. Asynchronous just starts another Lambda in parallel, without further connection. The Scraper->Timestamp is a synchronous call as Scraper requires the Timestamp information to proceed. This is the only use for Timestamp Lambda in this architecture. So, if possible, these could be combined.
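
A minimal sketch of the two invocation styles, written against the `invoke` call of the boto3 Lambda client, where `InvocationType` is `RequestResponse` for synchronous and `Event` for asynchronous calls. To keep it runnable without AWS access, a small fake client stands in for `boto3.client("lambda")`. The function names, payload fields, and the fake response content are hypothetical.

```python
import io
import json


def invoke_sync(lambda_client, function_name, payload):
    """Invoke and wait for the result (InvocationType 'RequestResponse')."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps(payload).encode("utf-8"),
    )
    return json.loads(response["Payload"].read())


def invoke_async(lambda_client, function_name, payload):
    """Fire and forget (InvocationType 'Event'); only the acceptance status comes back."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",
        Payload=json.dumps(payload).encode("utf-8"),
    )
    return response["StatusCode"]  # 202 when the event was accepted


class FakeLambdaClient:
    """Stand-in for boto3.client('lambda') so the sketch runs without AWS access."""

    def invoke(self, FunctionName, InvocationType, Payload):
        if InvocationType == "Event":
            return {"StatusCode": 202, "Payload": io.BytesIO(b"")}
        result = {"function": FunctionName, "latest": "2019-01-01T00:00:00"}
        return {"StatusCode": 200, "Payload": io.BytesIO(json.dumps(result).encode("utf-8"))}


client = FakeLambdaClient()  # with real AWS access: boto3.client("lambda")
timestamps = invoke_sync(client, "timestamp-lambda", {"server": "test"})
status = invoke_async(client, "db-insert-lambda", {"bucket": "chat-logs"})
print(timestamps, status)
```

The synchronous helper reads and decodes the response payload because the caller needs the result, while the asynchronous one only checks that the request was accepted.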

In this option, I would still keep the DB Insert Lambda separate, as it can run asynchronously on its own, reading the latest data from S3 without any direct link to anything else. In this way, I find the use of Lambdas can also help keep the independent functions separate. For me, this commonly leads to better software design.

However, a NAT-GW is a billable service, costs money and is not included in the free tier. OK, it is from about 1 euro per day up, depending on the bandwidth used. For a real company use-case, this would likely be rather negligible cost. But for poor me and my little experiments.. And this current architecture let me try some different configurations so, sort of win-win.

Service Endpoints

The following two figures illustrate the difference of data transfer when using the S3 endpoint vs not:

With S3 endpoint:

with S3 endpoint

Without the S3 endpoint:

without S3 endpoint

So how do these work? As far as I understand, both interface and gateway endpoints are based on routing and DNS tricks in the associated subnets. Again, getting into further details seems to get a bit complicated. For a gateway endpoint, such as the S3 endpoint, the endpoint must be added to the subnet route table to work. But what happens if you do not add it, but do have an internet connection? My guess is that it will still be possible to connect to S3, but you will be routed through the internet. How does AWS handle the DNS requests internally, and do you have visibility into the actual routes that are taken in real-time/during operation? I don’t know.

In any case, as long as using the DNS-style names to access the services, the AWS infrastructure should do the optimal routing via endpoints if available. For interface endpoints, the documents mention something called Private DNS, which seems to do a similar thing. Except it does not seem to use similar route table mappings as Gateway endpoints. I guess the approach for making use of endpoints when possible would be to use the service DNS-style names, and consistently review all route tables and other configs. As this seems like a possibly common and general problem, perhaps there are some nice tools for this but I don’t know.

It seems to me it makes a lot more sense to use such endpoint services to connect directly to the AWS services, since we are already running within AWS. In fact, it seems strange that the communication would otherwise by default take a detour through the internet (which I guess was the only option before the S3 endpoints existed).

Use of endpoints seems much more effective in terms of performance and bandwidth use, but also in terms of cost. Traffic routed through the internet gets billed separately in AWS, whereas gateway endpoint traffic stays within AWS and is thus not separately billed. Meaning, in my understanding, that the S3 endpoint is "free". So why would you ever not use it? No idea.

This is just the S3 endpoint. AWS has similar endpoints for most (if not all) its services. The S3 endpoint is a "gateway endpoint", and one other service that currently supports this is DynamoDB. Other services have what is called an "interface endpoint". Which seems to be a part of something called PrivateLink. Whatever that means.. These cost money, both for hourly use and bandwidth.

With a quick Internet search, I found the Cloudonaut page to be a bit more clear on the pricing. But I guess you never know with 3rd party sites if they are up to date to latest changes. Would be nice if Amazon would provide some nice and simple way to see pricing for all, now I find it a bit confusing to figure out.

Lambda Triggers

The AWS Lambda functions can be triggered from multiple sources. I have used three triggers here:

  • Scheduled time trigger from CloudWatch. Triggers the Scraper Lambda once a day.
  • Lambda triggering another Lambda synchronously. The Scraper Lambda invokes the Timestamp Lambda to define which days to scrape (since previous timestamp). Synchronous simply means waiting for the result before progressing.
  • Lambda triggering another lambda asynchronously. Once the Scraper Lambda finishes its scraping task, it invokes the DB Insert Lambda to check S3 for new data, and insert into the RDS DB.

It seems a bit challenging to find a concrete list of what are all the possible Lambda triggers. The best way I found is to start creating a Lambda, hit the "triggers" button to add a new trigger for the new Lambda, and then just scroll the list of options. Some main examples I noted:

  • API Gateway (API-GW) events.
  • AWS IoT: AWS Button events and custom events (whatever that means..). Never tried the AWS IoT stuff.
  • Application load balancer events
  • CloudWatch logs, events when new logs are received to a configured log group
  • CodeCommit: AWS provides some form of version control system support. This triggers events from Git actions (e.g., push, create branch, …).
  • Cognito: This is the AWS authentication service. This is a "sync" trigger, so I guess it gets triggered when authentication data is synced.
  • DynamoDB: DynamoDB is the AWS NoSQL database. Events can be triggered from database updates, in batches if desired. Again, I have not used it, just my interpretation of the documentation.
  • Kinesis: Kinesis is the AWS service for processing real-time timeseries type data. This seems to be able to trigger on the data stream updates, and data consumer updates.
  • S3: Events on create (includes update I guess) and delete of objects, events on restoring data from Glacier.
  • RRS object loss. RRS is reduced redundancy storage, with more likely chance that something is lost than on standard S3.
  • SNS: Triggers on events in the simple notification service (SNS).
  • SQS: Updates on an event stream in simple queue service (SQS). Can also be batched.
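
Whatever the trigger, the Lambda receives it as an event passed to its handler. A minimal sketch of telling some of the triggers apart inside one handler, based on the event fields these sources are known to set (`source` for scheduled events, `eventSource` or `EventSource` in the records of S3, SQS, DynamoDB, Kinesis and SNS events). The helper name `event_source` and the sample events are simplified assumptions, real events carry many more fields.

```python
def event_source(event):
    """Guess which trigger produced a Lambda event from well-known event fields."""
    if event.get("source") == "aws.events":
        return "scheduled"  # CloudWatch scheduled event
    records = event.get("Records") or []
    if records:
        first = records[0]
        # SNS spells the key "EventSource"; S3, SQS, DynamoDB and Kinesis use "eventSource".
        source = first.get("eventSource") or first.get("EventSource")
        if source:
            return source.split(":")[-1]  # "aws:s3" -> "s3"
    return "direct"  # e.g. a payload passed in by another Lambda


def lambda_handler(event, context):
    """Entry point with the usual (event, context) signature."""
    return {"trigger": event_source(event)}


print(lambda_handler({"source": "aws.events", "detail-type": "Scheduled Event"}, None))
print(lambda_handler({"Records": [{"eventSource": "aws:s3"}]}, None))
print(lambda_handler({"Records": [{"EventSource": "aws:sns"}]}, None))
print(lambda_handler({"server": "test"}, None))
```
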

That’s all interesting.

Security Groups and Service Policies

To make all my service instances connect, I need to define all my VPC network, service and security configurations, etc. A security group is a way to configure security attributes (AWS describes it as a "virtual firewall"). Up to 5 (five) security groups can be assigned to an instance. Each security group then defines a set of rules for the traffic it allows; anything not matched by an allow rule is blocked.

The following figure illustrates the security groups in this (my) experiment:

security groups

There are 3 security groups here:

  • SG1: The RDS group, allowing incoming connections to port 3306 from SG2. 3306 is the standard MariaDB port.
  • SG2: The RDS client group. This group can query the RDS Maria DB using SQL. Any regular MariaDB client works.
  • SG3: Public SSH access group. Instances in this group allow connections to port 22 from the internet.

This nicely illustrates the concept of the "group" in a security group. The 3 instances in SG2 all share the same rules, and are allowed to connect to the RDS instance. Or, more specifically, to instances in the SG1 group. If I add more instances, or if the IP addresses of these 3 instances change, as long as they are in the security group, the rules will match fine.

Similarly, if I add more RDS instances, I can put them in the same SG1 group, and they will share the same properties. The SG2 instances can connect to them. Finally, if I want to add more instances accessible over SSH, I just set them up and add them to the security group SG3. As shown by the EC2 instance in the figure above, a single instance can also be a member of multiple security groups at once.
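
The "group" idea can be modelled in a few lines. This is a toy sketch that only handles rules referring to another security group (as the SG1 rule does), not CIDR-based rules such as the public SSH one. The class and function names are hypothetical, and the group memberships are my reading of the figure description above.

```python
from dataclasses import dataclass, field


@dataclass
class SecurityGroup:
    name: str
    # Inbound rules as (port, source_group_name) pairs.
    inbound: list = field(default_factory=list)


@dataclass
class Instance:
    name: str
    groups: set


def can_connect(source, target, port, groups):
    """True if any group of `target` allows `port` from any group of `source`."""
    for group_name in target.groups:
        for rule_port, rule_source in groups[group_name].inbound:
            if rule_port == port and rule_source in source.groups:
                return True
    return False  # nothing allowed it, so the traffic is blocked


groups = {
    "SG1": SecurityGroup("SG1", inbound=[(3306, "SG2")]),  # RDS group
    "SG2": SecurityGroup("SG2"),                           # RDS client group
    "SG3": SecurityGroup("SG3"),                           # public SSH group (CIDR rule not modelled)
}
rds = Instance("rds", {"SG1"})
db_insert_lambda = Instance("db-insert-lambda", {"SG2"})
ec2 = Instance("ec2", {"SG2", "SG3"})  # member of two groups at once
outsider = Instance("outsider", set())

print(can_connect(db_insert_lambda, rds, 3306, groups))
print(can_connect(ec2, rds, 3306, groups))
print(can_connect(outsider, rds, 3306, groups))
```

Note that the rule never mentions an IP address. Adding another client instance to SG2 is enough to let it reach the database.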

This seems like a nice and flexible way to manage such connections and permissions in an "elastic" cloud, which I guess is the point.

Lambda Policies

There are also two instances in my architecture figure that are not in any security group as they are not in a VPC. To belong to a security group, an instance has to be inside a VPC. The Scraper Lambda and the S3 Chat Logs bucket are outside the VPC. The connection from inside the VPC to S3 I already described earlier in this post (S3 endpoints). For the Scraper Lambda, Lambda policies are defined.

In fact, all Lambdas have such access policies defined in relation to the services they need to access, including the ones inside the VPC. The in-VPC ones just need to have the associated VPC mechanisms (security groups) enabled as well, since they also fall inside the scope of the VPC. There are some default policies, such as execution permissions for the Lambda itself. But also on resources it needs to access.

These are the policies I used for each of the Lambdas here:

  • Scraper Lambda:

    • Lambda Invoke: Allows this Lambda to invoke the Timestamp and DB Insert Lambdas.
    • CloudWatch Logs: Every Lambda writes its logs to AWS CloudWatch.
    • S3 put objects: Allows this Lambda to write the scraping results to the S3 Chat Logs bucket.
    • S3 list objects: Just to check the bucket so it does not overwrite existing logs if somehow run multiple times per day.
  • DB Insert Lambda:

    • CloudWatch Logs: Logging as above
    • S3 List and Get Objects: For reading new log files created by the Scraper Lambda.
    • EC2 ENI interface create, list, delete: In-VPC Lambdas work by creating Elastic Network Interfaces within the VPC so they can communicate with other in-VPC (and ext-VPC) instances. This enables that.
  • Timestamp Lambda:

    • CloudWatch Logs: Logging as above
    • EC2 ENI interfaces for in-VPC Lambda, as above.

As these show, the permissions can be defined at very granular level or at a higher level. For example, full access to S3 and any bucket, or read access to specific files in a specific bucket. Or anything in between.
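
As an illustration of the granular end of that scale, here is a sketch of what a policy for the Scraper Lambda could look like, built as a Python dict and dumped to JSON. The action names are standard IAM actions for the permissions listed above. The region, account ID, function names and bucket name are hypothetical placeholders, not the ones I actually used.

```python
import json

REGION = "eu-central-1"      # assumed region
ACCOUNT = "123456789876"     # example account ID
BUCKET = "chat-logs-bucket"  # hypothetical bucket name


def function_arn(name):
    return f"arn:aws:lambda:{REGION}:{ACCOUNT}:function:{name}"


scraper_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Lambda Invoke: only the two functions the scraper calls
            "Effect": "Allow",
            "Action": "lambda:InvokeFunction",
            "Resource": [function_arn("timestamp-lambda"), function_arn("db-insert-lambda")],
        },
        {   # CloudWatch Logs
            "Effect": "Allow",
            "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": "*",
        },
        {   # S3 put objects: applies to the objects inside the bucket
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
        {   # S3 list objects: applies to the bucket itself
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
    ],
}

scraper_policy_json = json.dumps(scraper_policy, indent=2)
print(scraper_policy_json)
```

Unlike the bucket policies earlier, there is no Principal element here, since the policy is attached to the Lambda's role rather than to the resource.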

Backups and Data Retention

One thing with databases is always backups. With AWS RDS, there are a few options. One is the standard backups offered by Amazon. Your RDS gets snapshotted to S3 daily. How this actually works sounds very simple but gets a bit complicated if you really try to understand it. What doesn’t…

So the documentation says "The first snapshot of a DB instance contains the data for the full DB instance" and further "Subsequent snapshots of the same DB instance are incremental, which means that only the data that has changed after your most recent snapshot is saved.".

Sounds great, doesn’t it? But think about it for a minute. You can set your backup retention period to be between 0 and 35 days (currently anyway). Default being 7. Now imagine you live all the way up to the 8th day when your first day backup expires. Consider that only day 0 was a full backup snapshot, and day 0 expires. Does everything else then build incremental snapshots on top of a missing baseline after day 0 expires?

Luckily, as usual, someone thought about this already and StackOverflow comes to the rescue. An RDS "instance" as referred to in the AWS documentation must be referencing the EC2 style VM instance that hosts the RDS. So the backup is not just data but the whole instance. And the instance is stored on Elastic Block Storage (EBS). I interpret this to mean you are not really backing up the database but the EBS volume where the whole RDS system is on. And then you can go read up on how AWS manages the EBS backups. Which mostly confirms the StackOverflow post.

Regarding costs, if it is an "instance snapshot", does the whole instance size count towards the cost? I guess not, as you get the same size of "free" backup storage as you allocate for your RDS. In the free tier you get up to 20GB of RDS storage included, and by definition also up to 20GB of free RDS backup snapshot space. The free backup size always matches the storage size. If this included the whole instance in the calculation, the operating system and database software would likely take many GB already. But what do I know. As for where the snapshots are stored, can I go have a look? Again, StackOverflow to the rescue. It is in an S3 bucket, but that bucket is not in my control and I have no way to see in it.

In any case, you get your RDS size worth of backup space included, whatever the size here means (instance vs data). And if you use the default 7 days period, it means you will have to fit all the 7 incremental snapshots in that space if you do not wish to pay extra. The snapshots are stored as "blocks", and when an old (or the origin) snapshot is deleted, only the blocks not referenced by any remaining snapshot are deleted. So expiring day 0 does not cause the incremental delta snapshots to break. It just deletes the blocks no longer referenced after the expired date.
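
A toy model of this block referencing, as I understand it: every snapshot references all the blocks it needs, but only blocks that are not stored yet get uploaded, and deleting a snapshot frees only blocks that no remaining snapshot references. The function names and block IDs are made up for illustration, and this says nothing about how AWS implements it internally.

```python
def take_snapshot(snapshots, stored_blocks, name, volume_blocks):
    """Store only blocks not stored yet, but let the snapshot reference all its blocks."""
    new_blocks = set(volume_blocks) - stored_blocks
    stored_blocks.update(new_blocks)
    snapshots[name] = set(volume_blocks)
    return new_blocks


def delete_snapshot(snapshots, stored_blocks, name):
    """Remove a snapshot and free only the blocks no remaining snapshot references."""
    removed = snapshots.pop(name)
    still_referenced = set().union(*snapshots.values())
    freed = removed - still_referenced
    stored_blocks.difference_update(freed)
    return freed


snapshots, stored_blocks = {}, set()

# Day 0: full snapshot of three blocks. Day 1: only block "b" has changed.
uploaded_day0 = take_snapshot(snapshots, stored_blocks, "day0", {"a1", "b1", "c1"})
uploaded_day1 = take_snapshot(snapshots, stored_blocks, "day1", {"a1", "b2", "c1"})

# Expiring day 0 frees only the old version of block "b".
freed = delete_snapshot(snapshots, stored_blocks, "day0")
print(sorted(uploaded_day1), sorted(freed), snapshots["day1"] <= stored_blocks)
```

After day 0 expires, day 1 still has every block it references, which is why the missing "baseline" is not a problem.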

Still, there is more. The AWS documentation on backup restoration mentions you can do point-in-time restoration typically up to 5 minutes before current time. If automated snapshots are only taken once per day, what is this based on? It is because AWS also uploads the RDS transaction logs to S3 every 5 minutes.

Beyond regular backups, there is the multi-AZ deployment of RDS, and read replicas. Both of those links contain a nice comparison table between the two. The multi-AZ is mainly for disaster recovery (DR), doing automated failover to another availability zone in case of issues. A read replica allows scaling reads via multiple asynchronously synced copies that can be deployed across regions and availability zones. A read-replica can also be manually promoted to master status. It all seems to get complicated, deeper than I wanted to go here. I guess you need an expert on all this to really understand what gets copied where, when, and how secure the failover is, or how secure the data is from loss in all cases.

After this hundred lines of digressing from my experiment, what did I actually use? Well I used a single-AZ deployment of my RDS and disabled all backups completely. Nuts, eh? Not really, considering that my architecture is built to collect all the scraped data into S3 buckets, and from there inserted into RDS by another Lambda function. So all the data is actually already backed up in S3, in form of the scraped files. Just re-run the import from the beginning for all the scraped files if needed (to rebuild the RDS).

Given how the RDS backups and my implementation all depend so heavily on S3, it seems very relevant to understand the storage and replication, reliability, etc. of S3.

S3 Reliability and Lifecycle Costs

The expected durability for S3 objects is given as 99.999999999%. AWS claims about 1 in 10 million objects would be lost during 10000 years. Not sure how you might test this. However, this is defined to hold for all the S3 tiers, which are:

  • standard: low latency and high throughput, no minimum storage time or size, survives AZ destruction, 99.99% availability
  • standard infrequent access (IA): Same throughput and latency as standard, lower storage cost but higher retrieval cost. 99.9% availability target, 30 days minimum billing.
  • one zone IA: same as standard IA, but only in one AZ, slightly cheaper storage, same retrieval cost, 99(.5)% availability, 30 days minimum billing.
  • intelligent tiering: for data with changing access patterns. Automatically moves data to IA tier when not accessed for 30 days, and back to standard when accessed. 99.9% availability target, 30 days minimum billing.
  • glacier: very low storage price, higher retrieval prices (tiered on how fast you want it), 99.99% availability, 90 days minimum.
  • glacier deep archive: like glacier but slower and cheaper, 99.9% availability target, 180 days minimum.
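
The durability figure and the "1 in 10 million objects in 10000 years" claim are the same number seen from two sides. A small worked calculation, assuming the 99.999999999% is meant as a per-object durability over one year, and using `Decimal` to avoid floating point noise with that many nines:

```python
from decimal import Decimal

annual_durability = Decimal("0.99999999999")  # eleven nines, per object per year (assumed)
objects_stored = 10_000_000

annual_loss_probability = Decimal(1) - annual_durability
expected_losses_per_year = annual_loss_probability * objects_stored
years_per_lost_object = Decimal(1) / expected_losses_per_year

print(annual_loss_probability)     # chance of losing one given object in a year
print(expected_losses_per_year)    # expected lost objects per year among 10 million
print(int(years_per_lost_object))  # years until one lost object is expected
```

So with 10 million objects stored, the expectation works out to one lost object per 10000 years.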

Naturally some of these different tiers make more sense at different times of the object lifecycle. So AWS allows you to define automated transitions between them. These are called Lifecycle Rules. AWS examples are one good way to explain them.

Free tier does not seem to include other than some use of the S3 standard tier, but just to try this out in a bit more realistic fashion, I defined a simple lifecycle pipeline for my log files as illustrated here:

s3 lifecycle

I did not actually implement the final Glacier transition, as it has such a long minimum storage time and I want to be able to terminate my experiment in a shorter duration.

It is also possible to define prefix filters and tag filters to select the objects for which the defined rules apply to. Prefix filters can be such as "logs1/" to match all objects that are placed under the "logs1" folder. S3 does not actually have real folder-style hierarchical structure, but naming files/objects like this makes it treat them like "virtual folders". So I defined such a prefix filter, just because it is nice to experiment and learn. Besides filename filters, one can also define tag filters in the form of key/value pairs for tags.

So my defined S3 lifecycle rules in this case, and reasoning for them:

  • Transition from standard to standard-IA after 30 days. Lets me play with the data for a few days if the import has issues. After that it should be in the RDS, and I just keep it around in a cheaper tier "just in case". Well that and 30 days was the minimum AWS allowed me to set.
  • Filter by the prefix "logs1/" as that was the path I used. I used the path simply to give me some granular control over time, as this allowed simple time-based filtering in the API queries if I would use "logs2/" after a year or so. Would need to update this transition rule then, or simply set it to "log" prefix at that time.
  • I did not define data expiration time. The idea would be to use this type of data for training machine learning systems. You would want to maximize that data, and enable experiments later. Not that I expect to build such real systems on all this, but in a real scenario I think this would also make sense, so trying to stay there.
  • Transition to Glacier maybe after 2 months? Just a possible thought. No idea. But some discussions online led me to understand there is a minimum time interval before one can "Glacier" objects. Similar to the 30 day minimum I hit on S3 standard -> S3 IA. If it is also 30 days for Glacier, that could make it 30+30 = about 2 months.
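
Written out as configuration, the rules above could look roughly like the following. The dict follows the shape that the boto3 S3 client takes in `put_bucket_lifecycle_configuration` (as its `LifecycleConfiguration` argument), but here it is only built and checked locally. The rule ID, the helper functions and the example object key are hypothetical, and the Glacier transition is left out since I did not implement it.

```python
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "chat-logs-to-ia",         # hypothetical rule name
            "Filter": {"Prefix": "logs1/"},  # only objects under the "logs1/" virtual folder
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
            ],
            # No "Expiration" element: the objects are kept indefinitely.
        }
    ]
}


def rule_applies(rule, key):
    """True if the rule is enabled and the object key matches its prefix filter."""
    return rule["Status"] == "Enabled" and key.startswith(rule["Filter"].get("Prefix", ""))


def storage_class_after(rule, age_days):
    """Storage class an object would be in at the given age, starting from STANDARD."""
    current = "STANDARD"
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


rule = lifecycle_configuration["Rules"][0]
print(rule_applies(rule, "logs1/2019-05-01.csv"))  # hypothetical object key
print(storage_class_after(rule, 10), storage_class_after(rule, 45))
```
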

Random Thoughts

Initially I thought security groups were a bit weird and unnecessarily complex, compared to fiddling with your private computers and networks. But with all this, I realized the "elastic" nature of AWS actually fits this quite well. It allows the security definitions to live with the dynamic cloud via the group associations.

Regarding this, Network Access Control Lists (NACL) would allow more fine-grained traffic rules at subnet levels. This is still a bit fuzzy part for me, as I did not need to go into such details in my limited experiment. But in skilled hands it seems to be quite useful. Maybe more for a security/network specialist.

Lambda functions that are not explicitly associated to the VPC are still associated to a VPC. This is simply some kind of "special" AWS controlled VPC. Makes me wonder a bit about what are all the properties of this "special" VPC, but I guess I just have to "trust AWS" again.

Looking at the architecture I used, the only thing in the public subnet is the EC2 instance I used to run my manual SQL queries against my RDS instance. Could I just drop the EC2 instance and as a result delete the whole public subnet and the Internet Gateway? The "natural" way that comes to mind is having access through an internet-connected Lambda function. An internet search for "AWS lambda shell" gives some interesting hits.

Someone used a Lambda shell to get access into the Lambda runtime environment, and download the AWS runtime code. Another similar experiment is hosted on Github, providing full Cloudformation templates to make it run as well. Finally, someone set up a Lambda shell in a browser, providing a bounty for anyone who manages to hack their infrastructure. Interesting..

On a related note, a NAT GW always requires an IGW. So if I wanted a private subnet to have access to the internet, I would still need a public subnet, even if it was otherwise empty besides the NAT-gateway. And while the NAT-GW is advertised as autoscaling and all that, I still need a NAT GW in each AZ used. But just one per AZ, of course.

Something I got very confused about are the EC2 instance storage types. There is instance store, which is "ephemeral", meaning it disappears when the VM is stopped or terminated. I always thought of this as a form of a local hard disk. Similar to how my laptop has a hard disk (or SSD…) inside. And the AWS docs actually describe it similarly as "this storage is located on disks that are physically attached to the host computer". Not too complicated?

But what is the other main option, Elastic Block Store (EBS)? The terms "block" and "storage" or "disk" bring to my mind the traditional definition of hard disks, with sectors hosting blocks of data. But it makes no sense, as EBS is described as virtual, replicated, highly available, etc.. A basic internet search also brings up similar, rather ambiguous definitions.

Some searching later, I concluded this is referring to the networked disk terminology, where block storage seems to have its own definition. So racks of disks, connected via Storage Area Network (SAN) technologies. As AWS advertises this with all kinds of fancy terminology, I guess it must be quite highly optimized, otherwise networked disks spreading data across physical hosts would seem slow to me. Probably just taking some of the well refined products in the space and turning them into an integrated part of AWS, as an "Elastic" branded service. But such is progress, it’s not the 80s disks as it used to be, Teemu. This definition makes sense considering what it is supposed to be as well. While the details are somewhat lacking, a Quora post provided some interesting insights.

I used RDS in this experiment to store what is essentially natural language text data. This is not necessarily the best option for this type of data. Rather something like Elasticsearch and its related AWS hosted service would be a better match. I simply picked RDS here to give myself a chance to try it out as a service, and since I don’t expect to store gigabytes of text in my experiment, or run complex searches over it. SQL is quite a simple and well tried out language, so it was easy enough for me to play with it and see everything was working. However, for a more real use case I would transition to Elasticsearch.

Despite all the fancy stuff, "Elastic" services, and all, some things on the AWS platform are surprisingly rigid still. I could not find a way to rename or change descriptions for Lambda functions or Security Groups. It seems I am not the only one who has thought about either of these over time. No wonder, as it seems rather basic functionality.

Overall, with my experience setting all this up, I used to think DevOps meant DEVelopment and OPerationS working closely together. Looking at this as well as all the trendy "infrastructure as code" lately, I am leaning more towards the side of DevOps referring to making the developers do the job of operations in addition to developing the system/service/program running on the infrastructure.. That’s just great…

Next I will look into making use of this type of collected data to build some service on AWS. Probably an API-Gateway and Lambda based service exposing a machine learning based model trained on the collected training data. In this case, I think I will look into using existing Twitter datasets to avoid considerations in all the aspects of actual data collection and use. But that would be another post, and I hope to also look into another one for how one might set up a dev/test environment for AWS style services. Later…

The code for this post is available on my related Github project. More specifically, the Lambda functions are at:

Testing Machine Learning Applications (or self-driving cars, apparently)


Testing machine learning (ML) based applications, or ML models, requires a different approach from more "traditional" software testing. Traditional software is controlled by manually defined, explicit, control-flow logic. Or to put it differently, someone wrote the code, it can be looked at, read, and sometimes even understood (including what and how to test). ML models, and especially deep learning (DL) models, implement complex functions that are automatically learned from large amounts of data. The exact workings are largely a black box, the potential input space huge, and knowing what and how to exactly test with them is a challenge. In this post I look at what kind of approaches have been taken to test such complex models, and consider what it might mean in the context of broader systems that use such models.

As I was going through the different approaches, and looking for ways in which they address testing ML models in general, I found that most (recent) major works in the area are addressing self-driving / autonomous cars. I guess it makes sense, as the potential market for autonomous cars is very large, and the safety requirements very strict. And it is a hot market. Anyway. So most of the topics I cover will be heavily in the autonomous cars domain, but I will also try to find some generalization where possible.

The Problem Space

Using autonomous cars as examples, most autonomous cars use external cameras as (one of) their inputs. In fact, Tesla seems to use mainly cameras, and built a special chip just to be super-effective at processing camera data with (DL) neural nets. Consider the input space of such cameras and systems as an example of what to test. Here are two figures from near the office where I work:

Dry vs Snow

The figure-pair above is from the same spot on two days, about a week apart. Yes, it is in May (almost summer, eh?), and yes, I am waiting for all your epic offers for a nice job to work on all this in a nice(r) weather. Hah. In any case, just in these two pictures there are many variations visible. Snow/no snow, shadows/no shadows, road markers or not, connecting roads, parking lots, other cars, and so on.

More generally, the above figure-pair is just two different options from a huge number of different possibilities. Just some example variants that come to mind:

environment related:

  • snowing
  • raining
  • snow on ground
  • puddles on the ground
  • sunny
  • cloudy
  • foggy
  • shadows
  • road blocked by construction
  • pedestrians
  • babies crawling
  • cyclists
  • wheelchairs
  • other cars
  • trucks
  • traffic signs
  • regular road markings
  • poor/worn markings
  • connecting roads
  • parking lots
  • strange markings (constructions, bad jokes, …)
  • holes in the road
  • random objects dropped on the road
  • debris flying (plastic bags etc)
  • animals running, flying, standing, walking, …

car related

  • position
  • rotation
  • camera type
  • camera quality
  • camera position
  • other sensors combined with the camera

And so on. Then all the other different places, road shapes, objects, rails, bridges, whatever. This all on top of the general difficulty of reliably identifying even the basic roads, pavements, paths, buildings, etc. in the variety of the real world. Now imagine all the possible combinations of all of the above, and more. And combining those with all the different places, roads, markings, road-signs, movements, etc. Testing all that to some reasonable level of assurance so you would be willing to sign it off to "autonomously drive". Somewhat complicated.

Of course, there are other ML problems in a multitude of domains, such as natural language processing, cybersecurity, medical, and everything else. The input space might be slightly more limited in some cases, but typically it grows very big, and has numerous elements and combinations, and their variants.

Neural Nets

In some of the topics later I refer to different types of neural network architectures. Here is just a very brief summary of a few main ones, as a basis, if not familiar.

Fully Connected / Dense / MLP

Fully connected (dense) networks are simply neural nets where every node in each layer is connected to every node in the previous layer. This is a very basic type of network, and illustrated below:


I took this image off the Internet somewhere. Trying to search for it, there seem to be a huge number of hits on it. So no idea of the real origins, sorry. The "dense" part just refers to all neurons in the network being connected to all the other neurons in their neighbouring layers. I guess this is also called a form of multi-layer perceptron (MLP). But I have gotten more used to calling it Dense, so I call it Dense.

The details of this model are not important for this post, so for more info, try the keywords in internet searches. In this post, it is mainly just important to understand the basic structure, for where neuron coverage or similar references are mentioned later.
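
As a rough sketch of what "fully connected" means in code, here is a toy layer where each neuron sums over every input from the previous layer. The weights, the sigmoid activation, and the function names are all made up for illustration, not taken from any of the works discussed.

```python
import math


def sigmoid(x):
    """Squash a value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs, weights, biases):
    """Compute the activations of one fully connected layer.

    weights[j][i] is the weight from input i to neuron j, so every
    neuron is connected to every input from the previous layer.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(total))
    return outputs


# Hypothetical layer: 3 inputs feeding 2 neurons, with made-up weights.
inputs = [1.0, 0.5, -1.0]
weights = [[0.2, -0.4, 0.1], [0.0, 0.0, 0.0]]
biases = [0.0, 0.0]
activations = dense_layer(inputs, weights, biases)
```

The list of activations, one value per neuron, is the kind of thing the neuron coverage measures discussed later look at.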


Convolutional neural networks (CNNs) aim to automatically identify local features in data, typically in images. These features start from low-level patterns (e.g., "corners", but can be any abstract form the CNN finds useful) and move to more compressed, higher-level ones as the network progresses from the raw image towards classification. The CNNs work by sliding a window (called "kernel" or "filter") across the image, to learn ways to identify those features. The filters (or their weight numbers) are also learned during training. The following animation from Stanford Deeplearning Wiki illustrates a filter window rolling over an image:


In the above animation, the filter is sliding across the image, which is represented in this case by some numbers of the pixels in the image. CNN networks actually consist of multiple such layers along with other layers. But I will not go into all the details here. For more info, try the Internet.
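
To make the sliding window idea a bit more concrete, here is a minimal sketch of a single filter moving over a small image, with no padding and a step of one pixel. The image and filter values are made up for illustration, and in a real CNN the filter weights would be learned in training rather than hand-written.

```python
def slide_filter(image, kernel):
    """Slide a kernel over a 2D image and return the resulting feature map."""
    kernel_height, kernel_width = len(kernel), len(kernel[0])
    out_height = len(image) - kernel_height + 1
    out_width = len(image[0]) - kernel_width + 1
    feature_map = []
    for row in range(out_height):
        out_row = []
        for col in range(out_width):
            # Multiply the window under the kernel element-wise and sum it up.
            total = 0
            for i in range(kernel_height):
                for j in range(kernel_width):
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


# Made-up 5x5 "image" of pixel values and a 3x3 filter.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
feature_map = slide_filter(image, kernel)
# feature_map is [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

A 5x5 image with a 3x3 filter gives a 3x3 feature map, since the window fits in three positions in each direction.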

CNNs can be used for anything that can have similar data expressivity and local features. For example, they are commonly applied to text using 1-dimensional filters. The idea of filters comes up with some coverage criteria later, which is why this is relevant here.


Another common type of network is the recurrent neural network (RNN). Whereas a common neural network does not consider the time aspect, recurrent neural nets loop over data in steps, and use some of the data from previous steps also as input to the following steps. In this sense, it has some "memory" of what it has seen. They typically work better when there is a relevant time-dimension in the data, such as analyzing objects in a driving path over time, or stock market trends. The following figure illustrates a very simplified view of an unrolled RNN with 5 time-steps.


This example has 5 time-steps, meaning it takes 5 values as input, each following one another in a series (hence, "time-series"). The X1-X5 in the figure are the inputs over time, the H1-H5 are intermediate outputs for each time-step. And as visible in the network, each time-step feeds to the next as well as producing the intermediate output.
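
The unrolled structure can be sketched as a plain loop, where each step gets the current input plus the hidden value from the previous step. This is just a toy with a single hidden value and made-up weights (w_in and w_rec are names I picked for illustration), not a real RNN layer.

```python
import math


def unrolled_rnn(inputs, w_in, w_rec, initial_hidden=0.0):
    """Run a one-neuron RNN over a series, returning the output of each step."""
    hidden = initial_hidden
    outputs = []
    for x in inputs:
        # The previous step's hidden value feeds into the current step.
        hidden = math.tanh(w_in * x + w_rec * hidden)
        outputs.append(hidden)
    return outputs


# Inputs X1-X5 produce intermediate outputs H1-H5, one per time-step.
xs = [0.1, 0.2, 0.3, 0.4, 0.5]
hs = unrolled_rnn(xs, w_in=0.5, w_rec=0.8)
```

Unrolling just means writing out the five loop iterations as five connected copies of the same neuron, which is what the figure shows.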

Common, more advanced variants of this are Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. Again, try some tutorial searches for details.

As with CNN, the relevance here is when some of these concepts (e.g., unrolling) come up with coverage criteria later, and maybe as some background to understand how they work. For example, how a car system might try to predict driving angles based on recent history of visual images it observes over time, not just a single frame.


A generative adversarial network (GAN) is a two-part network, where one part generates "fake" images, and another part tries to classify them as correct or fake. The idea is simply to first build a classifier that does a good job of identifying certain types (classes) of images. For example, rainy driving scenes. Like so:

Classifier base

Once this classifier works well, you train another network that tries to generate images to fool this classifier. The network that generates the "fake" images is called the generator, and the one that tries to tell the fakes from real is called the discriminator. In a sense the discriminator network provides the loss function for the generator. Meaning it just tells how well the generator is doing, and how to do better. Like so:

Generator NW

So the generator learns to generate realistic fakes by learning to fool the discriminator. In a way, one could describe such a model to "style" another image to a different style. But GANs are used for a variety of image generation and manipulation tasks. When the fake-generator becomes good enough to fool the good classifier, you have a working GAN model. The "adversarial" part refers to the fake-generator being an adversary for the classifier (discriminator) part.

The following image illustrates a real GAN transformation performed by UNIT from summery image (real) to wintery image (fake):


It is a nice looking transformation, very realistic. In the context of this post, this pair provides very useful test data, as we will see later. Here is what it might look like if it was able to generate the snowy road I put in the beginning of this post vs the dry road on different days:


So the real world is still a bit more diverse, but the generated figures from UNIT are already very impressive. I have no doubt they can be of use in testing ML systems, as I will discuss later in this post. The same approach also works beyond images, as long as you can model the data and domain as a GAN problem. I believe it has been applied to generate sounds, music, text, .. probably much more too.

This GAN information is mainly relevant, for this post, in terms of having some basic understanding of what a GAN is, as this comes up in some of the works later in this post on generating different types of driving scenarios for testing. For more details on the topic, I refer to the mighty Internet again.

Getting (Realistic) Data

A fundamental aspect in applying ML, and especially DL, is getting as much high quality data as you can. Basically, the more the merrier. With more data, you can train bigger and better models, get higher accuracy, run more experiments and tests, and keep improving. Commonly this data also needs "labels" to tell when something is happening in the data, to enable the ML algorithms to learn something useful from it. Such as steering a car in response to the environment.

How to get such data? Autonomous car development is, as usual, a good example here.

Car companies have huge fleets of vehicles on the road, making for great data collection tools. Consider the Tesla Autopilot as an example. At the time of writing this (May 2019), the Tesla Autopilot website describes the system as having:

  • 8 surround cameras with up to 250 m of visibility
  • 12 ultrasonic sensors
  • forward facing radar (advertised as "able to see through heavy rain, fog, dust and even the car ahead")

Imagine all this deployed in all their customers’ cars all the time, recording "free" training data for you. Free as in you as the car manufacturer (or I guess "developer" these days) do not pay people to drive and collect data, but rather the customers pay the manufacturer for the privilege to do so. Potentially millions of customers.

In the Tesla Autonomy day, Andrej Karpathy (famous ML guy, now director of AI at Tesla), describes the Tesla approach as first starting with a manually collected and labeled training set. Followed by using their fleet of cars to identify new scenarios, objects, anything of interest that the system should learn to handle better. Dispatching requests to all their cars out there to provide images and data on such scenarios if/when identified.

They also describe how they use the sensor data across different sensors to train the system better. That is, using the radar measurements to give labels for how far identified objects are, to train the vision system to recognize the distance as well.

In more traditional software testing terminology, this might be called a form of "online testing", or more generally just continuous monitoring, learning, and improvement. As I noted above, Tesla even introduced their specialized DL processing chips for their cars called "Tesla HW3", to enable much more efficient and continuous processing of such ML models. Such systems will also be able to continuously process the data locally to provide experiences on the usefulness of the ML model (compared to user input), and to help build even better "autonomy".

Taking this further, there is something called Tesla Shadow Mode, a system that is enabled all the time when a human driver is driving the car themselves or with the help of the autopilot. The (ML) system is constantly trained on all the data collected from all the cars, and runs in parallel when the car is driving. It does not make actual decisions but rather "simulated" ones, and compares them to how the human driver actually performed in the situation. Tesla compares the differences and uses the data to refine and analyze the system to make it better.

This makes for an even more explicit "online testing" approach. The human driver is providing the expected result for the test oracle, while the current autopilot version provides the actual result (output). So this test compares the human driver decisions to the autonomous guidance system decisions all the time, records differences and trains itself to do better from the differences. Test input is all the sensor data from the car.
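
Seen as a test oracle, the comparison could be sketched roughly as below. I have no knowledge of how Tesla actually implements this, so the record format, the steering angle as the compared decision, and the tolerance value are all assumptions made up for illustration.

```python
def shadow_mode_disagreements(records, tolerance_deg=2.0):
    """Return the records where the model's simulated decision differs
    from the human driver's actual decision by more than the tolerance."""
    disagreements = []
    for frame_id, human_angle, model_angle in records:
        # The human decision is the expected result, the model's the actual one.
        if abs(human_angle - model_angle) > tolerance_deg:
            disagreements.append((frame_id, human_angle, model_angle))
    return disagreements


# Made-up log: (frame id, human steering angle, model steering angle), in degrees.
drive_log = [
    ("frame-001", 0.0, 0.5),
    ("frame-002", 10.0, 3.0),
    ("frame-003", -4.0, -4.5),
]
to_review = shadow_mode_disagreements(drive_log)
```

Here only the second frame would be recorded as a difference, to be analyzed and possibly used as new training data.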

Beyond Tesla, there seems to be little information to be found on how other car companies do this type of data collection. I guess maybe because it can be one of the biggest competitive edges to actually have access to the data. However, I did find a description from a few years back on the Waymo system for training their autonomous cars. This includes both an advanced simulation system (called CarCraft), as well as a physical site for building various test environments. Data is collected from actual cars driving the roads, new and problematic scenarios are collected and analyzed, modelled and experimented with extensively in the simulations and the physical test environment. Collected data is used to further train the systems, and the loop continues. In this case, the testing includes real-world driving, physical test environments, and simulations. Sounds very similar to the Tesla case, as far as I can tell.


What about the rest of the ML world besides autonomous cars? One example that I can provide for broader context is from the realm of cybersecurity. Previously I did some work also in this area, and the cybersecurity companies are also leveraging their information and data collection infrastructure in a similar way. In the sense of using it to build better data analysis and ML models as a basis for their products and services.

This includes various end-point agents such as RSA NetWitness, TrendMicro, Carbon Black, Sophos Intercept, and McAfee. Car companies such as Tesla, Waymo, Baidu, etc. might use actual cars as endpoints to collect data. In a similar way, pretty much all cybersecurity companies are utilizing their deployed infrastructure to collect data, and use that data as inputs to improve their analytics and ML algorithms.

These are further combined with their Threat Intelligence Platforms, such as Carbon Black Threat Analysis, Brightcloud Webroot, AlienVault Open Threat Exchange, and Internet Storm Center. These use the endpoint data as well as the cybersecurity vendors’ own global monitoring points and network to collect data and feed them to their ML algorithms for training and analysis.

The above relates the cybersecurity aspects to the autonomous car aspects of data collection and machine learning algorithms. What about simulation to augment the data in cybersecurity vs autonomous cars? I guess the input space is a bit different in nature here, as no form of cybersecurity simulation directly comes to mind. I suppose you could also just arrange cyber-exercises and hire hackers / penetration testers etc. to provide you with the reference data in attacks against the system. Or maybe my mind is just limited, and there are better approaches to be had? 🙂

Anyway, in my experience, all you need to do is to deploy a simple server or services on the internet and within minutes (or seconds?) it will start getting hammered with attacks. And it is not like the world is not full of Chinese hackers, Russian hackers, and all the random hackers to give you real data.

I guess this illustrates some difference in domains, where cybersecurity looks more for anomalies, which the adversary might try to hide in different ways in more (semi-)structured data. Autonomous cars mostly need to learn to deal with the complexity of real world, not just anomalies. Of course, you can always "simulate" the non-anomaly case by just observing your cybersecurity infrastructure in general. This is maybe a more realistic scenario (to observe the "normal" state of a deployed information system) to provide a reference than observing a car standing in a parking lot.

In summary, different domains, different considerations, even if the fundamentals remain.

MetaMorphic Testing (MT)

Something that comes up all the time in ML testing is MetaMorphic Testing, sometimes also called property-based testing. The general idea is to describe the software functionality in terms of relations between inputs and outputs, rather than exact specifications of input to output. You take one input which produces an observed output, modify this input, and instead of specifying an exact expected output, you specify a relationship between the original input and related output and the modified input and its related output.

An example of a traditional test oracle might be to input a login username and password, check that it passes with correct inputs, and fails with wrong password. Very exact. Easy and clear to specify and implement. The general examples for metamorphic testing are not very exciting, such as something about the sine function. A more intuitive example I have seen is about search engines (Zhou2016). Search engines are typically backed by various natural language processing (NLP) based machine learning approaches, so it is fitting in that regard as well.

As a search-engine example, input the search string "testing" to Google, and observe some set of results. I got 1.3 billion hits (matching documents) as a result when I tried "testing" when writing this. Change the query string to be more specific "metamorphic testing" and the number of results should be fewer than the original query ("testing" vs "metamorphic testing"), since we just made the query more specific.

In my case I got 25k results for the "metamorphic testing" query. This is an example of a metamorphic relation: a more specific search term should return fewer or an equal number of results compared to the original query. So if "metamorphic testing" would have resulted in 1.5 billion hits instead of 25k, the test would have failed, since it broke the metamorphic relation we defined. With 25k, it passes as the results are fewer than for the original search query, and the metamorphic relation holds.
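
The search relation can be sketched with a toy in-memory "search engine" standing in for Google. The documents, the two search functions, and the helper name are all made up for illustration. The point is only that the test checks the relation between two result counts, never an exact expected count.

```python
def search(query, documents):
    """Toy search: return the documents that contain all of the query terms."""
    terms = query.lower().split()
    return [doc for doc in documents if all(term in doc.lower() for term in terms)]


def broken_search(query, documents):
    """Toy search with a bug: matches documents containing any of the terms."""
    terms = query.lower().split()
    return [doc for doc in documents if any(term in doc.lower() for term in terms)]


def narrowing_relation_holds(search_fn, base_query, extra_term, documents):
    """Metamorphic relation: a more specific query must not return more hits."""
    base_hits = len(search_fn(base_query, documents))
    narrowed_hits = len(search_fn(f"{extra_term} {base_query}", documents))
    return narrowed_hits <= base_hits


documents = [
    "Software testing basics",
    "Metamorphic testing for search engines",
    "Unit testing in Python",
    "Metamorphic rocks",
]
good_engine_ok = narrowing_relation_holds(search, "testing", "metamorphic", documents)
broken_engine_ok = narrowing_relation_holds(broken_search, "testing", "metamorphic", documents)
```

With these made-up documents the relation holds for the first engine (3 hits narrows to 1) and is violated by the broken one (3 hits grows to 4), without anyone having to specify what the "correct" hits are.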

In the following examples, metamorphic testing is applied to many views of testing ML applications. In fact, people working on metamorphic testing must be really excited since this seems like the golden age of applying MT. It seems to be a great match to the type of testing needed for ML, which is all the rage right now.

Test Generation for ML applications

This section is a look at test generation for machine learning applications, focusing on testing ML models. I think the systems using these models as part of their functionality need to be looked at in a broader context. In the end, they use the models and their predictions for something, which is the reason the system exists. However, I believe looking at testing the ML models is a good start, and complex enough already. Pretty much all of the works I look at next use variants of metamorphic testing.

Autonomous cars

DeepTest in (Tian2018) uses transformations on real images captured from driving cars to produce new images. These are given as input to a metamorphic testing process. The observed output is the system prediction/model output (e.g., steering angle). The metamorphic relation is to check that the model prediction does not significantly change across transformations of the same basic image. For example, the driving path in sun and rain should be about the same for the same road, in the same environments, with same surrounding traffic (cars etc.). The relation is given some leeway (the outputs do not have to be 100% match, just close), to avoid minimal changes causing unnecessary failures.
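
The relation with its leeway can be sketched as below. The two "models" here are trivial made-up functions standing in for a real steering model, and the tolerance value is just an assumption. This is my own illustration of the idea, not code from (Tian2018).

```python
def steering_relation_holds(model, image, transform, tolerance=0.1):
    """Check that the prediction stays within tolerance across a transformation."""
    original = model(image)
    transformed = model(transform(image))
    return abs(original - transformed) <= tolerance


def brighten(image, delta=0.2):
    """Add a constant to every pixel, capping values at 1.0."""
    return [[min(1.0, pixel + delta) for pixel in row] for row in image]


def steer_by_contrast(image):
    """Toy model: steer based on the right half being brighter than the left."""
    half = len(image[0]) // 2
    left = sum(sum(row[:half]) for row in image)
    right = sum(sum(row[half:]) for row in image)
    return (right - left) / (len(image) * half)


def steer_by_brightness(image):
    """Toy model that (wrongly) depends on overall image brightness."""
    pixels = [pixel for row in image for pixel in row]
    return sum(pixels) / len(pixels) - 0.5


# Made-up 2x4 grayscale image with pixel values from 0 to 1.
image = [[0.2, 0.2, 0.6, 0.6], [0.2, 0.2, 0.6, 0.6]]
robust_ok = steering_relation_holds(steer_by_contrast, image, brighten)
fragile_ok = steering_relation_holds(steer_by_brightness, image, brighten)
```

The first toy model passes since brightening both halves equally does not change its output, while the second one shifts its output by about 0.2 and breaks the relation.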

The transformations they use include:

  • brightness change (add/subtract constant from pixels)
  • contrast change (multiply pixels by constant)
  • translation (moving/displacing the image by n pixels)
  • scaling the image bigger or smaller
  • shear (tilting the image)
  • rotation (around its center)
  • blur effect
  • fog effect
  • rain effect
  • combinations of all above
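
A few of the simpler transformations in the list above can be sketched on a tiny grayscale image stored as nested lists. This is a simplified illustration with made-up values and function names. A real implementation would more likely use an image library, and the blur, fog, and rain effects are left out.

```python
def clip(value, low=0, high=255):
    """Keep a pixel value inside the valid range."""
    return max(low, min(high, value))


def change_brightness(image, delta):
    """Add (or with a negative delta, subtract) a constant from every pixel."""
    return [[clip(pixel + delta) for pixel in row] for row in image]


def change_contrast(image, factor):
    """Multiply every pixel by a constant."""
    return [[clip(round(pixel * factor)) for pixel in row] for row in image]


def translate_right(image, shift, fill=0):
    """Move the image right by `shift` pixels, filling the gap on the left."""
    width = len(image[0])
    return [([fill] * shift + row)[:width] for row in image]


def combine(*transforms):
    """Build one transformation that applies the given ones in order."""
    def apply(image):
        for transform in transforms:
            image = transform(image)
        return image
    return apply


# Made-up 2x3 grayscale image with pixel values from 0 to 255.
image = [[10, 100, 200], [50, 150, 250]]
brighter = change_brightness(image, 20)
shifted = translate_right(image, 1)
brighter_and_shifted = combine(
    lambda img: change_brightness(img, 20),
    lambda img: translate_right(img, 1),
)(image)
```

Each function returns a new image, so the original stays available for comparing the model output before and after the transformation.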

The following illustrates the idea of one such transformation on the two road images (from near my work office) from the introduction:


For this, I simply rotated the phone a bit, while taking the pics. The arrows are there just as some made-up examples of driving paths that a ML algorithm could predict. Not based on any real algorithm, just an illustrative drawing in this case. In metamorphic testing, you apply such transformations (rotation in this case) on the images, run the ML model and the system using its output to produce driving (or whatever else) instructions, compare to get a similar path before and after the transformation. For real examples, check (Tian2018). I just like to use my own when I can.

My example from above also illustrates a problem I have with the basic approach. This type of transformation would also change the path the car should take, since if the camera is rotated, the car would be rotated too, right? I expect the camera(s) to be fixed in the car, and not changing positions by themselves.

For the other transformations listed above this would not necessarily be an issue, since they do not affect the camera or car position. This is visible in a later GAN based example below. Maybe I missed something, but that is what I think. I believe a more complex definition of the metamorphic relation would be needed than just "path stays the same".

Since there are typically large numbers of input images, applying all transformations above and their combinations to all of the images would produce very large input sets with potentially diminishing returns. To address this, (Tian2018) applies a greedy search strategy. Transformations and their combinations are tried on images, and when an image or a transformation pair is found to increase neuron activation coverage, they are selected for further rounds. This iterates until a defined ending threshold (number of experiments) is reached.
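
My simplified reading of such a coverage-guided greedy search is sketched below. It is not the algorithm from the paper. The "images" here are just integers, and covered_by is a hypothetical stand-in for running the model on an image and returning the set of neurons it activates.

```python
def greedy_coverage_search(seed_images, transforms, covered_by, max_rounds=100):
    """Keep only transformed images that activate previously uncovered neurons."""
    covered = set()
    for image in seed_images:
        covered |= covered_by(image)
    selected = []
    queue = list(seed_images)
    rounds = 0
    while queue and rounds < max_rounds:
        image = queue.pop(0)
        for transform in transforms:
            rounds += 1
            candidate = transform(image)
            newly_covered = covered_by(candidate) - covered
            if newly_covered:
                # The candidate increased coverage: keep it and build on it later.
                covered |= newly_covered
                selected.append(candidate)
                queue.append(candidate)
    return selected, covered


def toy_covered_by(image):
    """Made-up stand-in: each integer 'image' activates one of five 'neurons'."""
    return {image % 5}


toy_transforms = [lambda image: image + 1, lambda image: image * 2]
selected, covered = greedy_coverage_search([1], toy_transforms, toy_covered_by)
```

In this toy run ten candidates are tried, but only the four that add new coverage are kept as generated tests.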

Coverage in (Tian2018) is measured in terms of activations of different neurons with the input images. For dense networks this is simply the activation value for each neuron. For CNNs they use averages over the convolutional filters. For RNNs (e.g., LSTM/GRU), they unroll the network and use the values for the unrolled neurons.

A threshold of 0.2 is used in their case for the activation measure (neuron activation values range from 0 to 1), even if no rationale is given for this value. I suppose some empirical estimates are in order, in whatever case, and this is what they chose. How these are summed up across the images is a bit unclear from the paper but I guess it would be some kind of combinatorial coverage thing. This is described as defining a set of equivalence partitions based on the similar neuron activations, so maybe there is more to it but I did not quite find it in the paper. An interesting thought in any case, to apply equivalence partitioning based on some metrics over the activation thresholds.
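
One possible reading of such a coverage measure is sketched below: a neuron counts as covered if its activation exceeds the threshold for at least one input, and a convolutional filter is reduced to one value by averaging its output. The union across inputs is my assumption, given that the summing is unclear to me, and the activation values are made up.

```python
def filter_activation(feature_map):
    """Reduce a convolutional filter's output to a single value by averaging."""
    values = [value for row in feature_map for value in row]
    return sum(values) / len(values)


def activated_neurons(activations, threshold=0.2):
    """Return the indexes of neurons whose activation exceeds the threshold."""
    return {index for index, value in enumerate(activations) if value > threshold}


def neuron_coverage(activations_per_input, threshold=0.2):
    """Fraction of neurons activated by at least one of the inputs."""
    covered = set()
    for activations in activations_per_input:
        covered |= activated_neurons(activations, threshold)
    return len(covered) / len(activations_per_input[0])


# Made-up activations of 4 neurons for 2 input images.
runs = [
    [0.1, 0.3, 0.0, 0.9],
    [0.25, 0.1, 0.0, 0.5],
]
coverage = neuron_coverage(runs)
```

With these numbers three of the four neurons pass the 0.2 threshold for some input, giving a coverage of 0.75.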

The results (Tian2018) get in terms of coverage increases, issues found etc. are shown as good in the paper. Of course, I always take that with a bit of salt since that is how academic publishing works, but it does sound like a good approach overall.

Extending this with metamorphic test generation using generative adversarial networks (GANs) is studied in (Zhang2018(1)). GANs are trained to generate images with different weather conditions. For example, taking a sunny image of a road, and transforming this into a rainy or foggy image. Metamorphic relations are again defined as the driving guidance should remain the same (or very close) as before this transformation. The argument is that GAN generated weather effects and manipulations are more realistic than more traditional synthetic manipulations for the same (such as in DeepTest from above, and cousins). They use UNIT (Liu2017) toolkit to train and apply the GAN models, using input such as YouTube videos to train.

The following images illustrate this with two images from the UNIT website:


On the left in both pairs is the original image, on the right is the GAN (UNIT) transformed one. The pair on the left is transformed into a night scene, the pair on the right into a rain scene. Again, I drew the arrows myself, and this time they would also make more sense. The only thing changed is the weather. The shape of the road, other cars, and other environmental factors are the same. So the metamorphic test relation where the path should not change should hold true.

The test oracle in this work is similar to other works (e.g., (Tian2018) above). The predictions for the original and transformed image are compared and should closely match.

Input Validation

A specific step of input validation is also presented in (Zhang2018(1)). Each image analyzed by the automated driving system is first verified in relation to known input data (images it has seen before and was trained on). If the image is classified as significantly different from all the training data previously fed into the network, the system is expected to advise the human driver to take over. The approach for input validation in (Zhang2018(1)) is to project the high-dimensional input data into a smaller dimension, and measure the distance of this projection from the projections of the used training data. If the distance crosses a set threshold, the system alerts the human driver about potentially unknown/unreliable behaviour. Various other approaches to input validation could probably be taken, in general the approach seems to make a lot of sense to me.
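
The general idea of projecting and measuring distance can be sketched as below. The fixed projection matrix, the nearest-neighbour distance, and the threshold are all assumptions made up for illustration. The projection method actually used in (Zhang2018(1)) is not reproduced here.

```python
import math


def project(vector, projection_matrix):
    """Project a high-dimensional vector into a smaller dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in projection_matrix]


def needs_human_takeover(image_vector, training_vectors, projection_matrix, threshold):
    """Flag an input whose projection is far from all training data projections."""
    point = project(image_vector, projection_matrix)
    references = [project(vector, projection_matrix) for vector in training_vectors]
    nearest = min(math.dist(point, reference) for reference in references)
    return nearest > threshold


# Made-up 4-dimensional "images" projected down to 2 dimensions.
projection_matrix = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
training_vectors = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
familiar = needs_human_takeover([1.0, 1.0, 0.0, 0.2], training_vectors, projection_matrix, 0.5)
unfamiliar = needs_human_takeover([5.0, 5.0, 5.0, 5.0], training_vectors, projection_matrix, 0.5)
```

The first input lands close to a training projection and is accepted, while the second is far from everything seen in training and would trigger the alert.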

LIDAR analysis

Beyond image-based approaches only, a study of applying metamorphic testing on Baidu Apollo autonomous car system is presented in (Zhou2019). In this case, the car has a LIDAR system in place to map its surroundings in detail. The system first identifies a region of interest (ROI), the "drivable" area. Which sounds like a reasonable goal, although I found no process of identifying it in (Zhou2019). This system consists of multiple components:

  • Object segmentation and bounds identification: Find and identify obstacles in ROI
  • Object tracking: Tracking the obstacles (perhaps their movement? hard to say from paper)
  • Sequential type fusion: To smooth the obstacle types over time (I guess to reduce misclassification or lost objects during movement over time)

Their study focuses on the object identification component. The example of a metamorphic relation given in (Zhou2019) is to add new LIDAR points outside the ROI area to an existing point cloud, and run the classifier on both the before- and after-state. The metamorphic relation is again that the same obstacles should be identified both before and after adding small amounts of "noise". In relation to the real world, such noise is described as potentially insects, dust, or sensor noise. The LIDAR point cloud data they use is described as having over 100000 samples per point cloud, and over a million per second. The amount of noise injected is varied between 10, 100, and 1000 points. These are considered small numbers compared to the total of over 100k points in each point cloud (thus "reasonable" noise). These experiments produce violations of the metamorphic relation in 2.7%, 12.1%, and 33.5% of cases, respectively.
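The relation described above can be sketched in a few lines. Everything here is an assumption for illustration: the ROI is a simple rectangle, the points are `(x, y, z)` tuples, and `detect_obstacles` is a toy stand-in for the real perception component (it trivially satisfies the relation, whereas the point of the study is that a real detector may not).

```python
import random

def in_roi(point, roi):
    """roi is ((x_min, x_max), (y_min, y_max)); points are (x, y, z) tuples."""
    (x_min, x_max), (y_min, y_max) = roi
    return x_min <= point[0] <= x_max and y_min <= point[1] <= y_max

def add_noise_outside_roi(cloud, roi, n_points, world=100.0, seed=0):
    """Return a new cloud with n_points random points that all lie outside the ROI."""
    rng = random.Random(seed)
    noisy = list(cloud)
    while len(noisy) < len(cloud) + n_points:
        p = (rng.uniform(-world, world), rng.uniform(-world, world), rng.uniform(0.0, 3.0))
        if not in_roi(p, roi):
            noisy.append(p)
    return noisy

def detect_obstacles(cloud, roi, cell=5.0):
    """Toy detector stand-in: the set of grid cells inside the ROI that contain points."""
    return {(int(p[0] // cell), int(p[1] // cell)) for p in cloud if in_roi(p, roi)}

def relation_holds(cloud, roi, n_noise):
    """Metamorphic relation: obstacles found in the ROI should not change."""
    before = detect_obstacles(cloud, roi)
    after = detect_obstacles(add_noise_outside_roi(cloud, roi, n_noise), roi)
    return before == after

# Made-up ROI and a tiny point cloud.
roi = ((-10.0, 10.0), (-10.0, 10.0))
cloud = [(1.0, 2.0, 0.5), (1.5, 2.5, 0.6), (-7.0, 3.0, 1.0)]
for n_noise in (10, 100, 1000):
    print(n_noise, relation_holds(cloud, roi, n_noise))
```

Swapping the toy detector for a call into the system under test is what would turn this into an actual metamorphic test.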

The point cloud is illustrated by this image from the paper:


In the figure above, the green boxes represent detected cars, and the small pink one a pedestrian. The object detection errors found were related to missing some of these types of objects when metamorphic tests were applied. The results they show also classify the problems to different numbers of mis-classifications per object type, which seems like a useful way to report these types of results for further investigation.

(Zhou2019) report discussing their findings with the Baidu Apollo team, getting acknowledgement for the issues, and the view that the MT approach could be a useful way to augment their training and test data. This is in line with what I would expect: ML algorithms are trained and evaluated on the training dataset. If testing finds misclassifications, it makes sense to use the generated test data as a way to improve the training dataset.

This is also visible in the data acquisition part I discussed higher above, in using different means to augment data collected from real-world experiments, real-world test environments, and simulations. MT can function as another approach to this data augmentation, but with an integrated test oracle and data generator.

Other Domains

Besides image classifiers for autonomous cars, ML classifiers have been applied broadly to image classification in other domains as well. A metamorphic testing related approach to these types of more generic images is presented in (Dwarakanath2018). The image transformations applied in this case are:

  • rotate by 90, 180, 270
  • flip images vertically or horizontally
  • switching RGB channels. e.g., RGB becomes GRB
  • normalizing the (test) data
  • scaling the (test) data
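The first three transformations in the list are easy to sketch in plain Python. This is only an illustration under my own assumptions: an image is a list of rows of `(r, g, b)` tuples, and `classify` is a made-up toy classifier, not anything from (Dwarakanath2018).

```python
def rotate_90(image):
    """Rotate clockwise by 90 degrees; image is a list of rows of (r, g, b) pixels."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Mirror the rows top to bottom."""
    return image[::-1]

def swap_rg(image):
    """Switch colour channels so that RGB becomes GRB."""
    return [[(g, r, b) for (r, g, b) in row] for row in image]

def classify(image):
    """Toy classifier stand-in: 'bright' or 'dark' based on the mean channel value."""
    values = [channel for row in image for pixel in row for channel in pixel]
    return "bright" if sum(values) / len(values) > 127 else "dark"

def relation_holds(image, transform):
    """Metamorphic relation: the predicted class should survive the transformation."""
    return classify(image) == classify(transform(image))

# A made-up 2x2 image.
image = [[(255, 0, 0), (200, 200, 200)],
         [(10, 20, 30), (0, 255, 0)]]
for transform in (rotate_90, flip_horizontal, flip_vertical, swap_rg):
    print(transform.__name__, relation_holds(image, transform))
```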

The main difference I see in this case is the domain specific limitations of autonomous cars vs general images. If you look to classify random images on the Internet or on Facebook, Instagram, or similar places, you may expect them to arrive in any sort of form or shape. You might not expect the car to be driving upside down or to have strangely manipulated color channels. You might expect the car system to use some form of input validation, but that would be before running the input images through the classifier/DL model. So maybe rotation by 90 degrees is not so relevant to test for the autonomous car classifier (except as part of input validation before the classifier), but it can be of interest to identify a rotated boat in Facebook posts.

Another example of similar ideas is in (Hosseini2017) (probably many others too, …). In this case they apply transformations as negations of the images. Which, I guess, again could be useful to make general classifiers more robust.

An example of applying metamorphic testing to machine learning in the medical domain is presented in (Ding2017). This uses MT to generate variants of existing high-resolution biological cell images. A number of metamorphic relations are defined to check, related to various aspects of the cells (mitochondria etc.) in the images, and the manipulations done to the image. Unlike the autonomous car-related examples, these seem to require in-depth domain expertise and as such I will not go very deep into those (I am no expert in biological cell structures). Although I suppose all domains would need some form of expertise, the need just might be a bit more emphasized in the medical domain.

Generating Test Scenarios

An approach of combining model-based testing, simulation, and metamorphic testing for testing an "AI" (so machine learning..) based flight guidance system of autonomous drones is presented in (Lindvall2017).

The drone is defined as having a set of sensors (or subset of these):

  • inertial measurement unit
  • barometer
  • GPS
  • cameras
  • ultrasonic range finder

Metamorphic relations:

  • behaviour should be similar across similar runs
  • rotation of world coordinates should have no effect
  • coordinate translation: same scenario in different coordinates should have no effect
  • obstacle location: same obstacle in different locations should have same route
  • obstacle formation: similar to location but multiple obstacles together

The above are related to modifying the environment. Other relations are used as general test oracle checks, where values should always stay within certain bounds:

  • obstacle proximity
  • velocity
  • altitude
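Bounds like these make for very simple test oracles over a simulated flight. A minimal sketch, where the limit values, the field names, and the trace itself are all hypothetical and not taken from (Lindvall2017):

```python
def violations(trace, limits):
    """Return (step, name, value) for every sample outside its allowed range."""
    found = []
    for step, sample in enumerate(trace):
        for name, (low, high) in limits.items():
            value = sample[name]
            if not low <= value <= high:
                found.append((step, name, value))
    return found

# Hypothetical limits and a made-up simulated flight trace.
limits = {
    "obstacle_proximity_m": (1.0, float("inf")),
    "velocity_mps": (0.0, 15.0),
    "altitude_m": (0.0, 120.0),
}
trace = [
    {"obstacle_proximity_m": 8.0, "velocity_mps": 3.0, "altitude_m": 10.0},
    {"obstacle_proximity_m": 0.4, "velocity_mps": 5.0, "altitude_m": 12.0},
    {"obstacle_proximity_m": 6.0, "velocity_mps": 17.5, "altitude_m": 12.0},
]
print(violations(trace, limits))
```

The same check can run unchanged over every generated scenario, which is what makes it useful as a generic oracle.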

MBT is typically used to represent the behaviour of a system as a state-machine. The test generator traverses the state-machine, generating tests as it goes. In this case, the model generates test steps that create different elements (landing pads, obstacles, etc) in the test environment. The model state contains the generated environment. After this initial phase, the model generates test steps related to flying. Lifting off, returning to home base, landing on a landing pad, etc.

I guess the usefulness of this depends on the accuracy of the simulation, as well as how the sensors are used. Overall, it does seem like a good approach to generate environments and scenarios, with various invariant based checks for generic test oracles. For guiding a car based on camera input, maybe this is not so good. But perhaps for some sensor types where the input space is well known, and realistic models of the environment and real sensor data are less complex and easier to create?

Testing ML Algorithms

A large part of the functionality of ML applications depends on ML frameworks and their implementation of algorithms. Thus their correctness is quite related to the applications, and there are a lot of similar aspects to consider. As noted many times above, it is difficult to exactly know what is the expected result of a ML model, when the "prediction" is correct and so on. This is exactly the same scenario with the implementation of those learning algorithms, as their output is just those models, their training, the predictions they produce, and so on. How do you know what it produced is correct? That the implementation of those models and their training is "correct"?

A generic approach to test anything when you have multiple implementations is to pit those implementations against each other. Meaning to run them all with the same input and configurations, and compare the outputs. In (Pham2019) such an approach is investigated by using Keras as the application programming interface (API), and various backends for it. The following figure illustrates this:

Keras Layers

The basic approach here is to run the different backends for the same models and configurations, and compare results. There are often some differences in the results of those algorithms, even if the implementation is "correct", due to some randomness in the process, the GPU operations used, or some other factor. Thus, (Pham2019) rather compares the distributions and "distances" of the results to see if there are large deviations in one of the multiple implementations compared to others.
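As a rough sketch of this kind of cross-implementation comparison: flag any backend whose output is far from the element-wise median of all backends. This is my own simplification, not the distance metric used in (Pham2019), and the backend names, outputs, and tolerance are made up.

```python
import statistics

def mean_abs_distance(a, b):
    """Average absolute difference between two equally long output vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def suspicious_backends(outputs, tolerance):
    """Flag backends whose output is far from the element-wise median of all backends."""
    names = list(outputs)
    length = len(outputs[names[0]])
    median = [statistics.median(outputs[n][i] for n in names) for i in range(length)]
    return [n for n in names if mean_abs_distance(outputs[n], median) > tolerance]

# Made-up class probabilities from three hypothetical backends for the same model and input.
outputs = {
    "backend_a": [0.10, 0.70, 0.20],
    "backend_b": [0.11, 0.69, 0.20],
    "backend_c": [0.60, 0.25, 0.15],
}
print(suspicious_backends(outputs, tolerance=0.05))
```

The tolerance is what absorbs the "normal" differences from randomness and GPU operations. Only a backend deviating well beyond that gets flagged.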

This topic is explored a bit further in (Zhang2018(2)), where issues (bug reports etc) against TensorFlow are explored. The issues are accessed via the project GitHub issue tracker, and related questions on StackOverflow. Some of these remind me of the topics listed by Karpathy’s recent post on model training "gotchas".

One from (Zhang2018(2)) that I have found especially troublesome myself is "stochastic execution", referring to different runs with the same configurations, models, and data producing differing results. In my experience, even with fixing all the seed values, exact reproducibility can be a challenge. At least in Kaggle competitions, where one is aiming for every slightest gain possible :). In any case, when testing the results of trained ML algorithms across iterations, I have found exact reproducibility can be an issue in general, which is what this refers to.


In the above sections, I have discussed the general testing of ML applications (models). This is the viewpoint of using techniques such as metamorphic testing to generate numerous variants of real scenarios, and the metamorphic relations as a test oracle. This does not address the functionality of the system from the viewpoint of a malicious adversary, or generally of "stranger" inputs. In machine learning research, this is usually called "adversarial input" (Goodfellow2018).

An example from (Goodfellow2018) is to fool an autonomous car (surprise) to misclassify a stop sign and potentially lead to an accident or other issues. The following example illustrates this from the web-version of the (Goodfellow2018) article:

Adversarial example

The stop sign on the left is the original, the one on the right is the adversarial one. I don’t see much of a difference, but a ML classifier can be confused by embedding specific tricky patterns into the image.

Such inputs are designed to look the same to a human observer, but modified just slightly in ways that fool the ML classifier. Typically this is done by embedding some patterns invisible to the human eye into the data, that the classifier picks up. These are types of patterns not present in the regular training data, which is why this is not visible during normal training and test. Many such attacks have been found, and ways to design them developed. A taxonomy in (Goodfellow2018) explores the possible ways, difficulty levels vs adversarial goals. The goals range from reducing confidence (sometimes enough to satisfy adversarial needs) to full misclassification (can’t go more wrong than that).

The "bullet-proof" way to create such inputs as described in (Goodfellow2018) is to have full knowledge of the model, its (activation) weights, and input space. To create the desired misclassification, the desired output (misclassification) is first defined, followed by solving the required constraints in the model backwards to the input. This reminds me of the formal methods vs testing discussions that have been ongoing for decades in the "traditional" software engineering side. I have never seen anyone apply formal methods in practice (I do know they are used for safety-critical systems and are very important there), but can see their benefit in really safety critical applications. I guess self-driving (autonomous) cars would qualify as one. And as such this type of approach to provide further assurance on the systems would be very beneficial.

I would expect that the wide-spread practical adoption of such an approach would, however, require this type of analysis to be provided as a "black-box" solution for most to use. I don’t see myself writing a solver that solves a complex DL network backwards. But I do see the benefits, and with reasonable expertise required I could use it. I had trouble following the explanation and calculations in (Goodfellow2018) and believe many could be willing to pay for the tools and services where available (and if having the funding, which I guess should be there if developing something like autonomous cars). As long as I could trust and believe in the tool/service quality.

This topic is quite extensively studied and just searching for adversarial machine learning on the Internets gives plenty of hits. One recent study on the topic is (Elsayed2018) (also with Goodfellow..). It shows some interesting numbers on how many of the adversarial inputs fool the classifier and how many fool a human. The results do not show the adversarial inputs succeeding 100% of the time, so perhaps that and the countermeasures discussed in (Goodfellow2018) and similar papers, such as verifying expected properties of input, could make for a plausible way to implement useful input validation strategies against this, similar to ones I discussed earlier in this post.

Test Coverage for ML applications

Test coverage is one of the basic building blocks of software testing. If you ever want to measure something about your test efforts, test coverage is likely to be at the very top of the list. It is then quite natural to start thinking about measuring test coverage in terms of ML/DL models and their applications. However, similar to ML testing, test coverage in ML is a bit more complicated as a topic.

Traditional software consists of lines of code, written to explicitly implement a specific logic. It makes sense to set a goal of having high test coverage to cover (all) the lines of code you write. The code is written for a specific reason, and you should be able to write a test for that functionality. Of course, the real world is not that simple, and personally, I have never seen 100% code coverage achieved. Of course, there are other criteria, such as requirements coverage as well, that can also be mapped in different ways to code and other artefacts.

As discussed throughout this post, the potential input space for ML models is huge. How do you evaluate the coverage of any tests over such models? For example, two common models referenced in (Ma2018) are VGG-19 and Resnet-50, with 16000 and 94000 neurons over 25 and 176 layers, respectively. Other models such as recent NLP models from OpenAI have much more. Combined with the potential input space, that is quite a few combinations to cover everything.

Coverage criteria for neural nets are explored in (Ma2018), (Sun2018), (Li2019) (and I am sure other places..). As far as I can see, they all take quite a similar approach to coverage measurement and analysis. They take a set of the valid input data and use it to train and test the model. Then measure various properties of the neural nets, the neuron activations, layer activations, etc. as a basis for coverage. Some examples of coverage metrics applied:


  • k-multisection: splits the neuron activation value range to k-sections and measures how many are covered
  • boundary: measures explicitly boundary (top/bottom) coverage of the activation function values
  • strong boundary: same as boundary but only regarding the top value
  • top-k: how many neurons have been the top-k active ones in a layer
  • top-k patterns: pair-wise combination patterns over top-k


  • neurons: neurons that have achieved their activation threshold
  • modified condition/decision coverage: different inputs (neurons) causing activation value of this neuron to change
  • boundary: activation values below/above set top/bottom boundary values
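To give a feel for what such metrics compute, here is a minimal sketch of two of them: plain neuron coverage and k-multisection coverage. This is simplified from how I read the descriptions above, not code from the papers. In particular, I use one shared `[low, high]` range for all neurons, and the activation values are made up.

```python
def neuron_coverage(activations, threshold):
    """Fraction of neurons whose activation exceeded `threshold` for at least one input.

    `activations` is a list with one row per input and one value per neuron."""
    n_neurons = len(activations[0])
    covered = sum(
        1 for j in range(n_neurons) if any(row[j] > threshold for row in activations)
    )
    return covered / n_neurons

def k_multisection_coverage(activations, low, high, k):
    """Fraction of the k equal sections of [low, high] hit, averaged over all neurons."""
    n_neurons = len(activations[0])
    width = (high - low) / k
    hit = 0
    for j in range(n_neurons):
        sections = set()
        for row in activations:
            if low <= row[j] <= high:
                # Clamp so that a value exactly at `high` lands in the last section.
                sections.add(min(int((row[j] - low) / width), k - 1))
        hit += len(sections)
    return hit / (k * n_neurons)

# Made-up activations: 3 test inputs, 4 neurons.
activations = [
    [0.1, 0.9, 0.0, 0.4],
    [0.2, 0.8, 0.0, 0.6],
    [0.7, 0.1, 0.0, 0.5],
]
print(neuron_coverage(activations, threshold=0.5))                  # 0.75
print(k_multisection_coverage(activations, low=0.0, high=1.0, k=4))  # 0.4375
```

In this toy data the third neuron never activates, which is exactly the kind of gap these metrics are meant to point out for generating further inputs.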

The approaches typically use the results to guide where to generate more inputs. Quite often this just seems to be another way to generate further adversarial inputs, or to identify where the adversarial inputs could be improved (increased coverage).

The results in (Ma2018) also indicate how different ML models for the same data require different adversarial inputs to trigger the desired output (coverage). This seems like a potentially useful way to guard against such adversarial inputs as well, by using such "coverage" measures to help select diverse sets of models to run and compare.

The following figure is a partial snippet from (Ma2018), showing some MNIST data:


MNIST is a dataset with images of handwritten digits. In this case, a number of adversarial input generators have been applied on the original input. As visible, the adversarial inputs are perturbed from the original, and in ways which actually do not seem too strange, simply somewhat "dirty" in this case. I can see how robustness could be improved by including these, as in the real world the inputs are not perfect and clean either.

Another viewpoint is presented in (Kim2019), using the term surprise adequacy, to describe how "surprised" the DL system would be based on specific inputs. This surprise is measured in terms of new added coverage over the model. The goal is to produce as "surprising input" as possible, to increase coverage. I assume within limits of realistic inputs, of course. In practice this seems to translate to traversing new "paths" in the model (activations), or distance of activation values from previous activation values.

Some criticism on these coverage criteria is presented in (Li2019). Mainly with regards to whether such coverage criteria can effectively capture the relevant adversary inputs due to the extensive input and search space. However, I will gladly take any gain I can get.

It seems to me the DNN coverage analysis as well as its use are still work in progress, as the field is finding its way. Similarly to metamorphic testing, this is also probably a nice time to publish papers, as shown by all the ones I listed from just the past year here. Hopefully in the following years we will see nice practical tools and methods to make this easier from an application perspective.

Final Thoughts

Many of the techniques I discussed in the above sections are not too difficult to implement. The machine learning and deep learning frameworks and tools have come a long way and provide good support for building classifiers with combinations of different models (e.g., CNN, RNN, etc). I have played with these using Keras and TensorFlow as a backend, and there are some great online resources available to learn from. Of course, the bigger datasets and problems do require extensive efforts and resources to collect and fine-tune.

As a basis for considering the ML/DL testing aspects more broadly, some form of a machine learning/deep learning defect model would seem like a good start. There was some discussion on different defect types in (Zhang2018(2)) from the viewpoint of classifying existing defects. From the viewpoint of this post, it would be interesting to see some related to

  • testing and verification needs,
  • defect types in actual use,
  • how the different types such as adversarial inputs relate to this,
  • misclassifications,
  • metamorphic relations
  • domain adaptations

This type of model would help more broadly analyze and apply testing techniques to real systems and their needs.

Since I have personally spent time working on model-based testing, the idea of applying MBT with MT for ML as in (Lindvall2017) to create new scenarios in combination with MT techniques seems interesting to me. Besides testing some ML algorithms on their own, this could help build more complex overall scenarios.

Related to this, the approaches I described in this post seem to be mostly from a relatively static viewpoint. I did not see consideration for the changing sensor (data collection) environments, which seems potentially important. For example:

  • What happens when you change the cameras in your self-driving car? Add/Remove?
  • The resolution of other sensors? Radar? LIDAR? Add new sensors?
  • How can you leverage your existing test data to provide assurance with a changed overall "data environment"?

The third one would be a more general question, while the first two would change with domains. I expect testing such changes would benefit from metamorphic testing, where the change being tested is from the old to the new configuration, and relations measure the change in the ultimate goal (e.g., driving guidance).

Data augmentation is a term used in machine learning to describe new and possibly synthetic data added to existing model training data to improve model performance. As I already discussed in some parts of this post above, it can be tempting to just consider the data generated by metamorphic testing as more training data. I think this can make sense, but I would not just blindly add all possible data. I would rather keep the focus on the testing part, evaluating the robustness and overall performance of the model. Keeping that in mind when adding more data makes it possible to still perform as much verification as possible independently (outside the training data). And I would not add all possible data if it shows no added benefit in testing.

Overall, I would look forward to getting more advanced tools to apply the techniques discussed here. Especially in areas of:

  • metamorphic testing and related application to ML models and related checks
  • GAN application frameworks, on top of other frameworks like Keras or otherwise easily integratable in other tools
  • adversarial input generation tools using different techniques
  • job offers 🙂

It's an interesting field to see how ML applications develop, autonomous cars and everything else. Now I should get back to Kaggle and elsewhere to build some mad skills in this constantly evolving domain, eh.. 🙂


A. Dwarakanath et al., "Identifying implementation bugs in machine learning based image classifiers using metamorphic testing", 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2018.

J. Ding, X-H. Hu, V. Gudivada, "A Machine Learning Based Framework for Verification and Validation of Massive Scale Image Data", IEEE Transactions on Big Data, 2017.

G. F. Elsayed et al., "Adversarial Examples that Fool both Computer Vision and Time-Limited Humans", 32nd Conference on Neural Information Processing Systems (NeurIPS), 2018.

I. Goodfellow, P. McDaniel, N. Papernot, "Making Machine Learning Robust Against Adversarial Inputs", Communications of the ACM, vol. 61, no. 7, 2018.

H. Hosseini et al., "On the Limitation of Convolutional Neural Networks in Recognizing Negative Images", 6th IEEE International Conference on Machine Learning and Applications, 2017.

J. Kim, R. Feldt, S. Yoo, "Guiding Deep Learning System Testing using Surprise Adequacy", International Conference on Software Engineering (ICSE), 2019.

Z. Li et al., "Structural Coverage Criteria for Neural Networks Could Be Misleading", International Conference on Software Engineering (ICSE), 2019.

M. Lindvall et al., "Metamorphic Model-based Testing of Autonomous Systems", IEEE/ACM 2nd International Workshop on Metamorphic Testing (MET), 2017.

M-Y. Liu, T. Breuel, J. Kautz, "Unsupervised Image-to-Image Translation Networks", Advances in Neural Information Processing Systems (NIPS), 2017.

L. Ma et al., "DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems", 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE), 2018.

H.V. Pham, T. Lutellier, W. Qi, L. Tan, "CRADLE: Cross-Backend Validation to Detect and Localize Bugs in Deep Learning Libraries", International Conference on Software Engineering (ICSE), 2019.

Y. Sun et al., "Concolic testing for deep neural networks", 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE), 2018.

Y. Tian et al., "DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars", 40th International Conference on Software Engineering (ICSE), 2018.

M. Zhang (1) et al., "DeepRoad: GAN-Based Metamorphic Testing and Input Validation Framework for Autonomous Driving Systems", 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE), 2018.

Y. Zhang (2) et al., "An Empirical Study on TensorFlow Program Bugs", 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2018.

Z. Q. Zhou, S. Xiang, T. Y. Chen, "Metamorphic Testing for Software Quality Assessment: A Study of Search Engines", IEEE Transactions on Software Engineering, vol. 42, no. 3, 2016.

Z. Q. Zhou, L. Sun, "Metamorphic Testing of Driverless Cars", Communications of the ACM, vol. 62, no. 3, 2019.

Machine Learning in Software Testing

Early 2019 Edition


Software testing has not really changed all that much in the past decades. Machine learning on the other hand is a very rapidly evolving technology being adopted all over the place. So what can it bring to software testing?

Back in 2018 (so about a year ago from now) I did a review of machine learning (ML) (and deep learning (DL)) applications in Network Analysis and Software Testing. After spending some more time learning and trying ML/DL in practice, this is an update on the ML for testing part, reflecting on my own learnings and some developments over this past year. Another interesting part would be testing ML systems. I will get to that in another post.

In last year's review, I focused on several topics. A recent academic study (Durelli2019) in this area also lists a number of topics. This includes topics such as "learning test oracles", which basically translates to learning a model of a system's behaviour based on some observations or other data about the software behaviour. Last year I included this under the name of specification mining. In practice, I have found such learned behavioral models to be of limited use, and have not seen general uptake anywhere in practice. In this review I focus on fewer topics that I find more convincing for practical use.

I illustrate these techniques with this nice pic I made:


In this pic, the "Magic ML Oracle" is just a ML model, or a system using such a model. It learns from a set of inputs during the training phase. In the figure above this could be some bug reports linked to components (file handling, user interface, network, …). In the prediction phase it runs as a classifier, predicting something such as which component a reported issue should be assigned to, how fault-prone an analysis is (e.g., how to focus testing), or how tests and specs are linked (in case of missing links).

The topics I cover mainly relate to using machine learning to analyze various test related artefacts as in the figure above. One example of this is the bug report classifier I built previously. Since most of these ML techniques are quite general, just applied to software testing, ideas from broader ML applications could be useful here as well.

Specifically, software testing is not necessarily that different from other software engineering activities. For example, Microsoft performed an extensive study (Kim2017) on their data scientists and their work in software engineering teams. This work includes bug and performance analysis and prioritization, as well as customer feedback analysis, and various other quality (assurance) related topics.

As an example of concrete ML application to broader SW engineering, (Gu2018) maps natural language queries to source code to enable code search. To train a DL model for this, one recurrent neural network (RNN) based model is built for the code description (from source comments), and another one for the source code. The output of these two is a numerical feature vector. Cosine similarity is a measure used to compare how far apart two such vectors are, and here it is used as the training loss function. This is a nice trick to train a model to map source code constructs to natural language "constructs", enabling mapping short queries to new code in similar ways. It is also nicely described in the morning paper. I see no reason why such queries and mappings would not work for test related searches and code/documents as well. In fact, many of the techniques I list in following sections use quite similar approaches.
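For reference, cosine similarity itself is a small calculation. The sketch below also shows one simple way to turn it into a loss value (one minus the similarity). That is my own illustration, and the exact loss formulation in (Gu2018) may well differ. The example vectors are made up.

```python
import math

def cosine_similarity(a, b):
    """1.0 for vectors pointing the same way, 0.0 for orthogonal, -1.0 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_loss(a, b):
    """A simple loss: 0 when the vectors point the same way, up to 2 when opposite."""
    return 1.0 - cosine_similarity(a, b)

# Made-up feature vectors for a code snippet and two descriptions.
code_vector = [0.9, 0.1, 0.3]
matching_description = [0.8, 0.2, 0.3]
unrelated_description = [0.0, 0.9, -0.2]
print(cosine_loss(code_vector, matching_description))
print(cosine_loss(code_vector, unrelated_description))
```

Training then pushes the vectors of matching code and description pairs towards a low loss, which is what later lets a short query find similar code.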

Like I said, I am focusing on a smaller set of more practical topics than last year in this still-overly-long post. The overall idea of how to apply these types of techniques in testing in general should not be too different though. This time, I categorize topics to test prioritization, bug report localization, defect prediction, and traceability analysis. One section to go over each, one to rule over them all.

Test Prioritization

As (software) organizations and projects grow over time, their codebase tends to grow with them. This would lead to also having more tests to cover that codebase (hopefully…). Running all possible test cases all the time in such a scenario is not always possible or cost-efficient, as it starts to take more and more time and resources. The approach to selecting a subset of the tests to execute has different names: test prioritization, test suite optimization, test minimization, …

The goal is to cover as much of the fault-prone areas as possible with fewer tests, such as in this completely made up image of mine to illustrate the topic:

Prioritized coverage

Consider the coverage % in the above to reflect, for example, covering changes since the last time tests were run. The aim is to verify that the changes did not break anything, as fast and efficiently as possible.

An industrial test prioritization system used at Google (in 2017) is described in (Memon2017). This does not directly discuss using machine learning (although it mentions it as a future plan for the data). However, I find it interesting for general data-analysis of testing related data as a basis for test prioritization. It also works to provide a basis for a set of features for ML algorithms, as understanding and tuning your data is really the basis for all ML applications.

The goal in this (Memon2017) case is two-fold: better utilizing test resources (focus on potentially failing tests) and providing feedback to the developers about their commits. The aim is not 100% accurate predictions but rather focusing automated test execution and providing the developer with feedback such as "this commit is 95% more likely to cause breakage due to the code being touched by 5 developers in the past 10 days, and being written in Java". The developers can use this feedback to seek further assurance in additional reviews, more testing, static analysis, and so on.

Some of the interesting features/findings described in (Memon2017):

  • Only about 1% of the tests ever failed during their lifetime.
  • Thus about 99% speedup would be possible if right tests could be identified.
  • Use of dependency distance as a feature: What other component depends on the changed component, and through how many other components
  • Test targets further away from the change are much less likely to fail. So dependency distance seems like a useful prediction (feature) metric. They used a threshold of 10 for their codebase, which might vary by project but the idea likely holds.
  • Files/targets modified more often are more likely to cause breakages.
  • File type affects likelihood of breakage.
  • User/tool id affects likelihood of breakage.
  • Number of developers having worked on single file in a short time affects likelihood of breakage. More developers means higher likelihood to break.
  • The number of test targets affected by a change varies greatly, maybe requiring different treatment.
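The dependency distance feature is basically a shortest-path calculation over the build dependency graph. A minimal sketch using breadth-first search, where the graph, the component names, the `test_` naming convention, and the distance limit are all made up for illustration and not taken from (Memon2017):

```python
from collections import deque

def dependency_distances(dependents, changed):
    """Breadth-first search from the changed component over "is depended on by" edges."""
    distances = {changed: 0}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for target in dependents.get(current, []):
            if target not in distances:
                distances[target] = distances[current] + 1
                queue.append(target)
    return distances

def select_tests(dependents, changed, max_distance):
    """Pick the test targets within max_distance of the changed component."""
    distances = dependency_distances(dependents, changed)
    return sorted(
        t for t, d in distances.items() if t.startswith("test_") and d <= max_distance
    )

# Hypothetical graph: each key is depended on by the listed values.
dependents = {
    "core": ["io", "test_core"],
    "io": ["ui", "test_io"],
    "ui": ["test_ui"],
}
print(dependency_distances(dependents, "core"))
print(select_tests(dependents, "core", max_distance=2))
```

The distance can be used either as a hard cutoff like here, or just as one numeric feature among others for a model.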

A similar set of features is presented in (Bhagwan2018):

  • Developer experience: Developer time in the organization and project
  • Code ownership: More developers changing files/components cause more bugs
  • Code hotspots: Specific parts of code that cause issues when changed
  • Commit complexity: Number of changes, changed files, review comments in a single commit. More equals more bugs.

A test prioritization approach taken at is described in (Busjaeger2016). They use five types of features:

  • code coverage
  • text path similarity
  • text content similarity
  • test failure history
  • test age

In (Busjaeger2016), the similarity scores are based on TF-IDF scores and their cosine similarity calculation. TF-IDF simply weights the frequency of words in a document against the frequency of the same word in all other documents, to identify the most specific terms for document types. The features are fed into a support-vector model to rank tests to execute first. In their (Busjaeger2016) experience, about 3% of overall tests are required to reach about 75% coverage. From the predicted tests, about every 5th is found to be causing a failure.
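The TF-IDF and cosine similarity scoring described above can be sketched with just the standard library. This only covers the text similarity feature, with tests ranked directly by it. The support-vector ranking model on top is left out, and the test names and tokens are made up.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """documents: dict name -> list of tokens. Returns dict name -> {term: weight}."""
    n_docs = len(documents)
    doc_freq = Counter(term for tokens in documents.values() for term in set(tokens))
    vectors = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_tests(change_tokens, tests):
    """Order test names by TF-IDF cosine similarity to the changed content."""
    documents = dict(tests, __change__=change_tokens)
    vectors = tf_idf_vectors(documents)
    change = vectors.pop("__change__")
    return sorted(tests, key=lambda name: cosine(change, vectors[name]), reverse=True)

# Made-up test contents and a made-up code change.
tests = {
    "test_login": "login user password session".split(),
    "test_upload": "file upload storage user".split(),
    "test_report": "report pdf export".split(),
}
change = "password login reset".split()
print(rank_tests(change, tests))
```

In a real setup the tokens would come from file paths and file contents of the tests and the change, giving the path and content similarity features separately.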

I find the above provide good examples of data analysis, and basis for defining ML features.

In the (Durelli2019) review, several studies are listed under test prioritization, but these mostly do not strike me as very realistic examples of ML applications. However, one interesting approach is (Spieker2017), which investigates using reinforcement learning for test prioritization. It uses only three features: execution time (duration), last execution (whatever that means..), and failure history. These seem a rather simple set of features to build a complex model, and it seems likely to me that a simple model would also work here. The results in (Spieker2017) are presented as good but not investigated in depth so hard to say from just that. However, I did find the approach to present some interesting ideas in relation to this:

  • Continuous integration systems constantly execute the test suites, so you will have a lot of constantly updating data available about test suites, their execution, and results
  • Continuously updating the model over time based on the last N test runs
  • Using a higher exploration rate over full suite to bootstrap the model, lowering over time when it has learned but not setting to zero
  • Using "test case" as model state, and assigning it a priority as an action
  • Listing real "open-data" industry-based datasets to evaluate prioritization ML models on

I would be interested to see how well a simple model, such as Naive Bayes, weighting the previous pass/fails and some pattern over their probability would work. But from the paper it is hard to tell. In any case, the points above would be interesting to explore further.
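
Just to illustrate what I mean by a simple model, here is a sketch of an even simpler baseline than Naive Bayes: a recency-weighted failure rate per test, used to rank the tests. The decay value and the test histories are made up, and this is not the approach in (Spieker2017).

```python
def failure_score(history, decay=0.8):
    """Recency-weighted failure rate.

    history is a list of booleans, oldest run first, True meaning the run failed.
    The newest run has weight 1, and each older run is weighted down by `decay`.
    """
    score, total_weight = 0.0, 0.0
    weight = 1.0
    for failed in reversed(history):  # newest run first
        score += weight * (1.0 if failed else 0.0)
        total_weight += weight
        weight *= decay
    return score / total_weight if total_weight else 0.0

# Hypothetical pass/fail histories, oldest first (True = failed).
histories = {
    "test_login": [False, False, True, True],
    "test_export": [True, False, False, False],
    "test_search": [False, False, False, False],
}
ranked = sorted(histories, key=lambda name: failure_score(histories[name]), reverse=True)
```

A test that failed recently ranks above one that failed long ago, which ranks above one that never failed.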

A Thought (maybe Two)

I assume ML has been applied to test prioritization, just not so much documented. For example, I expect Google would have taken their studies further and used ML for this as discussed in their report (Memon2017). Test prioritization seems like a suitably complex problem, with lots of easily accessible data, and with a clear payoff in sight, to apply ML. The more tests you have, the more you need to execute, the more data you get, the more potential benefit.

In this as in many advanced research topics, I guess the "killer app" might come by integrating all this into a test system / product as a black-box. This would enable everyone to make use of it without requiring to learn all the "ML in test" details outside their core business. Same I guess applies to the other topics I cover in the following sections.

Bug Report Localization

Bug report localization (in this case anyway) involves taking a bug report and finding the component or other part of the software that the report is most likely to concern. Various approaches aim to automate this process by using machine learning algorithms. My previous example is one instance of building such a classifier.

I made a pretty picture to illustrate this:

Localization Oracle

Typically a bug report is expressed in natural language (at least partially, with code snippets embedded). These are fed to the machine learning classifier (magic oracle in the pic above), which assigns it to 1-N potential components. Component granularity and other details may vary but this is the general idea.

For this, code structural elements used as features include (Tufano2018):

  • sequences of abstract syntax tree (AST) nodes
  • sequences of control-flow graph (CFG) nodes
  • bytecode representations. This seems interesting in mapping the code to fewer shared elements (opcodes)

Other features include (Lam2017):

  • camel-case splitting source code (n-grams would seem a natural fit too)
  • time since a file was previously changed when fixing a bug
  • how many bugs overall have been fixed in a file
  • similarity between a bug report and previous bug reports (and what were they assigned to)

Besides using such specific code structures as inputs, specific pre-processing steps are also taken. These include (Tufano2018, Li2018):

  • replacing constant values with their types,
  • splitting camel-case,
  • removing low-level detailed abstract syntax tree (AST) nodes,
  • filtering out methods less than 10 lines long,
  • regular expressions to remove code format characters, and to identify code snippets embedded into the bug report.
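
Two of the pre-processing steps above, camel-case splitting and replacing constants with their types, can be sketched with regular expressions. The regexes and the placeholder tokens (<STR>, <INT>, <FLOAT>) are my own simplifications for illustration, not what the cited papers use.

```python
import re

def split_camel_case(identifier):
    """Split e.g. 'parseHttpRequest' into ['parse', 'http', 'request']."""
    parts = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", identifier)
    return [part.lower() for part in parts]

def replace_constants(code):
    """Replace string and numeric literals with placeholder type tokens."""
    code = re.sub(r'"[^"\n]*"', "<STR>", code)       # simple double-quoted strings
    code = re.sub(r"\b\d+\.\d+\b", "<FLOAT>", code)  # decimals before integers
    code = re.sub(r"\b\d+\b", "<INT>", code)
    return code

tokens = split_camel_case("parseHttpRequest")
cleaned = replace_constants('int x = 42; float y = 3.14; String s = "hi";')
```

A real implementation would more likely work on parser output than on raw text, since regexes like these do not handle escaped quotes, comments, or other corner cases.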

An industrial case study on bug localization from Ericsson is presented in (Johnson2016). Topic models built with Latent Dirichlet Allocation (LDA) are learned from the set of bug reports. These are used to assign topic weights to bug reports based on the bug report text. The assigned weights are compared to the learned topic distribution for components, and the higher the match of distributions in the report vs learned component model, the higher the probability to assign the bug report to that component.

Vector Space Model (VSM) was used as a baseline comparison in many cases I found. This is based on TF-IDF scores (vectors) calculated for a document. Similarity between a bug report and source code files in VSM is calculated as a cosine similarity between their TF-IDF vectors. Revised Vector Space Model (rVSM) (Zhou2012) is a refinement of VSM that weights larger documents more, reasoning that bugs are more often found in larger source files. (Zhou2012) also adds weighting from similarity with previous bug reports.
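
The idea of weighting larger documents more can be sketched as multiplying the cosine similarity by a function of document length. Here I assume a logistic function over the min-max normalized number of terms; see (Zhou2012) for the exact formulation they use. The numbers are made up.

```python
import math

def length_weight(n_terms, min_terms, max_terms):
    """Logistic weight between 0.5 and about 0.73, growing with document length."""
    if max_terms == min_terms:
        normalized = 0.0
    else:
        normalized = (n_terms - min_terms) / (max_terms - min_terms)
    return 1.0 / (1.0 + math.exp(-normalized))

def length_weighted_score(cosine, n_terms, min_terms, max_terms):
    """Scale a cosine similarity by the length weight of the source file."""
    return length_weight(n_terms, min_terms, max_terms) * cosine

# Hypothetical: two files with the same cosine similarity to a bug report.
small_file = length_weighted_score(0.4, n_terms=50, min_terms=50, max_terms=2000)
large_file = length_weighted_score(0.4, n_terms=2000, min_terms=50, max_terms=2000)
```

With equal text similarity, the larger file ends up ranked higher.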

Building on rVSM, (Lam2017) uses an auto-encoder neural network on TF-IDF weighted document terms to map different terms with similar meaning together for more accurate bug localization. Similarly, the "DeepSum" work (Li2018) uses an auto-encoder to summarize bug reports, and to compare their TF-IDF distance with cosine similarity. To me this use of auto-encoders seems like trying to re-invent word-embeddings for no obvious gain, but probably I miss something. After auto-encoding, (Lam2017) combines a set of features using a deep neural network (multi-layer perceptron (MLP) it seems) for final probability evaluation. In any case, word-embedding style mapping of words together in a smaller dimension is found in these works as others.

A Thought or Two

I am a bit surprised not to see much work in applying RNN type networks such as LSTM and GRU into these topics, since they are a great fit for processing textual documents. In my experience they are also quite powerful compared to traditional machine learning methods.

I think this type of bug report localization has practical relevance mainly for big companies with large product teams and customer bases, and complex processes (support levels, etc). This is in domains like telcos, from which the only clear industry report I listed here is from Ericsson (Johnson2016). Something I have found limiting these types of applications in practice is also the need for cross-domain vision to combine these topics and expertise. People seem often quite narrowly focused on specific areas. Black-box integration with common tools might help, again.

Defect Prediction

Software defect prediction refers to predicting which parts of the software are most likely to contain faults. Sometimes this is also referred to as fault proneness analysis. The aim is to provide additional information to help focus testing efforts. This is actually very similar to the bug report localization I discussed above, but with the goal of predicting where currently unknown bugs might appear (vs localizing existing issue reports).

An overall review of this area is presented in (Malhotra2015), showing an extensive use of traditional ML approaches (random forest, decision trees, support vector machines, etc.) and traditional source code metrics (lines of code, cyclomatic complexity, etc.) as features. These studies show reasonably good accuracies, ranging from 75% up to 93%.

However, another broad review on these approaches and their effectiveness is presented in (Zhou2018). It shows how simply using larger module size to predict higher fault proneness would give equal or better accuracy in many cases. This matches my experience from many contexts: keeping it simple is often very effective. But on the other hand, finding that simplicity can be the real challenge, and you can learn a lot by trying different approaches.

More recently, deep learning based approaches have also been applied to this domain. Deep Belief Nets (DBN) are applied in (Wang2018) to generate features from source code AST, and combined with more traditional source code metrics. The presentation on DBNs in (Wang2018) is a bit unclear to me, but it seems very similar to an MLP. The output of this layer is then termed (as far as I understand) a "semantic feature vector". I looked a bit into the difference of DBN vs MLP, and found some practical discussion at a Keras issue. Make what you will of it. Do let me know if you figure it out better than I did (what is the difference in using an MLP style fully connected dense layer here vs DBN).

An earlier version of the (Wang2018) work is refined and further explored using convolutional neural networks (CNNs) in (Li2017). In this case, a word2vec word-embedding layer is used as the first layer, and trained on the source and AST vocabulary. This is fed into a 1-dimensional CNN, which is one of the popular deep learning network types for text processing. After passing through this part of the network, the output feature vector is merged with a set of the more traditional source metrics (lines of code, etc). The merged vector is passed through the final network layers, ending in a single-node output layer that gives the probability prediction.

Illustration of this network:

Metrics based model

To address class imbalance (more "clean" than "buggy" files), (Li2017) uses duplication of the minority class instances. They also compare to traditional metrics as well as the DBN from (Wang2018) and DBN+, which combines the traditional features with the DBN "semantic" features. As usual for research papers, they report getting better results with the CNN+ version. Everyone seems to do that, yet perfection seems never to be achieved, or even nearly. Strange.
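
Duplicating minority class instances is simple enough to sketch in a few lines. This is my own minimal version with made-up file names, picking random minority samples to duplicate until the classes are the same size. The details in (Li2017) may differ.

```python
import random

def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all classes
    have as many samples as the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend((sample, label) for sample in group + extra)
    rng.shuffle(balanced)
    return balanced

# Hypothetical data: four clean files and one buggy file.
files = ["a.java", "b.java", "c.java", "d.java", "e.java"]
labels = ["clean", "clean", "clean", "clean", "buggy"]
balanced = oversample_minority(files, labels)
```

This kind of duplication should only be applied to the training split, otherwise copies of the same sample leak into the evaluation data.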

A Thought

The evolution in defect prediction seems to be from traditional classifiers with traditional "hand-crafted" (source metrics) features to deep-learning generated and AST-based features in combination with traditional metrics. Again, I did not see any RNN based deep-learning classifier applications, although I would expect they should be quite well suited for this type of analysis. Perhaps next time.

Traceability Analysis

Despite everyone being all Agile now, heavier processes such as requirements traceability can still be needed. Especially for complex enough systems, and ones with heavy regulatory- or standards-based compliance requirements. Medical, telco, automotive, … In the real world, such traces may not always be documented, and sometimes it is of interest to find them.

A line of work exploring the use of deep learning for automating the generation of traceability links between software artefacts is in (Guo2017, Rath2018). These are from the same major software engineering conference (ICSE) over two following years, with some author overlap. So there is some traceability in this work too, heh-heh (such joke, much non-academic). The first one (Guo2017) aims to link requirements to design and test artefacts in the train control domain. The second one aims to link code submissions to issues/bug reports on Github.

Requirements documents

Using recurrent neural networks (RNN) to link requirements documents to other documents is investigated in (Guo2017). I covered this work to some extent already last year, but let's see if I can add something with what I learned since.

Use cases for this as mentioned in (Guo2017):

  • Finding new, missing (undocumented) links between artefacts.
  • Train on a set of existing data for existing projects, apply to find links within a new project. This seems like a form of transfer learning, and is not explored in the paper. It focuses on the first bullet.

I find the approach used in (Guo2017) interesting, linking together two recurrent neural network (RNN) layers from parallel input branches for natural language processing (NLP):

Requirements linking NN

There are two identical input branches (top of figure above). One for the requirements documents, and one for the target document for which the link is assessed. Let’s pretend the target is a test document to stay relevant. A pair of documents is fed to different input branches of the network, and the network outputs a probability of these two documents being linked.

In ML you typically try different model configurations and hyperparameters to find what works best. In (Guo2017) they tried different types of layers and parameters. The figure above shows what they found best for their task. See (Guo2017) for the experiment details for other parameter values. Here, a bi-directional gated recurrent unit (bi-GRU) layer is used to process each document into a feature vector.

When the requirements document and the target document have been transformed by this to a vector representation, they are fed into a pointwise multiplication layer (mul) and to a vector subtraction layer (sub). In Keras these would seem to map to the Multiply and Subtract merge layers. These merge layers are intended to describe the vector difference direction (mul) and distance (sub) across dimensions. A dense layer with sigmoid activation is used to integrate these two merges, and the final output is given by a 2-neuron softmax layer (linked/not linked probability).
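
To illustrate just the merge part described above, here is a plain-Python sketch of the forward pass from two document vectors to the link probability. The bi-GRU encoders are left out, the vectors and weights are made-up numbers (a real network would learn them), and the order of the two output neurons is my assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(values):
    shifted = [math.exp(v - max(values)) for v in values]
    total = sum(shifted)
    return [v / total for v in shifted]

def dense(inputs, weights, biases, activation):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def link_probability(req_vec, target_vec, hidden_w, hidden_b, out_w, out_b):
    mul = [a * b for a, b in zip(req_vec, target_vec)]  # pointwise multiplication
    sub = [a - b for a, b in zip(req_vec, target_vec)]  # vector subtraction
    hidden = dense(mul + sub, hidden_w, hidden_b, sigmoid)
    logits = dense(hidden, out_w, out_b, lambda x: x)
    return softmax(logits)  # assumed order: [P(not linked), P(linked)]

# Hypothetical 3-dimensional document vectors and arbitrary weights.
req_vec = [0.2, -0.5, 0.9]
target_vec = [0.1, -0.4, 0.8]
hidden_w = [[0.5, -0.2, 0.1, 0.3, 0.3, -0.6], [-0.4, 0.7, 0.2, -0.1, 0.5, 0.2]]
hidden_b = [0.0, 0.1]
out_w = [[0.6, -0.3], [-0.5, 0.8]]
out_b = [0.0, 0.0]
probs = link_probability(req_vec, target_vec, hidden_w, hidden_b, out_w, out_b)
```

The two merged vectors are simply concatenated before the dense layer, so the dense layer sees both the multiplication and the subtraction results.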

For word-embeddings they try both a domain specific (train-control systems in this case) embedding with 50-dimensions, and a 300-dimensional one combining the domain-specific data with a Wikipedia text dump. They found the domain specific one works better, speculating it to be due to domain-specific use of words.

Since this prediction can produce many different possibilities in a ranked order, simple accuracy of the top choice is not "accurate" itself as an evaluation metric. For evaluating the results, (Guo2017) uses mean average precision (MAP) as a metric. The MAP achieved in (Guo2017) is reported up to 83.4%. The numbers seem relatively good, although I would need to play with the results to really understand everything in relation to this metric.
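
MAP is commonly expanded as mean average precision. To get a feel for it, here is a minimal sketch of how it is typically calculated over ranked results. The document IDs and links are invented.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant item appears."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """queries is a list of (ranked_candidates, set_of_true_links) pairs."""
    if not queries:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

queries = [
    (["t1", "t2", "t3"], {"t1", "t3"}),  # AP = (1/1 + 2/3) / 2
    (["t4", "t5", "t6"], {"t5"}),        # AP = 1/2
]
map_score = mean_average_precision(queries)
```

So the score rewards putting the true links near the top of the ranking, not just getting the first one right.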

An interesting point from (Guo2017) is a way to address class imbalances. The set of requirements and other documents that have valid links is a small fraction of the overall set. So the imbalance between the true and false labels is big. They address this by selecting an equal set of true and false labels for an epoch, and switching the set of false label items at the start of each epoch. So all the training data is processed, while a balance is held in each epoch. Nice.
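
My understanding of this per-epoch balancing can be sketched as below: every epoch gets all the positive (linked) pairs plus an equally sized slice of the negatives, and the slice rotates between epochs. The names and data are hypothetical, and the paper may pick the negatives differently (e.g., randomly per epoch).

```python
import random

def balanced_epochs(positives, negatives, n_epochs, seed=0):
    """Yield one training set per epoch: all positives plus an equally sized,
    rotating slice of the shuffled negatives."""
    rng = random.Random(seed)
    pool = list(negatives)
    rng.shuffle(pool)
    size = len(positives)
    start = 0
    for _ in range(n_epochs):
        chosen = [pool[(start + i) % len(pool)] for i in range(size)]
        start = (start + size) % len(pool)
        epoch = [(p, 1) for p in positives] + [(n, 0) for n in chosen]
        rng.shuffle(epoch)
        yield epoch

positives = ["link_a", "link_b"]
negatives = [f"nonlink_{i}" for i in range(6)]
epochs = list(balanced_epochs(positives, negatives, n_epochs=3))
```

With 2 positives and 6 negatives, three epochs are enough to go through every negative once, while each epoch stays balanced.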

Github Issues

Traceability for linking code commits to bug tracker issues and improvement tickets ("bugs" and "improvement" in project Jira) is presented in (Rath2018). The studied projects are 6 open-source projects written in Java. Unlike the previous study on requirements linking, this study does not use deep-learning based approaches but rather manual feature engineering and more traditional ML classifiers (decision trees, naive bayes, random forest).

This is about mapping issue reports to commits that fix those issues:

Github Issue Linking

Besides more traditional features, (Rath2018) also makes use of time related aspects as extra filtering features. A training set is built by finding commit messages that reference affected issue IDs. The features used include:

  • Timestamp of commit. Has to be later than creation timestamp for potential issue it could be linked to. Has to be inside given timeframe since issue was marked resolved.
  • Closest commit before analyzed commit, and its linked issues.
  • Closest commit after analyzed commit, and its linked issues.
  • Committer id
  • Reporter id
  • Textual similarity of commit source/message and issue text. TF-IDF weighted word- and ngram-counts.
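
The timestamp-based filtering in the first bullet could look something like the sketch below. The 7-day window, the handling of unresolved issues, and the function name are my assumptions, not values from (Rath2018).

```python
from datetime import datetime, timedelta

def is_candidate(commit_time, issue_created, issue_resolved, window=timedelta(days=7)):
    """Return True if, time-wise, the commit could be linked to the issue."""
    if commit_time < issue_created:
        return False  # the commit predates the issue
    if issue_resolved is not None and commit_time > issue_resolved + window:
        return False  # too long after the issue was marked resolved
    return True  # unresolved issues have no upper bound here (assumption)

# Hypothetical issue created on Jan 1 and resolved on Jan 10.
created = datetime(2019, 1, 1)
resolved = datetime(2019, 1, 10)
before_resolution = is_candidate(datetime(2019, 1, 9), created, resolved)
before_creation = is_candidate(datetime(2018, 12, 31), created, resolved)
long_after = is_candidate(datetime(2019, 2, 1), created, resolved)
```

Filtering like this cuts down the number of commit-issue pairs before the more expensive text similarity features are calculated.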

The study in (Rath2018) looks at two different types of analysis for the accuracy of the ML classifier trained. In the first case they try to "reconstruct known links", and in the second case "construct unknown links". They further consider two different scenarios: recommending links for commits, and fully automated link generation. For assistance, their goal is to have the correct link tag in the top 3 suggestions. The automated tagging scenario requires the first predicted tag to be correct.

Not surprisingly, the top 3 approach has better results as it gives the classifier more freedom and leeway. Their results are reported with up to 95%+ recall but with a precision of around 30%. This seems to be in line with what I saw when I tried to build my issue categorization classifier. The first result may not always be correct but many good ones are in the top (and with too many possibilities, even the "correct" one might be a bit ambiguous).

The second use case of constructing previously unknown links sounds to me like it should be very similar in implementation to the first one, but it appears not. The main difference comes from there being large numbers of commits that do not actually address a specific Jira issue or ticket. The canonical example given is a refactoring commit. The obvious (in hindsight) result seems to state you are more likely to find a link if one is known to exist (case 1) vs finding one if it might not exist at all (case 2) :).

A Thought or Two

The point of the requirements linking approach finding the domain-specific word-embeddings better is interesting. In my previous LSTM bug predictor, I found domain-specific training helps in a similar way, although in that case combining with the pre-trained word-embeddings worked nicely as well. Of course, I used a large pre-trained GloVe embedding for that and did not train on Wikipedia myself. And I used GloVe vs Word2Vec, but I would not expect a big difference.

However, the domain-specific embeddings performance sounds similar to ELMo, BERT, and other recent developments in context-sensitive embeddings. By training only on the domain-specific corpus, you likely get more context-sensitive embeddings for the domain. Maybe the train-control domain in (Guo2017) has more specific vocabulary, or some other attributes that make the smaller domain-specific embedding alone better? Or maybe the type of embedding and its training data makes the difference? No idea. Here’s hoping ELMo style contextual embeddings are made easy to add to Keras models soon, so I can more broadly experiment with those as well. In my obvious summary, I guess it is always better to try different options for different data and models.

Parting Notes

I tried to cover some different aspects of ML applications in software testing. The ones I covered seem to have quite a lot in common. In some sense, they are all mapping documents together. The sets of features are also quite similar: "traditional" source code metrics along with NLP features. Many specific metrics have also been developed as I listed above, such as modification and modifier (commit author) counts. Deep learning approaches are used to some extent, but they still seem to be making their way in this domain.

Besides what I covered, there are of course other approaches to apply ML to SW testing. I covered some last year, and (Durelli2019) covers much more from an academic perspective. But I found the ones I covered here to be a rather representative set of the ones I consider closest to practical today. If you have further ideas, happy to hear.

In general, I have not seen much of ML applied in meaningful ways to software testing. One approach I have used in past is to use ML as a tool for learning about a test network and its services (Kanstren2017). I am not sure if that really qualifies for a ML application to software testing, since it investigated properties of the test network itself and its services, not the process of testing. Perhaps the generalization of that is in "using machine learning with testing technologies". This would be different from applying ML to testing itself, as well as different from testing ML applications. Have to think about that.

Next I guess I will see if/when I have some time to look at the testing ML applications part. With all the hype on self-driving cars and everything else, that should be interesting.

See, I made this nice but too small text picture of the three facets of ML and SW Testing I listed above:

Test vs ML facets


R. Bhagwan et al., "Orca: Differential Bug Localization in Large-Scale Services", 13th USENIX Symposium on Operating Systems Design and Implementation, 2018.

B. Busjaeger, T. Xie, "Learning for Test Prioritization: An Industrial Case Study", 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE), 2016.

N. DiGiuseppe, J.A. Jones, "Concept-Based Failure Clustering", ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE), 2012.

V. H. S. Durelli et al., "Machine Learning Applied to Software Testing: A Systematic Mapping Study", IEEE Transactions on Reliability, 2019.

X. Gu, H. Zhang, S. Kim, "Deep code search", 40th International Conference on Software Engineering (ICSE), 2018.

J. Guo, J. Cheng, J. Cleland-Huang, "Semantically Enhanced Software Traceability Using Deep Learning Techniques", 39th IEEE/ACM International Conference on Software Engineering (ICSE), 2017.

L. Johnson, et al., "Automatic Localization of Bugs to Faulty Components in Large Scale Software Systems Using Bayesian Classification", IEEE International Conference on Software Quality, Reliability and Security (QRS), 2016.

T. Kanstren, "Experiences in Testing and Analysing Data Intensive Systems", IEEE International Conference on Software Quality, Reliability and Security (QRS, industry track), 2017.

M. Kim, et al., "Data Scientists in Software Teams: State of the Art and Challenges", IEEE Transactions on Software Engineering, vol. 44, no. 11, 2018.

A. N. Lam, A. T. Nguyen, H. A. Nguyen, T. N. Nguyen, "Bug Localization with Combination of Deep Learning and Information Retrieval", IEEE International Conference on Program Comprehension, 2017.

J. Li, P. He, J. Zhu, M. R. Lyu, "Software Defect Prediction via Convolutional Neural Network", IEEE International Conference on Software Quality, Reliability and Security (QRS), 2017.

X. Li et al., "Unsupervised deep bug report summarization", 26th Conference on Program Comprehension (ICPC), 2018.

R. Malhotra, "A systematic review of machine learning techniques for software fault prediction", Applied Soft Computing, vol. 27, 2015.

A. Memon et al., "Taming Google-scale continuous testing", 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), 2017.

M. Rath, J. Rendall, J.L.C. Guo, J. Cleland-Huang, P. Mäder, "Traceability in the wild: Automatically Augmenting Incomplete Trace Links", 40th IEEE/ACM International Conference on Software Engineering (ICSE), 2018.

M. Tufano et al., "Deep learning similarities from different representations of source code", 15th International Conference on Mining Software Repositories (MSR), 2018.

S. Wang, T. Liu, J. Nam, L. Tan, "Deep Semantic Feature Learning for Software Defect Prediction", IEEE Transactions on Software Engineering, 2018.

J. Zhou, H. Zhang, D. Lo, "Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports", International Conference on Software Engineering (ICSE), 2012.

Y. Zhou et al., "How Far We Have Progressed in the Journey? An Examination of Cross-Project Defect Prediction", ACM Transactions on Software Engineering and Methodology, vol. 27, no. 1, 2018.