Previously I looked at what it means to test machine learning systems,
and how one might use machine learning in software testing.
Most of the material I found on testing machine learning systems was academic in nature, and as such a bit lacking in practical views.
Various documents on the Uber incident (the fatal hitting of a pedestrian walking with a bicycle)
have since been published, and I had a look at those documents to find more insight into what it might
mean to test overall systems that rely heavily on machine learning components. I call these machine-learning intensive systems.
Several articles have been published on the accident and the information released about it.
I leave the further details and other views to those articles,
and here just try to find some insights related to the testing (and development) aspects.
However, a brief overview is always in order to set the context.
This accident was reported to have happened on the 18th of March, 2018.
It involved an Uber test car (a modified Volvo XC90) hitting a person walking with a bicycle.
The person died as a result of the impact.
This was on a specific test route that Uber used for their self-driving car experiments.
There was a vehicle operator (VO) person in the Uber car,
whose job was to oversee the autonomous car performance, mark any events of interest (e.g., road debris, accidents, interesting objects), label objects
in the recorded data, and take over control of the vehicle in case of emergency or other need (system unable to handle some situation).
The records indicate there used to be two VOs per car, one focusing more on the driving, and one more on recording events.
They also indicate that before the accident, the number of VOs had been reduced to just one.
The roles were combined, and an update to the car's operator console was designed so that a single VO could perform both roles.
The lone VO was then expected to keep a constant eye on the road, monitor everything, label the data, and perform any other tasks as needed.
Use of a mobile phone during driving was prohibited for the VO, but the accident documents indicate the VO had been glancing toward the
spot where their mobile device was located, in a slot on the dashboard.
The documents also indicate the VO had several video streaming applications installed,
and records from the Hulu streaming service showed video streaming on the VO's account at the time of the accident.
The accident itself was the result of many combined factors,
where the human VO seems to have had their attention elsewhere at just the wrong time,
and the automation system failed to react properly to the pedestrian.
Some points on the automation system in relation to potential failures:
- The system kept records of each moving object / actor and their movement history, using previous movements and positions as an aid to predict future movements. However, the system was designed to discard all previous movement (position) history when it changed the classification of an object. So no history was available to predict the movement of an object / actor if its classification changed.
- The classification of the pedestrian who was hit kept changing multiple times before the crash. As a result, the system constantly discarded all information related to them, severely inhibiting its ability to predict the pedestrian's movement.
- The system was designed not to classify anything outside a road crossing as a pedestrian. As such, before the crash, the system continuously changed the pedestrian's classification between vehicle, bicycle, and other. This was the cause of losing the movement history. The system was simply not designed for the possibility of someone walking on the road outside a crossing area.
- The system had safeguards in place to stop it from reacting too aggressively. A delay of 1 second was in place to delay braking when a likely issue was identified. This delayed automatic braking even at a point where a likely crash was identified (as in the accident). The reasoning was to avoid too aggressive reactions to false positives. I guess they expected the VO to react, and to log issues for improvement.
- Even when the danger was identified and automated braking started, it was limited to a reasonable force to avoid too much impact on the VO. If this maximum braking was calculated as insufficient to avoid impact, the system would brake even less and emit an audio signal for the VO to take over. So if the maximum is not enough, slow down less(?). And the maximum emergency braking force was not set very high (as far as I understand..).
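The history-discard behaviour described in the bullets above can be illustrated with a small sketch. All names here are hypothetical and hugely simplified; the real system is of course far more complex than this:

```python
# Hypothetical sketch of a tracker that discards movement history on
# reclassification, as the accident reports describe. Not Uber's actual code.

class TrackedObject:
    def __init__(self, label):
        self.label = label
        self.history = []  # past (x, y) positions

    def observe(self, position, label):
        if label != self.label:
            # The reported design flaw: reclassification wipes the history,
            # so no past positions remain to extrapolate future movement from.
            self.history = []
            self.label = label
        self.history.append(position)

    def can_predict_path(self, min_points=3):
        # Path prediction needs some movement history to extrapolate from.
        return len(self.history) >= min_points

# An object whose classification keeps flapping never accumulates history.
flapping = TrackedObject("vehicle")
for step, label in enumerate(["vehicle", "vehicle", "bicycle", "other", "bicycle"]):
    flapping.observe((float(step), 0.0), label)
print(flapping.can_predict_path())  # constant relabeling leaves too little history

# A stably classified object keeps its full history.
stable = TrackedObject("pedestrian")
for step in range(5):
    stable.observe((float(step), 0.0), "pedestrian")
print(stable.can_predict_path())
```

The point of the sketch is how the design couples classification and tracking: a perfectly good position track is thrown away every time the classifier changes its mind.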
Before the crash, the system thus took an overly long time to identify the danger due to bad assumptions (no pedestrians outside crossings). It lost the pedestrian's movement history due to dropping data on classification changes. It waited for 1 second after finally identifying the danger before doing anything, and then initiated a slowdown rather than emergency braking. And the VO seemed to be distracted from observing the situation.
After the accident, Uber moved to address all these issues.
There are several other documents on various aspects of the VO,
the automation system, and the environment available on the National Transportation Safety Board website for those interested,
including nice illustrations of all aspects.
This was a look at the accident and its possible causes to give some context.
Next a look at the system architecture to also give some context of potential testing approaches.
Uber System Architecture
When testing in any domain, understanding the system architecture is important, so let's take a look.
The Uber document on the topic lists the following main software modules:
- Perception: Collects data from different sensors around the car
- Localization: Combines detailed map data with sensor data for accurate positioning
- Prediction: Takes Perception output as input, predicts actions for actors and objects in the environment.
- Routing and Navigation: Uses map data, vehicle status, operational activity to determine long term routes for a given goal.
- Motion Planning: Generates shorter term motion plans to control the vehicle in the now. Based on Perception and Prediction inputs.
- Vehicle Control: Executes the motion plan using vehicle communication interfaces.
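The dataflow between these modules can be sketched roughly as below. The function names follow the module names from the Uber document, but the signatures and data shapes are purely my illustration, not Uber's interfaces:

```python
# Hypothetical sketch of the module dataflow described above.
# All signatures and data structures are illustrative assumptions.

def perception(sensor_data):
    # Detect objects / actors from raw sensor input (LIDAR, cameras, radar).
    return [{"id": 1, "label": "cyclist", "position": (10.0, 2.0)}]

def prediction(perceived):
    # Predict near-future movement for each perceived actor.
    return [{**obj, "predicted_path": [(10.0, 2.0), (9.5, 2.0)]}
            for obj in perceived]

def motion_planning(route_goal, perceived, predicted):
    # Short-term plan respecting the route goal and predicted movements.
    return {"trajectory": [(0.0, 0.0), (1.0, 0.0)], "max_speed_mps": 10.0}

def vehicle_control(plan):
    # Translate the motion plan into concrete actuator commands.
    return {"throttle": 0.2, "brake": 0.0, "steering": 0.0}

sensors = {"lidar": [], "camera": [], "radar": []}
perceived = perception(sensors)
predicted = prediction(perceived)
plan = motion_planning("route_goal", perceived, predicted)
commands = vehicle_control(plan)
print(sorted(commands))
```

Even this trivial version makes one property visible: each stage consumes the previous stage's output, so errors (like a wrong classification in Perception) propagate all the way into the control commands.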
The same Uber document also describes the self-driving car hardware.
The components in place at the time the document was written:
- Light Detection and Ranging (LIDAR): Measuring distance to actors and objects, 100m+ range.
- Cameras: multiple cameras for different distances, covering 360 degrees around the vehicle. Both near- and far-range. To identify people and objects.
- Radar: Object detection, ranging, relative velocity of objects. Forward-, backward-, and side-facing.
- Global Positioning System (GPS): Coarse position to support vehicle localization (positioning it), vehicle command (to use location / position for control), map data collection, satellite measurements.
- Self-Driving Computer: A liquid-cooled local computer in the car to run all the SW modules (Perception, Prediction, Motion Planning, …)
- Telematics: Communication with backend systems, cellular operator redundancy, etc.
Planned components (not installed back then, but in future plans..):
- Ultrasonic Sensors: Uses echolocation to range objects. Front, back, and sides.
- Vehicle Interface Module: Seems to be an independent backup module to safely control and stop the vehicle in case of autonomous system faults.
Now that we have established a list of the SW and HW components, let's look at their functionality.
The system is described as using very detailed maps, including:
- Geometry of the road and curbs
- Drivable surface boundaries and driveways
- Lane boundaries, including paint lines of various types
- Bike and bus lanes, parking regions, stop lines, crosswalks
- Traffic control signals, light sets, and lane and conflict associations (whatever that is? :))
- Railroad crossings and trolley or railcar tracks
- Speed limits, constraint zones, restrictions, speed bumps
- Traffic control signage
Combined with precise location information,
the system uses these detailed maps to beforehand "predict" what type of environment lies ahead,
even before the Perception module has observed it.
This is used to prepare for the expected road changes, anticipate speed changes, and optimize for expected motion plans.
For example, when anticipating a tight turn in the road.
Perception and Prediction
The main tasks of the Perception module are described as detecting the environment, actors and objects.
It uses sensor data to continuously estimate the speed, position, orientation,
and other variables of the objects and actors, as a basis to make better predictions and plans about their
future movement, velocity, and position.
An example given is the turn signals of other cars, which are used to predict
their actions.
At the same time, all the other data is also recorded and used to predict other,
alternative courses for the same car, in case it does not turn even though its turn signal is on.
While the Perception module observes the environment (collects sensor data),
the Prediction component uses this, and other available, data as a basis for predicting the movement of the other actors, and changes in the environment.
The observed environment can have different types of objects and actors in it.
Some are classified as fixed structures, and are expected not to move: buildings, ground, vegetation.
Others are classified as more dynamic actors, and expected to move: vehicles, pedestrians, cyclists, animals.
The Prediction module makes predictions on where each of these objects is likely to move in the next 10 seconds.
The predictions include multiple properties for each object and actor,
such as movement, velocity, future position, and intended goal.
The intended goal (or intention) is mentioned in the document,
but I did not find a clear description of how this would be used.
In any case, it seems plausible that the system would assign "intents" to objects,
such as pedestrian crossing a street, a car turning, overtaking another car,
going straight, and so on.
At least these would seem useful abstractions and input to the next processing module (Motion Planning).
The Prediction module makes predictions multiple times a second to keep an updated representation available.
The predictions are provided as input to the Route and Motion Planning module,
including the "certainty" of those predictions.
This (un)certainty is another factor that the Motion Planning module can use as input to apply more caution to any control actions.
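One plausible way such (un)certainty could feed into more cautious control is to scale safety margins by it. This is purely my illustration of the idea; the documents do not give any actual formula:

```python
def caution_scaled_buffer(base_buffer_m, prediction_certainty):
    """Grow the required spatial buffer as prediction certainty drops.

    Hypothetical rule: at full certainty use the base buffer; at zero
    certainty, double it. The real weighting is not described in the docs.
    """
    assert 0.0 <= prediction_certainty <= 1.0
    return base_buffer_m * (2.0 - prediction_certainty)

print(caution_scaled_buffer(2.0, 1.0))  # 2.0 — fully confident prediction
print(caution_scaled_buffer(2.0, 0.5))  # 3.0 — less certain, keep more space
```

The exact shape of the scaling matters less than the principle: uncertain predictions should translate into larger margins and gentler plans, not be treated the same as confident ones.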
Route and Motion Planning
Motion Planning (as far as I understand) refers to short-term movements,
translating to concrete control instructions for the car.
Route planning on the other hand refers to long term planning on where to go,
and gives goals for the Motion Planning to guide the car to the planned route.
Motion Planning combines information from generated route (Route Planning), perceived objects and actors (Perception), and
predicted movements (Prediction).
Mapping data is used for the "rules of the road", as well as any active constraints.
I guess also combined with sensor data for more up-to-date views in the local environment (the public docs are naturally not super-detailed on everything).
Using these, it creates a motion plan for the vehicle.
Data from Perception and Prediction modules is also used as input,
to define the anticipated movements of other objects and actors.
A spatial buffer is defined to be kept between the vehicle and other objects in the environment.
My understanding is that this refers to keeping some amount of open space between the car and environmental elements.
The size of this buffer varies with variables such as autonomous vehicle speed (and properties and labels of other objects and actors I assume).
To preserve the required buffer, the system may take actions such as changing lanes, braking, or stopping and waiting for the situation to clear.
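The listed responses suggest a simple escalation, which could be sketched like this. The decision rule and thresholds are my own hypothetical illustration of the behaviours named in the document, not Uber's logic:

```python
def select_action(gap_m, required_buffer_m, lane_change_possible):
    """Pick a response when the spatial buffer is threatened.

    Hypothetical escalation over the behaviours named in the document:
    continue, change lanes, brake, or stop and wait.
    """
    if gap_m >= required_buffer_m:
        return "continue"
    if lane_change_possible:
        return "change_lane"
    if gap_m > 0.5 * required_buffer_m:
        return "brake"
    return "stop_and_wait"

print(select_action(5.0, 3.0, False))  # buffer intact: continue
print(select_action(2.0, 3.0, True))   # buffer threatened, lane free: change_lane
print(select_action(2.0, 3.0, False))  # no lane available: brake
print(select_action(1.0, 3.0, False))  # badly eroded buffer: stop_and_wait
```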
The system is also described as being able to identify and track occlusions in the environment.
These would be environmental elements, such as buildings or other cars, blocking a view to certain other parts of the environment.
These are constantly reasoned about,
and the system becomes more conservative in its decisions when occlusions are observed.
It aims to be able to avoid actors coming out of occlusions at reasonable speed.
The Vehicle Control module executes trajectories provided by the Motion Planning module.
It controls the vehicle through communication interfaces.
Example controls include steering, braking, turn signals, throttle, and switching gears.
It also tracks any limits set for the system (or environment?),
and communicates back to the operation center as needed.
Data Collection and Test Scenarios
Since my point with this "article" was to look into what it might mean to test a machine-learning intensive system, I find it important to also look at what type of data is used to train the machine learning components, and how all the used data is collected.
And how these are used as part of test cases (in the Uber documents they seem to be called test scenarios).
Of course, such complex systems use this type of data for many different purposes besides just the machine learning part,
so they are generally interesting as well.
The Uber document describes data uses including system performance analysis, quality assurance, machine teaching and testing, simulated environment creation and validation, software development, human operator training and assessment, and map building and validation.
Summarizing the various parts related to data collection and synthesis from the Uber descriptions,
at the heart of all this is the real-world training data collected by the VOs driving around, with the car and automated sensors collecting detailed data,
and the VOs tagging the data.
This tagging also helps further identify new scenarios, objects, and actors.
The sensor data is based on the sensors I listed above in the HW section.
Additionally, the system is listed as recording:
- telemetry (maybe referring to metrics about the network? or just generally to transferring data?)
- control signals (commands for vehicle control?)
- Controller Area Network (CAN) messages
- system health, such as:
  - hard drive speeds
  - internal network performance
  - computer temperatures
The larger datasets are recorded in onboard (car) storage.
Smaller amounts of data are transmitted in near real-time using over-the-air (OTA) interfaces over cellular networks to the Uber control center.
These use multiple cellular networks for cybersecurity and resiliency purposes.
The OTA data includes insights on how the vehicles are performing, where they are, and their current state.
In the documents (Uber and another from the RAND corporation),
the operational environment of the autonomous vehicle is referred to as the operational design domain (ODD).
Defining the ODD is quite central to the development (as well as testing) of the system and training the ML algorithms,
as well as the controlling logic based on those.
It defines the world in which the car operates, and all the actors and objects, and their relations.
The Uber document describes using something called scenarios as test cases.
Well, it mostly does not use the term "test case", but for practical purposes these seem similar.
Of course, this is quite a bit more complex than a traditional software test case with simple inputs and outputs,
requiring description of complex real-world environments as inputs, and boundaries and profiles of accepted behaviour as outputs, rather than specific data values.
These complex real-world inputs and outputs also vary over time, unlike the typically static input values of traditional software tests.
Thus, also a time-series aspect is relevant to the inputs and outputs.
Uber describes a unified schema being used to describe the scenarios and data.
Besides the collected data and learned models, other data inputs are also used, such as operational policies.
Various success criteria are defined for each scenario, such as speed, distance, and description of safe behaviour.
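Putting the above together, a scenario-as-test-case might look something like the sketch below: time-series inputs plus an acceptance envelope instead of single expected values. The schema and field names are entirely my invention, not Uber's unified schema:

```python
# Hypothetical scenario schema: inputs and acceptance criteria are time
# series / envelopes rather than single expected values. Illustrative only.

scenario = {
    "name": "pedestrian_crossing_mid_block",
    "timeline": [  # environment state sampled over time
        {"t": 0.0, "pedestrian_pos": (30.0, -3.0)},
        {"t": 1.0, "pedestrian_pos": (30.0, -1.5)},
        {"t": 2.0, "pedestrian_pos": (30.0, 0.0)},
    ],
    "acceptance": {
        "min_distance_m": 2.0,   # never get closer to the pedestrian than this
        "max_decel_mps2": 6.0,   # braking must stay within this limit
    },
}

def check_run(distances, decels, acceptance):
    """Judge a recorded run against the scenario's acceptance envelope."""
    return (min(distances) >= acceptance["min_distance_m"]
            and max(decels) <= acceptance["max_decel_mps2"])

ok = check_run([12.0, 6.0, 2.5], [0.0, 3.0, 5.5], scenario["acceptance"])
print(ok)
```

The key difference from a classic assert-equals test is that the verdict is computed over an entire trajectory against boundaries and profiles, rather than comparing one output value.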
When new actors, environmental elements, or other similar items are encountered,
they are recorded and tagged for further training of the autonomous system.
The resulting definitions and characterization of the ODD is then used as input to improve the test scenarios and create new ones.
This includes improving the test simulations, and test tracks for coverage.
Events such as large deviations between consecutive planned trajectories are recorded and automatically tagged for investigation.
Simulations are used to evaluate whether they are fixed, and the new scenarios are added to ML training datasets, or kept as "hard test cases".
This seems a bit similar to the Tesla "shadow mode" I discussed earlier, just a bit more limited.
Besides a general overview of the scenario development,
the Uber documents do not really discuss how they handle test coverage, or what types of tests they run.
There are some minor references but nothing very concrete.
It is more focused on describing the overall system, and some related processes.
I tried to collect here some points that I figured seemed relevant.
A key difference to more traditional software systems seems to be how these types of systems do not have a clearly defined input or output space.
The interaction interfaces (API/GUI) of traditional software systems naturally define some contract for what type of input is allowed and expected.
With these it is possible to apply the traditional techniques such as category partitioning, boundary analysis, etc.
When your input space is the real world and everything that can happen in it,
and output space is all the possible actions in relation to all the possible environmental configurations,
it gets a bit more complex.
In a similar comment, Uber describes their system as requiring more testing with different variations.
Potential Test Scenarios from Uber Docs
These are just points I collected that I thought would illustrate something related to test scenarios and test coverage.
Uber describes evaluating their system performance in different common and rare scenarios,
using measurements such as traffic rule violations, and vehicle dynamic attributes.
In practice, this means having very few crash and unsafe scenarios available, but a large number of safe scenarios.
That is, when the scenarios are based on real-world use and data,
there are commonly far more "safe" scenarios available than "unsafe" ones, due to the rarity of crashes, accidents, and other problem cases compared to normal operations.
With only this type of highly biased dataset available, I expect there is a need to synthesize more extensive test sets,
or to use other methods to test and develop such systems more extensively.
The definition of safety also does not seem to be a binary decision but rather there can be different scales of "safe",
depending on the safety attribute.
For example, a safety margin, such as how much distance the autonomous vehicle should keep from other vehicles, is a continuous variable, not a binary value.
Some variables might of course have binary representations, such as avoiding hitting a pedestrian, or running a red light.
But even the pedestrian metric may have similar distance measures, impact measures, etc.
So I guess it's a bit more complicated than just safe or not safe.
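One way to make that concrete is a score that blends binary violations with continuous margins. This formula is my own hypothetical illustration, not anything from the documents:

```python
def safety_score(min_gap_m, required_gap_m, hit_pedestrian, ran_red_light):
    """Blend binary violations with a continuous margin into one score.

    Hypothetical scoring: binary violations zero the score; otherwise the
    score is the (capped) fraction of the required gap that was kept.
    """
    if hit_pedestrian or ran_red_light:
        return 0.0
    return min(min_gap_m / required_gap_m, 1.0)

print(safety_score(3.0, 2.0, False, False))  # 1.0 — full margin kept
print(safety_score(1.0, 2.0, False, False))  # 0.5 — margin partly eroded
print(safety_score(5.0, 2.0, True, False))   # 0.0 — binary violation dominates
```

A scalar like this can then be aggregated across a scenario suite, which a pure pass/fail verdict would not allow.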
Dataset augmentation and imbalanced datasets are common issues in developing and training ML models.
However, those techniques are (to my understanding) based on a single clear goal such as classification of an object,
not on complex outputs such as overall driving control and its relation to the real world.
Thus, I would expect to use an overall scenario-augmentation type of approach, more holistic than a simple classifier (which on its own might be part of the system).
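Such scenario augmentation could be as simple as perturbing a base scenario's key parameters to generate variants, sketched below. The parameter names and perturbation ranges are my assumptions for illustration:

```python
import random

def augment_scenario(base, n, seed=0):
    """Generate scenario variants by perturbing whole-situation parameters.

    A hypothetical, holistic analogue of dataset augmentation: vary the
    entire situation (speeds, positions) rather than single model inputs.
    """
    rng = random.Random(seed)  # seeded for reproducible test sets
    variants = []
    for _ in range(n):
        v = dict(base)
        v["ego_speed_mps"] = base["ego_speed_mps"] * rng.uniform(0.8, 1.2)
        v["actor_offset_m"] = base["actor_offset_m"] + rng.uniform(-1.0, 1.0)
        variants.append(v)
    return variants

base = {"name": "cyclist_crossing", "ego_speed_mps": 10.0, "actor_offset_m": 0.0}
variants = augment_scenario(base, 5)
print(len(variants))
```

In a real system the perturbations would of course need to respect physical plausibility, and rare unsafe situations would likely need targeted, not just random, generation.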
Some properties I found in the Uber documents (as discussed above), referring to potential examples of test requirements:
- Movement of objects in relation to the vehicle.
- Inability of the system to classify a pedestrian correctly if not near a crosswalk.
- Inability of the system to predict a pedestrian's path correctly when not classified as a pedestrian.
- Overly strict assumptions, such as a cyclist not moving across lanes.
- Losing the location history of tracked objects and actors when their classification changed.
- Test coverage requirements Uber defines based on collected map data and tags.
- Map data predicting that the upcoming environment would be of a specific type (e.g., a left curve), while it has changed and observations differ.
- Another car signaling a left turn while other predictors do not predict that, and the car possibly not actually turning left.
- Certainty of classifications.
- Occlusions in the environment.
Looking at the above examples, trying to abstract some more generic concepts that would serve as a potentially useful basis:
- Listing of known objects / actors
- Listing of labels for different types of objects / actors
- Assumptions made about specific types of objects / actors
- Properties of objects / actors
- Interaction constraints of objects and actors
- Probabilities of classifications for different objects / actors and labels
- Functionality when faced with unknown objects / actors
The above list may be lacking in details that would cover different types of systems, or even the Uber example fully,
but I find it provides insight into how this is heavily about probabilities, uncertainty, and preparing for, and handling, that uncertainty.
For different types of systems, the actual objects, actors, labels and properties would likely change.
To illustrate these a bit more concretely with the autonomous car example: something that seems important is the ability to reason, to the extent possible, about previously unknown objects and actors.
For example, a moving object that does not seem to fit any known category, but has a known movement history, speed, and other variables.
Perhaps there would be a more abstract category of a moving object, or some hierarchy of such categories.
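A minimal sketch of such a domain model, with a fallback "unknown" category for low-confidence observations. The structure, threshold, and rule are all my assumptions, illustrating the concepts in the list above:

```python
# Hypothetical domain model capturing some of the abstracted concepts:
# actor types, assumptions about them, and classification probabilities.

domain = {
    "actor_types": ["vehicle", "pedestrian", "cyclist", "animal", "unknown"],
    "assumptions": {
        # Avoiding the accident's bad assumption: pedestrians CAN appear
        # outside crosswalks, cyclists CAN move across lanes.
        "pedestrian": {"appears_outside_crosswalk": True},
        "cyclist": {"may_cross_lanes": True},
    },
}

def classify(probabilities, threshold=0.6):
    """Pick a label, falling back to 'unknown' when no class is confident.

    Hypothetical rule: a low-confidence observation is tracked as a generic
    moving object rather than force-fitted to a (possibly wrong) class.
    """
    label, p = max(probabilities.items(), key=lambda kv: kv[1])
    return label if p >= threshold else "unknown"

print(classify({"vehicle": 0.5, "bicycle": 0.3, "other": 0.2}))
print(classify({"pedestrian": 0.9, "vehicle": 0.1}))
```

The design point is that "unknown" is a first-class category with its own (conservative) handling, instead of the system oscillating between wrong concrete labels.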
There is also the question of any of these objects or actors changing their classifications and goals, and how their long-term history should be taken into account when making future predictions.
In a different "machine learning intensive" system (not autonomous cars), one might use a different set of properties, actors, objects, etc.
But it seems some similar consideration could be useful.
Possible Test Strategies
Once the domain (the "ODD") is properly defined, as above, it seems many traditional testing techniques could be applied.
In the Uber documents, they describe performing architecture analysis to identify all potential failure points.
They divided faults into three levels: faults in the self-driving system on its own, faults in relation to the environment (e.g., at intersections), and faults related to the operational design domain, such as unknown obstacles the system does not recognize (or misclassifies?).
This could be another way to categorize a more specific system, or inspiration for other similar systems.
Another part of this type of system could be related to the human aspect.
This is somewhat discussed also in the Uber docs, in relation to operational situations for the system: a distracted operator, and a fatigued operator.
They have some functionality in place (especially after the accident) to monitor operator alertness via in-car dashcam and attached analysis.
However, I will not go into these here.
Testing ML Components
For testing the ML components, I discussed various techniques in a previous blog post.
This includes approaches such as metamorphic testing, adversarial testing, and testing with reference inputs.
In autonomous cars, this might be visual classifiers (e.g., convolutional networks),
or path prediction models (recurrent neural nets etc.), or something else.
Testing ML Intensive Systems
As for the set of properties I listed above, it seems once these have been defined, using traditional testing techniques should be quite useful:
- combinatorial testing: combine different objects / actors, with different properties, labels, etc. observe the system behaviour in relation to the set constraints (e.g., safety limits).
- boundary analysis: apply to the combinations and constraints from the previous bullet. for example, probabilities at different values. might require some work to define interesting sets of probability boundaries, or ways to explore the (combined) probability spaces. but not that different in the end from more traditional testing.
- model-based testing: use the above type of variables to express the system state, use a test generator to build extensive test sets that can be used to cover combinations, but also transitions between states and their combinations over time.
- fault-injection testing: the system likely uses data from multiple different data sources, including numerous different types of sensors. different types of faults in these may have different types of impact on the ML classifier outputs, overall system state, etc. fault-injection testing in all these elements can help surface such cases. think Boeing Max from recent history, where a single sensor failure caused multiple crashes with hundreds of lives lost.
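As a concrete taste of the first bullet, a combinatorial sketch over the kind of domain variables discussed earlier: enumerate combinations of actor type, location, and classification certainty, and check each against a constraint. The parameter values and the buffer rule are mine, purely for illustration:

```python
from itertools import product

# Minimal combinatorial-testing sketch: enumerate combinations of domain
# variables and check each against a (hypothetical) safety constraint.

actors = ["pedestrian", "cyclist", "vehicle", "unknown"]
locations = ["crosswalk", "mid_block", "intersection"]
certainties = [0.2, 0.6, 0.95]

def required_buffer(actor, location, certainty):
    # Hypothetical rule: vulnerable or unknown actors and low certainty
    # demand more space; unexpected locations add extra caution.
    base = 2.0 if actor in ("pedestrian", "cyclist", "unknown") else 1.0
    if location == "mid_block":
        base += 0.5  # unexpected place for a pedestrian: be more careful
    return base * (2.0 - certainty)

cases = list(product(actors, locations, certainties))
print(len(cases))  # 4 actors x 3 locations x 3 certainties = 36 combinations

for actor, location, certainty in cases:
    buf = required_buffer(actor, location, certainty)
    # The "oracle" here is a simple invariant: a positive buffer must
    # always be required, never zero or negative.
    assert buf > 0, (actor, location, certainty)
```

With real parameter counts the full cross product explodes, which is where pairwise/covering-array tools would come in; the sketch just shows the shape of the approach.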
The real trick may be in combining these into actual, complete, test scenarios for unit tests, integration tests, simulators, test tracks, and real-world tests.
Regarding the last bullet above (fault-injection testing), the Uber documents discuss this from the angle of fault-injection training – injecting faults into the system and seeing how the vehicle operator reacts to them. Training them how they should react.
This sounds similar to fault-injection testing, and I would expect that they would have also applied the same scenarios more broadly.
However, I could not find mention of this.
Regarding general failures, and when they happen in real use, the same fault models can also be used to prepare and mitigate actual operational faults.
The Uber docs also discuss this viewpoint: the system has a set of identified fault conditions and mitigations for when they happen. These are identified through redundant systems and overall monitoring across the system. Example faults:
- Primary compute power failure
- Loss of primary compute or motion planning timeout
- Sensor data delay
- Door open during driving
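A fault-condition list like the one above naturally maps to a fault-to-mitigation table. The mitigations below are my guesses for illustration, not Uber's documented responses:

```python
# Hypothetical fault-to-mitigation mapping in the spirit of the faults
# listed above; the mitigation choices are illustrative assumptions.

MITIGATIONS = {
    "primary_compute_power_failure": "controlled_stop_via_backup",
    "motion_planning_timeout": "controlled_stop_via_backup",
    "sensor_data_delay": "reduce_speed_and_alert_operator",
    "door_open_during_driving": "alert_operator",
}

def mitigate(fault):
    # Unknown / unmodeled faults default to the most conservative response.
    return MITIGATIONS.get(fault, "controlled_stop_via_backup")

print(mitigate("sensor_data_delay"))
print(mitigate("some_unmodeled_fault"))
```

For fault-injection testing, each key in such a table is also a test case: inject the fault, and assert the expected mitigation actually triggers.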
General Safety Procedures
Volvo Safety Features
Besides the Uber self-driving technology,
the documents show the Volvo cars having safety features of their own, an Advanced Driver Assistance System (ADAS), including an automated emergency braking system named "City Safety".
It contains a forward collision warning system, alerting the driver of an imminent collision and
automatically applying the brakes when it observes a potentially dangerous situation.
This also includes pedestrian, cyclist, and large animal detection components.
However, these were turned off during autonomous driving mode,
and only active in manual mode.
Simulation tests conducted by the Volvo Group showed how the ADAS features would have been able to avoid the collision (17 times out of 20) or significantly reduce collision speed and impact (remaining 3 times).
In post-crash changes, the ADAS system is now active at all times (along with many other fixes to the issues discussed here).
Information Sharing and Other Domains
The documents on reviews and investigations after the accident include comparisons to safety cultures in many other (safety-critical) domains: Nuclear Power, Transportation (Rail), Aviation, Oil and Gas, and Maritime.
While some are quite specific to the domains, and related to higher level process and cultural aspects,
there seem to be many quite interesting points one could build on also for the autonomous driving domain.
Or other similar ones. Safety has many higher level shared aspects across domains.
Regarding my look for testing related aspects, in many cases
replacing "safety" with "QA" would also seem to provide useful insights.
One practical example is how (at least) avionics and transportation (rail) domains have processes in place
to collect, analyze, and share information on unsafe conditions and events observed.
This would seem like a useful way to also identify relevant test scenarios for testing products
in the autonomous driving domain.
Given how much effort extensive collection of such data requires,
and how expensive and dangerous it can be, the benefits of sharing it seem quite obvious.
Related to this, Uber discusses shared metrics for evaluating progress of their development.
These include disengagements and self-driving miles travelled.
While they have used these to signal progress both internally and externally,
they also note that such metrics can easily lead to "gaming the system" at the expense of safety or working system.
For example, in becoming overly conservative to avoid disengagements,
or in using inconsistent definitions of the metrics across developers / systems.
Uber discusses the need to create more broadly usable safety performance metrics with academic and industry partners.
They list how these metrics should be:
- Specific to different development stages (development, testing, deployment)
- Specific to different operational design domains, scenarios and capabilities
- Have comparable metrics for human drivers
- Applied in validation environments and scenarios where autonomous cars interact with other autonomous cars from different companies
The Uber safety approach document also refers to more general work toward an automotive safety framework by the RAND Corporation.
This includes topics such as building a shared taxonomy to form a basis for discussion and sharing across vendors. It also discusses safety metrics, their use across vendors, and the possible issues in use and possible gaming of such metrics. And many other related aspects of cross-vendor safety program.
Interesting. Seems like lots of work to do there as well.
This was an overly long look at the documents from the Uber accident.
I was thinking of just looking at the testing aspect briefly, but I guess it is hard to discuss them properly without setting the whole background and overall context.
Overall, the summary is not that complicated.
I just get carried away with writing in too much detail.
However, I found that writing this down helped me reason better about the difference between more traditional software-intensive systems
and these new types of machine-learning intensive systems.
I would summarize it as the need to consider everything in terms of probabilities:
the unknown elements in the input and output spaces, constraints over everything, the complexity of identifying all the objects and actors and their possible intents,
and all the relations between all the possibilities. With probabilities (or uncertainty).
But once the domain analysis is done well, and the inputs and outputs are understood, I find that traditional testing techniques such as combinatorial testing, model-based testing, category partitioning, boundary analysis, and fault-injection testing would give a good basis.
It might take a somewhat broader insight to apply them efficiently, though.
As for the Uber approach, it is interesting.
I previously discussed the Tesla approach of collecting data from fleets of deployed consumer vehicles.
And features such as the Tesla shadow mode, continuously running in the background as the human driver drives,
always evaluating whether each autonomous decision the system would have made matches the action the actual human driver took, or how it differs.
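The shadow-mode idea, as I understand it from public descriptions, can be sketched in a few lines: compare the system's hypothetical decision against the human's actual action, and log disagreements for analysis, without the system ever controlling the car. The data shapes and tolerance are my assumptions:

```python
# Simplified sketch of shadow-mode evaluation as I understand the concept
# from public descriptions; data shapes and tolerance are assumptions.

def shadow_compare(system_action, human_action, tolerance=0.1):
    """Return True when system and human agree within a tolerance."""
    return (system_action["kind"] == human_action["kind"]
            and abs(system_action["magnitude"] - human_action["magnitude"]) <= tolerance)

disagreements = []
samples = [
    # (what the system WOULD have done, what the human actually did)
    ({"kind": "brake", "magnitude": 0.3}, {"kind": "brake", "magnitude": 0.35}),
    ({"kind": "steer", "magnitude": 0.1}, {"kind": "brake", "magnitude": 0.5}),
]
for system, human in samples:
    if not shadow_compare(system, human):
        # Each disagreement is a candidate training example or test scenario.
        disagreements.append((system, human))

print(len(disagreements))
```

Every logged disagreement is essentially a free, real-world-sourced test scenario, which is what makes the approach scale so well with fleet size.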
Not specifically trained VO’s as in the Uber case, but usual consumer drivers (so Tesla customers at work helping to improve the product).
The Tesla approach seems much more scalable in general. It might also generalize better as opposed to Uber aiming for very specific routes and building super detailed maps of just those areas.
Creating and maintaining such super-detailed maps seems like a challenging task.
Perhaps if the companies have very good automated tools to take care of it, it can be easier to manage and scale.
I don’t know if Tesla does some similar mapping with the help of their consumer fleet,
but would be interesting to see similar documents and compare.
As for other types of machine learning (intensive) systems, there are many variations, such as those using IoT sensors and data to provide a service.
Those are maybe not as open-worlded in all possible input spaces.
However, it would seem to me that many of the considerations and approaches I discussed here could be applied.
Probabilities, (un-)certainties, domain characterizations, relations, etc.
Remains interesting to see, perhaps I will find a chance to try someday.. 🙂