All posts by Teemu

About Teemu

I am interested in technology research, data engineering, and applied machine learning. Advanced test automation architectures and systems are also in my ballpark. If you have a great offer to make, let me know.. :)

Explaining Machine Learning Classifiers with LIME

Machine learning algorithms can produce impressive results in classification, prediction, anomaly detection, and many other hard problems. Understanding what the results are based on is often complicated, since many algorithms are black boxes with little visibility into their inner workings. Explainable AI is a term referring to techniques for providing human-understandable explanations of ML algorithm outputs.

Explainable AI is interesting for many reasons, including being able to reason about the algorithms used and the data used to train them, and to better understand how to test a system that uses such algorithms.

LIME, or Local Interpretable Model-Agnostic Explanations, is one technique that seems to have gotten attention lately in this area. The idea of LIME is to give it a single datapoint and the ML algorithm to use, and it will try to build an understandable explanation for the output of the ML algorithm for that specific datapoint. Something like "because this person was found to be sneezing and coughing (datapoint features), there is a high probability they have the flu (ML output)".
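As a minimal sketch of how this looks with the LIME Python package, assuming a fitted scikit-learn style classifier `model` and a pandas DataFrame `X_train` (the names here are placeholders, not the exact code used later):

from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=X_train.values,           # background data LIME samples around
    feature_names=list(X_train.columns),
    class_names=["class 0", "class 1"],
    mode="classification",
)

# Explain a single datapoint: LIME perturbs it, queries the model, and fits a
# local linear surrogate whose coefficients become the explanation weights.
exp = explainer.explain_instance(
    data_row=X_train.values[0],
    predict_fn=model.predict_proba,
    num_features=6,
)
print(exp.as_list())                        # [(feature condition, weight), ...]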

There are plenty of introductory articles around for LIME, but I felt I needed something more concrete. So I tried it out on a few classifiers and datasets / datapoints to see how it works in practice.

For the impatient, I can summarize: LIME seems interesting and going in the right direction, but I still found the details confusing to interpret. It didn’t really make me very confident in the explanations. There still seems to be some way to go for easy-to-understand, high-confidence explanations.

Experiment Setups

Overview

There are three sections to my experiments in the following. First, I try explaining output from three different ML algorithms specifically designed for tabular data. Second, I try explaining the output of a generic neural network architecture. Third, I try a regression problem, as opposed to the first two, which examine classification problems. Each of the three sections uses LIME to explain a few datapoints, each from a different dataset for variety.

Inverted Values

As a little experiment, I took a single feature that LIME ranked as having a high contribution to the explanation for a datapoint, for each ML algorithm in my experiments, and inverted its value. I then re-ran the ML algorithm and LIME on this same datapoint, with the single value changed, and compared the explanations.

The inverted feature was in each case a binary categorical feature, making the inversion process obvious (e.g., changing gender from male to female or the other way around). The point with this was just to see if changing the value of a feature that LIME weights highly results in large changes in the ML algorithm outputs and the associated LIME weights themselves.
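In code, the inversion experiment is roughly the following (a sketch reusing the `model` and `explainer` placeholders from the earlier snippet, and assuming the binary feature, e.g. sex, is encoded as 0/1):

original = X_test.iloc[0].copy()
inverted = original.copy()
inverted["sex"] = 1 - inverted["sex"]       # flip the binary categorical feature

for row in (original, inverted):
    probs = model.predict_proba(row.values.reshape(1, -1))[0]
    exp = explainer.explain_instance(row.values, model.predict_proba, num_features=6)
    print(probs, exp.as_list())             # compare predictions and LIME weights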

Datasets and Features

The datasets used in different sections:

  • Titanic: What features contribute to a specific person classified as survivor or not?
  • Heart disease UCI: What features contribute to a specific person being classified at risk of heart disease?
  • Ames housing dataset: What features contribute positively to predicted house price, and what negatively?

Algorithms applied:

  • Titanic: classifiers from LGBM, CatBoost, XGBoost
  • Heart disease UCI: Keras multi-layer perceptron NN architecture
  • Ames housing dataset: regressor from XGBoost

Tree Boosting Classifiers

Some of the most popular classifiers I see with tabular data are gradient boosted decision tree based ones: LGBM, Catboost, and XGBoost. Many others exist that I also use at times, such as Naive Bayes, Random Forest, and Logistic Regression, but LGBM, Catboost, and XGBoost are the ones I commonly try first these days. So I try using LIME to explain a few datapoints for each of these ML algorithms in this section. I expect evaluating other ML algorithms would follow a quite similar process.

For this section, I use the Titanic dataset. The goal with this dataset is to predict who would survive the shipwreck and who would not. Its features:

  1. survival: 0 = No, 1 = Yes
  2. pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  3. sex: Sex
  4. Age: Age in years
  5. sibsp: number of siblings / spouses aboard the Titanic
  6. parch: number of parents / children aboard the Titanic
  7. ticket: Ticket number
  8. fare: Passenger fare
  9. cabin: Cabin number
  10. embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

The actual notebook code is available on my GitHub as well as in a Kaggle notebook.
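As a rough sketch (not the exact notebook code), the three classifiers are set up something like the following, assuming the Titanic features have already been encoded into numeric X_train / y_train and X_test / y_test:

from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

models = {
    "lgbm": LGBMClassifier(),
    "catboost": CatBoostClassifier(verbose=0),
    "xgboost": XGBClassifier(),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)                 # each model gets its own LIME explanations below
    print(name, clf.score(X_test, y_test))    # rough sanity check of accuracy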

Each of the three boosting models (LGBM, Catboost, XGBoost) provides access to its internal statistics as a form of feature weights. For details, check the relevant articles and documentation. These model feature weights provide a more holistic view of the model workings, over all datapoints, as opposed to the single datapoint that LIME tries to explain. So in the following, I will show these feature weights for comparison where available.

However, there is also some very good criticism of using these types of classifier internal statistics for feature importances, noting it might also be meaningful to compare with other techniques such as permutation importance and drop-column importance. As such, I also calculate permutation importance for each of the three boosters here, as well as later for the Keras NN classifier.
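For reference, both views can be computed with a few lines for any one of the fitted boosters; a sketch, with `clf`, `X_train`, `X_test`, and `y_test` as placeholders:

from sklearn.inspection import permutation_importance

# Model internal statistics (split/gain based, depending on the booster)
internal = dict(zip(X_train.columns, clf.feature_importances_))

# Permutation importance: shuffle one feature at a time and measure the score drop
perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)
permuted = dict(zip(X_test.columns, perm.importances_mean))

print(sorted(internal.items(), key=lambda kv: kv[1], reverse=True))
print(sorted(permuted.items(), key=lambda kv: kv[1], reverse=True))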

LGBM

Feature Weights from Classifier / Permutations

The following figure illustrates the weights given by the model itself, via its feature_importances_ attribute.

LGBM Feature Weights

And the ones illustrated by the following figure are given by scikit-learn’s permutation importance function for the same classifier.

LGBM Permutation Weights

Comparing the two above, the model statistics based weights, and the permutation based weights, there is quite a difference in what they rank higher. Something interesting to keep in mind for LIME next.

Item 1

The following figure illustrates the LIME explanations (figures are from LIME itself) for the first item in the test set for Titanic data:

LIME LGBM 1

The figure shows two versions of the same datapoint. The one on the left is the original data from the dataset. The one on the right has the sex attribute changed to the opposite gender. This is the inversion of the highly ranked LIME feature I mentioned before.

Compare these LIME visualizations/explanations for the two datapoints above to the global feature importances I showed a bit higher above (from model internal statistics and permutations). The top features presented by LIME closely match those ranked as top features by the global permutation importance. In fact, it is almost an exact match.

Beyond that, the left side of the figure illustrates one of my main confusions about LIME in general. The prediction of the classifier for this datapoint is:

  • Not survived: 71% probability
  • Survived: 29% probability

I would then expect the LIME feature weights to show the highest contributions for the not survived classification. But it shows much higher weights for survived. By far, "Sex=male" seems to have the heaviest weight of any variable, and it is shown as pointing towards survived. Similarly, the overall LIME feature weights in the left hand figure are

  • Not survived: 0.17+0.09+0.03+0.00=0.29
  • Survived: 0.31+0.15+0.07+0.03+0.02+0.01=0.59

Funny how the not survived weights sum up to the exact prediction value for survived. I might think I am looking at it the wrong way, but further explanations I tried on other datapoints seem to indicate otherwise. Starting with the right part of the above figure.

The right side of the above figure, with the gender inverted, also shows the sex attribute as the highest contributor. But now, the title has risen much higher. So perhaps it is telling that a female master has a higher chance of survival? I don’t know, but certainly the predictions of the classifier changed to:

  • Not survived: 43%
  • Survived: 57%

Similarly, passenger class (Pclass) value has jumped from weighting on survival to weighting on non-survival. The sums of LIME feature weights in the inverted case do not seem too different overall, but the prediction has changed by quite a bit. It seems complicated.

Item 2

LIME explanation for the second datapoint in the test set:

LIME LGBM 2

In this one, the ML prediction for the left side datapoint seems to indicate even more strongly that the predicted survival chance is low, but the LIME feature weights point even more strongly in the opposite direction.

The right side figure here illustrates a bit how silly my changes are (inverting only gender). The combination of female with mr should never happen in real data. Well, likely not in the Titanic times, who knows what people identify as these days.. But regardless of the sanity of some of the value combinations, I would expect the explanation to reflect the prediction equally well. After all, LIME is designed to explain a given prediction with given features, however crazy those features might be. On the right hand side figure the feature weights at least seem to match the prediction a bit better than on the left side, but then why is it not always matching in the same way?

An interesting point is also how the gender seems to always weigh heavily towards survival in both cases here. Perhaps it is due to the combinatorics of the other feature values, but given how the LIME weights vs predictions seem to vary across datapoints, I wouldn’t be so sure.

Catboost

Feature Weights from Classifier / Permutations

Model feature weights based on model internals:

Catboost Feature Weights

Based on permutations:

Catboost Permutation Weights

Interestingly, parch shows a negative contribution.

Item 1

First datapoint using Catboost:

LIME Catboost 1

In this case, both the LIME weights for the left (original datapoint) and right (inverted) side seem to be more in line with the predictions. Which sort of shows that I cannot just blame myself for interpreting the figures wrong, since they sometimes seem to match the intuition, and other times not..

As opposed to the LGBM case/section above, in this case (for Catboost) the top LIME features actually seem to follow almost exactly the feature weights from the model internal statistics. For LGBM it was the other way around: they were not following the internal weights but rather the permutation weights. As confusing as everything else about these weights, yes.

Item 2

The second datapoint using Catboost:

LIME Catboost 2

In this case, LIME is giving very high weights for variables on the side of survived, while the actual classifier is almost fully predicting non-survival. Uh oh..

XGBoost

Feature Weights from Classifier / Permutations

Model feature weights based on model internal statistics:

XGB Feature Weights

Based on permutations:

XGB Permutation Weights

Item 1

First datapoint explained for XGBoost:

LIME XGB 1

In this case, the left one seems to indicate not-survived quite heavily in the weights, but the actual predictions are quite even between survived and not survived. On the right side, the weights and predictions are more in line, with the LIME feature weights seeming to match the prediction.

As for LIME weights vs the global importances from model internals and permutations, in this case they seem to be mixed. Some LIME top features are shared with the top feature weights from model internals, some are shared with permutations. Compared to the previous sections, the LIME weights vs model and permutation weights seem to be all over the place. Which might be some attribute of the algorithms in the case of the internal feature weights, but I would expect LIME to be more consistent with regards to permutation weights, as that algorithm never changes.

Item 2

Second datapoint:

LIME XGB 2

Here, the left one seems to indicate survival much more strongly in the weights, and non-survival in the actual prediction. On the right side, the weights and predictions seem more in line again.

Explaining a Keras NN Classifier

This section uses a different dataset, on Cleveland heart disease risk. The inverted variable in this case is not gender but the cp variable, since it seemed to be the highest scoring categorical binary variable for LIME on the datapoints I looked at.

Features:

  1. age: age in years
  2. sex: (1 = male; 0 = female)
  3. cp: chest pain type (4 values)
  4. trestbps: resting blood pressure in mm Hg on admission to the hospital
  5. chol: serum cholestoral in mg/dl
  6. fbs: fasting blood sugar > 120 mg/dl
  7. restecg: resting electrocardiographic results (values 0,1,2)
  8. thalach: maximum heart rate achieved
  9. exang: exercise induced angina (1 = yes; 0 = no)
  10. oldpeak: ST depression induced by exercise relative to rest
  11. slope: the slope of the peak exercise ST segment
  12. ca: number of major vessels (0-3) colored by fluoroscopy
  13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

Feature Weights from Permutations

Keras does not provide feature weights based on model internal statistics, being a generic neural network framework, as opposed to specific algorithms such as the boosters above. But permutation based feature weighting is always an option:

Keras Permutation Weights
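For reference, a simple manual permutation importance loop for a Keras model looks roughly like the following sketch (assuming a binary classifier `model` with a single sigmoid output, and numpy validation arrays `X_val`, `y_val`; not the exact code behind the figure above):

import numpy as np

def keras_permutation_importance(model, X_val, y_val, n_repeats=5, seed=42):
    rng = np.random.default_rng(seed)
    baseline = np.mean((model.predict(X_val)[:, 0] > 0.5) == y_val)
    importances = np.zeros(X_val.shape[1])
    for col in range(X_val.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            rng.shuffle(X_perm[:, col])          # break this feature's relation to the target
            acc = np.mean((model.predict(X_perm)[:, 0] > 0.5) == y_val)
            drops.append(baseline - acc)         # accuracy drop = importance
        importances[col] = np.mean(drops)
    return importances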

Training Curves

Training curves are always nice, so here you go:

Keras Training

Item 1

First datapoint explained by LIME for Keras:

LIME Keras 1

This one is predicting almost fully no risk for both datapoints. Yet the weights seem to be pointing almost fully to the heart disease risk side.

The LIME weights compared to the global permutation weights share the same top 1-2 features, with some changes after.

Item 2

Second datapoint explained by LIME for Keras:

LIME Keras 2

In this case, the predictions and weights are more mixed on both sides. The right side seems to have the weights much more on the no risk side than the left side, yet the change between the two is that the prediction has shifted more towards the heart disease risk side.

In this case, the features are quite different from the first datapoint, and also from the global weights given by permutation importance. Since LIME aims to explain single datapoints and not the global model, I don’t see an issue with this. However, I do see an issue in not being able to map the LIME weights to the predictions in any reasonable way. Not consistently at least.

Explaining an XGBoost Regressor

Features in the Ames Housing Dataset used in this section:

  • SalePrice – the property’s sale price in dollars. This is the target variable that you’re trying to predict.
  • Utilities: Type of utilities available
  • OverallQual: Overall material and finish quality
  • GrLivArea: Above grade (ground) living area square feet
  • ExterQual: Exterior material quality
  • Functional: Home functionality rating
  • KitchenQual: Kitchen quality
  • FireplaceQu: Fireplace quality
  • GarageCars: Size of garage in car capacity
  • YearRemodAdd: Remodel date
  • GarageArea: Size of garage in square feet

Item 1

LIME XGBReg 1

As discussed here, LIME results seem more intuitive to reason about for classification than for regression. For regression, it should show some relative value of how the feature values contribute to the predicted regression value. In this case, how the specific feature values are predicted to impact the house price.

But as mentioned, the meaning of this is a bit unclear. For example, what does it mean for something to be positively weighted? Or negatively? In regards to what? This would require more investigation, but I will stick to more details on classification in this post.
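For reference, the regression case only differs in the explainer mode and the prediction function; a sketch with placeholder names for the Ames data and a fitted `xgb_regressor`:

from lime.lime_tabular import LimeTabularExplainer

reg_explainer = LimeTabularExplainer(
    training_data=X_train.values,
    feature_names=list(X_train.columns),
    mode="regression",
)

exp = reg_explainer.explain_instance(
    X_test.values[0],
    xgb_regressor.predict,      # plain predictions instead of class probabilities
    num_features=8,
)
print(exp.as_list())            # weights are relative pushes on the predicted price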

Data Distribution

Just out of interest, here is a description of the data distribution for the features shown above.

XGBReg Distribution

One could perhaps analyze how the feature value distributions relate to the LIME weights for those variables, and use that as a means to analyze the LIME results further in relation to the predicted price. Maybe someday someone will.. 🙂

Conclusions

Compared to all the global feature weights given by the model internal statistics and the permutations, the LIME results often share some of the top features. And comparing explanations for different datapoints using the same algorithm, there appear to be some changes in which features LIME ranks highest per datapoint. Overall, this all makes sense considering what LIME is supposed to be: explaining individual datapoints, where the globally important features likely often (and on average should) rank high, but where single points can vary.

LIME in general seems like a good way to visualize feature importances for a datapoint. I like how the features are presented as weighting in one direction vs the other. The idea of trying values close to a point to come up with an explanation also seems to make sense. However, many of the results I saw in the above experiments do not quite seem to make sense. The weights presented often seem to be opposed to the actual predictions.

This book chapter hosts some good discussion on the limitations of LIME, and maybe it explains some of this. The book chapter ends up saying to use great care in applying LIME, and notes how the LIME parameters impact the results and explanations given. Which seems in line with what I see above.

Also, many of the articles I linked in the beginning simply gloss over the interpretation of the results and whether they make sense, or make seemingly strange assumptions. Such as this one, which gives me the impression that the explanation weights would change depending on which class the classifier predicts with higher probability. For me, this does not seem to be what the visualizations show.

Maybe it would be more useful to understand the limitations and not expect too much, even if I feel like I don’t necessarily get all the details. I expect it is either poorly explained, or I did get the details and it is just very limited. This is perhaps coming from the background of LIME itself, where the academics must sell their results as the greatest and best in every case, and put aside their limitations. This is how you get your papers accepted, and cited, leading to more grants and better tenure positions..

I would not really use LIME. Mostly because I cannot see myself trusting the results very much, no matter what the sales arguments. But overall, it seems like interesting work, and perhaps something simpler (to use) will be available someday, where I feel like having more trust in the results. Or maybe the problem is just complicated. But as I said, these all seem like useful steps in the direction of improving the approaches and making them more usable. Along these lines, it is also nice to see these being integrated as part of ML platform offerings and services.

There are other interesting methods for similar approaches as well. SHAP is one that seems very popular, and Eli5 is another. Some even say LIME is a subset of SHAP, which should be more complete vs the sampling approach taken by LIME. Perhaps it would be worth the effort to make a comparison some day..

That’s all for this time. Cheers.

AWS EC2 / RDS Pricing and Performance

AWS EC2 pricing seems complicated, and many times I have tried to figure it out. Recently I was there again, looking at it from the RDS perspective (same pricing model for the underlying EC2 instances), so here we go.

Amazon/AWS calls their virtual machines Elastic Compute Cloud (EC2). I used to think about the "elastic" part in terms of being able to scale your infrastructure by adding and removing VMs as needed. But I guess scaling is also relevant in terms of allocating more or less compute on the same VM, or considering how many compute tasks a single hardware host in AWS can handle at the same time. Let’s see why…

Basic Terminology

What is compute in EC2? AWS used to use a term called Elastic Compute Unit (ECU). As far as I can tell, this has been largely phased out. Now they measure their VM performance in terms of virtual CPU (vCPU) units.

So what is a vCPU? At the time of writing this, it is defined as "a thread of either an Intel Xeon core or an AMD EPYC core, except for M6g instances, A1 instances, T2 instances, and m3.medium.". A1 and M6g use AWS Graviton and Graviton 2 (ARM) processors, which I guess is a different architecture (no hyperthreading?). T2 is not described in as much detail, except as an Intel 3.0 GHz or 3.3 GHz processor (older, no hyperthreading?). Anyway, I go with vCPU meaning a (hyper)thread allocated on an actual CPU host. Usually this would not even be a full core but a hyperthread.

There are different types of vCPUs on the instances, as they use different physical CPUs. But that is a minor detail. The instance types are more relevant here:

  • burstable standard,
  • burstable unlimited, and
  • fixed performance.

OK then, what are they?

Fixed Performance

The fixed performance instance type is the simplest. It is always allocated its vCPUs in full. A fixed performance instance with 2 vCPUs can run those 2 vCPUs (hyperthreads) at up to 100% CPU load, with no extra charge, at all times. The price is always fixed. If you don’t need the full 100% CPU power at all times, a burstable instance can be cheaper. But only if you don’t "burst" too much, in which case the burstable type becomes more expensive.

Burstable Standard

The concept of a burstable instance is what I find a bit complex. There is something called the baseline performance. This is what you always get, and is included in the price.

On top of the baseline performance, for burstable instances, there is something called CPU credits. Different instance types get different numbers of credits. Here are a few examples (at the time of writing this..):

Instance type   Credits / h   Max. credits   vCPUs   Mem.   Baseline perf.
T2.micro        6             144            1       1GB    10%
T2.small        12            288            1       2GB    20%
T2.large        36            864            2       8GB    20% * 2
T3.micro        12            288            2       1GB    10% * 2
T3.small        24            576            2       2GB    20% * 2
T3.large        36            864            2       8GB    30% * 2
M4.large        -             -              2       8GB    200%

Baseline performance

I will use the T2.micro from the table above as an example. The same concepts apply to other instance types as well; just change the numbers.

T2.micro baseline performance is 10%, and there is a single vCPU allocated, referring to a single hyperthread. The 10% baseline refers to being able to use 10% of the maximum performance of this hyperthread (vCPU).

CPU credits

Every hour, a T2.micro gets 6 CPU credits. If the instance runs at or below the baseline performance (10% here), it saves these credits for later use, up to a maximum of 144 saved credits for a T2.micro. These credits are always awarded, but if your application load is such that the instance can use more than the 10% baseline performance, it will spike to that higher load as soon as a CPU credit is allocated, and consume the credit immediately.

A credit is used up in full if the instance runs at 100%, and in part if it runs higher than baseline but lower than the maximum 100% performance. If multiple vCPUs are allocated to an instance, and they all run higher than baseline, they will use multiple amounts of the CPU credits.

Well, that is what the sites I linked above say. But here is an example, where I ran a task on a T2.micro instance after it had been practically idle for more than 24 hours. So it should have had the full 144 CPU credits at this point.

T2 load chart

In the above chart, the initial spike around midnight is about 144 minutes, although the chart timeline is too coarse to show it. It is from an RDS T2.micro instance, under heavy write load (I was writing as much as I could all the time, from another EC2 T2.micro instance). So the timeline of 144 minutes seems consistent with the credit numbers. But the CPU percentage shown here is not, since 10% should be the baseline.. uh. It could also be that the EC2 instance responsible for loading the data into the above RDS instance is hitting the same CPU credit limit, and thus the amount of data injected for writing is also limited. Will have to investigate more later, but the shape illustrates the performance throttling and CPU credit concepts.

Considering the baseline, the T2.micro is practically an instance running at 10% of the single thread performance of a modern server processor. Does not seem much. To me, the 1 vCPU definition actually seems rather misleading, as you don’t really get a vCPU but rather 10% of one. Given 60 minutes in an hour, and 6 CPU credits awarded to a T2.micro per hour, you get about one credit every 60/6 = 10 minutes. If you save up and run at low performance load for 24 hours (144 * 10 = 1440 minutes = 24 hours), you can then run for 144 minutes (2 hours 24 minutes) at 100% CPU load. In spikes of about 10 minutes, you can run for the equivalent of one minute at 100% load.
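The same back-of-the-envelope arithmetic in a small snippet, using the rule of thumb that one CPU credit is roughly one vCPU-minute at 100% load (and ignoring the credits earned during the burst itself, to keep the numbers simple):

credits_per_hour = 6        # T2.micro earns 6 credits per hour
max_credits = 144           # T2.micro credit cap

minutes_per_credit = 60 / credits_per_hour             # ~10 minutes to earn one credit
hours_to_fill_bank = max_credits / credits_per_hour     # 24 hours at/below baseline to hit the cap
burst_minutes_at_full_load = max_credits                # ~144 minutes at 100% on 1 vCPU

print(minutes_per_credit, hours_to_fill_bank, burst_minutes_at_full_load)
# 10.0 24.0 144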

T2.micro instances are described as "High frequency Intel Xeon processors", with "up to 3.3 GHz Intel Scalable Processor". So the EC2 T2.micro instance is actually 10% of a single hyperthread on a 3.3 GHz processor. About equal to a 330 MHz single hyperthread.

The bigger instances can have multiple vCPUs allocated, as shown in the table above. They also get a bit more credits, and have a higher baseline performance percentage. The performance percentage is per vCPU, so an instance with 2 vCPUs and a baseline performance of 20% actually has a baseline performance of 2 * 20%. In this case, you are getting two hyperthreads at 20% of the CPU max capacity.

I still have various questions about this type, such as: do you actually use a fraction of the instance CPU credit, or do you use it in full when going over the baseline? Can the different threads (over multiple vCPUs) share the total of 2*20% = 40%, or is it just 20% per vCPU, with anything above that counting as over baseline regardless of whether the other thread is idling or not? But I guess I have to settle for this: burstable is complicated, fixed is simpler to use. Moving on.

Burstable Unlimited

The burstable instances can also be set to unlimited burstable mode.

In this mode, the instance can run (burst) at full performance all the time, not just as limited by accumulated CPU credits. However, you still gain CPU credits as with burstable instances. In comparison to the standard burstable type, if you use more CPU credits than you have, in unlimited mode you will just be billed extra for those. You will not be throttled by available credits; rather, you can rack up nice extra bills.

If the average utilization rate is higher than the baseline plus available CPU credits, over a rolling 24-hour window or over the instance lifetime (if less than 24h), you will be billed for each vCPU hour used over that measure (baseline average + CPU credits).

Each vCPU hour above the extra billing threshold costs $0.05 (5 US cents). Considering the cost difference, this seems potentially quite expensive. Let’s see why.

Comparing Prices

What do you actually get for the different instances? I used the following as basis for calculations:

  • T2: 3.0/3.3GHz Xeon. AWS describes T2.small and T2.medium as "Intel Scalable (Xeon) Processor running at 3.3 GHz", and T2.large at 3.0 GHz. A bit strange numbers, but I guess there is some legacy there (more cores at less GHz?).
  • T3: 3.1GHz Xeon. AWS describes this as "1st or 2nd generation Intel Xeon Platinum 8000", and "sustained all core Turbo CPU clock speed of up to 3.1 GHz". My interpretation of 3.1 GHz might be a bit high, as the description says "boost" and "up to", but I don’t have anything better to go with.
  • M5: 3.1GHz Xeon. Described the same as T3, "1st or 2nd generation Intel Xeon Platinum 8000", and "up to 3.1 GHz"..

Instance type   CPU GHz   Base perf   Instance MHz    vCPUs   Mem.   Price/h
T2.micro        3.3       10%         330 MHz         1       1GB    $0.0116
T2.small        3.3       20%         660 MHz         1       2GB    $0.0230
T2.large        3.0       20% * 2     600 MHz * 2     2       8GB    $0.0928
T2.large.unl    3.0       200%        3000 MHz * 2    2       8GB    $0.1428
T3.micro        3.1       10% * 2     310 MHz * 2     2       1GB    $0.0104
T3.small        3.1       20% * 2     620 MHz * 2     2       2GB    $0.0208
T3.large        3.1       30% * 2     930 MHz * 2     2       8GB    $0.0832
T3.large.unl    3.1       200%        3100 MHz * 2    2       8GB    $0.1332
M5.large        3.1       200%        3100 MHz * 2    2       8GB    $0.0960

I took the above prices from the AWS EC2 pricing page at the time of writing this. Interestingly, the AWS pricing seems so complicated that they cannot keep track of it themselves. For example, T3 has one price on the above page, and another on the T3 instance page. The latter lists the T3.micro price at $0.0209 / hour as opposed to the $0.0208 above. Yes, it is a minimal difference, but it just shows how complicated this gets.

The table above represents the worst-case scenario, where you run your instance at 100% performance as much as possible. It also does not include the burstable instances being able to run at up to 100% CPU load for short periods as they accumulate CPU credits. And with the unlimited burstable types, you can get by with less if you run at or under the baseline. But, as the AWS docs note, the unlimited burstable is about 1.5 times more expensive than the fixed size instance (T3 vs M5).

Strangely, T2 is more expensive than T3, while the T3 is more powerful. So I guess other than free tier use, there should be absolutely no reason to use T2, ever. Unless maybe for some legacy dependency, or limited availability.

Conclusions

I always thought it was so nice of AWS to offer a free tier, and how could they afford giving everyone a CPU to play with? Well, it turns out they don’t. They just give you one tenth of a single thread on a hyperthreaded CPU. This is what a T2.micro is in practice. I guess it can be useful for playing around and getting familiar with AWS, but yeah, the marketing is a bit of.. marketing? Cute.

Still, the price difference per hour from T2.large ($0.0928) or T3.large ($0.0832) to M5.large ($0.0960) seems small. Especially the difference between the T2 and M5 is so small it seems to make no sense. So why go bursty, ever? With the T3 you are saving about 15%. If you have bursty workloads and need to be able to handle large spikes, on a really large set of servers, maybe it makes sense. Or if your load is very low, you can get smaller (fractions of a CPU) instances using the bursty mode. But it seems to me that it requires a lot of effort to profile your loads, make predictions, and monitor and manage it all.
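Just to double-check the relative differences, using the prices from the table above:

t2_large = 0.0928
t3_large = 0.0832
m5_large = 0.0960

print(1 - t3_large / m5_large)   # ~0.13, roughly the "about 15%" saving vs M5
print(1 - t2_large / m5_large)   # ~0.03, T2.large is barely cheaper than M5.large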

In most cases I would actually expect something like Lambda functions to be the really best fit for those types of cases. Scaling according to the need, clear pricing (which seems like a miracle in AWS), and a simple operational model. Sounds just great to me.

In the end, comparing the burstable vs fixed performance instances, it just seems silly to me to be paying almost the same price for such a complicated burstable model, with seemingly much worse performance. But like I said, for big shops and big projects, maybe it makes more sense. I would be really interested to hear some concrete and practical experiences and examples on why to use one over the other (especially the bursty instances).

Python Class vs Instance variables

Recently I had the pleasure of learning about Python class vs instance variables. Coming from other programming languages, such as Java, this was quite different for me. So what are they?

I was working on my Monero scraper, so I will just use that as the example, since that is where I had the fun as well..

Class variables

Monero is a blockchain. A blockchain consists of linked blocks, which contain transactions. Each transaction further contains various attributes, the most relevant here being tx_in and tx_out type elements. These simply describe actual Monero coins being moved in / out of a wallet in a transaction.

So I made a Transaction class to contain this information. Like this:

from typing import List

class Transaction:
    fee = None
    block_height = None
    tx_ins: List[TxIn] = []
    tx_outs: List[TxOut] = []

    def __init__(self, fee, block_height):
        self.fee = fee
        self.block_height = block_height

I figured this should match a traditional Java class like this:

public class Transaction {
    int fee;
    int blockHeight;
    List<TxIn> txIns = new ArrayList<>();
    List<TxOut> txOuts = new ArrayList<>();

    public Transaction(int fee, int blockHeight) {
        this.fee = fee;
        this.blockHeight = blockHeight;
    }
}

Of course, it turned out I was wrong. A class variable in Python is actually more like a static variable in Java. So, in the above Python code, all the variables in the Transaction class are shared by all Transaction objects. Well, actually only the lists are shared in this case. But more on that in a bit.

Here is an example to illustrate the case:

t1 = Transaction(1, 100)
t1.tx_ins.append(TxIn(1, 1, 1, 1))
t2 = Transaction(1, 100)
t2.tx_ins.append(TxIn(1, 1, 1, 1))

print(t1.tx_ins)
print(t2.tx_ins)

I was expecting the above to print out a list with a single item for each transaction. Since I only added one to each. But it actually prints two:

[<monero.transaction.TxIn object at 0x109ceee10>, <monero.transaction.TxIn object at 0x11141abd0>]
[<monero.transaction.TxIn object at 0x109ceee10>, <monero.transaction.TxIn object at 0x11141abd0>]

There was something missing for me here, which was understanding the instance variables in Python.

Instance variables

So what makes an instance variable an instance variable?

My understanding is, the difference is setting it in the constructor (the __init__() method):

class Transaction:
    fee = None
    block_height = None
    tx_ins: List[TxIn] = []
    tx_outs: List[TxOut] = []

    def __init__(self, fee, block_height):
        self.fee = fee
        self.block_height = block_height
        self.tx_ins = []
        self.tx_outs = []

Compared to the previous example, the only difference in the above is that the list values are assigned (again) in the __init__ method. Here is the result of the previous test with this setup:

[<monero.transaction.TxIn object at 0x107447e50>]
[<monero.transaction.TxIn object at 0x108ea2d10>]

So now it works as I intended, with each transaction holding its own set of tx_ins and tx_outs, since they became instance variables.

I used the above Transaction structure when scraping the Monero blockchain. Because I originally had the tx_ins and tx_outs initialized as lists at class variable level, adding new values to these lists actually just kept growing the shared (class variable) lists forever. Which was not the intent.

Because I expected each Transaction object to have a new, empty set of lists. Of course, they didn’t, but rather the values just accumulated in the shared (class variable) lists. As I inserted the transactions one at a time into a database, the number of tx_ins and tx_outs for later transactions in the blockchain kept growing and growing, as they now contained also all the values of previous transactions. Hundreds of millions of inserted rows later..

After fixing the variables to be instance variables, the results and counts make sense again.

Gotcha

Even with the above fix to use instance variables for the lists, I still ran into an issue. I typoed the variable name in the constructor:

class Transaction:
    fee = None
    block_height = None
    tx_ins: List[TxIn] = []
    tx_outs: List[TxOut] = []

    def __init__(self, fee, block_height):
        self.fee = fee
        self.block_height = block_height
        self.tx_inx = []
        self.tx_outs = []

In the above I typoed self.tx_inx instead of self.tx_ins. Because the class level tx_ins is already initialized as an empty list, it gave no errors but the objects kept accumulating as before for the tx_ins part. Brilliant.

So I ended up with the following approach (for now):

class Transaction:
    fee = None
    block_height = None
    tx_ins: List[TxIn] = None
    tx_outs: List[TxOut] = None

    def __init__(self, fee, block_height):
        self.fee = fee
        self.block_height = block_height
        self.tx_inx = []
        self.tx_outs = []

This way, if I typo the instance variable in the __init__ method, the class variable stays uninitialized, and I will get a runtime error in trying to use it (as the class variable value is None).

The main reason I am doing the variable initialization like I am here, is to get the variables defined for IDE autocompletion, and to be able to add type hints to them, for further IDE assistance and checking. There might be other ways to do it, but this is what I figured so far..

When I was looking into this, I also found this Stack Overflow post on the topic. It points to other optional ways to specify type hints for instance variables (e.g., typing the parameters to the constructor). There is also a pointer to Python Enhancement Proposal PEP 526, which references other PEPs, but let’s not go into all of those..
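One alternative hinted at by PEP 526, which I have not tried in the actual scraper, is to use annotation-only declarations at class level. They keep the type hints for the IDE but do not create any shared class-level lists, so a typoed instance variable shows up as an AttributeError on first use. A sketch, with the types here just illustrative:

from typing import List

class Transaction:
    fee: int                     # annotation only, no class attribute created
    block_height: int
    tx_ins: List["TxIn"]         # forward references, so TxIn/TxOut need not be defined here
    tx_outs: List["TxOut"]

    def __init__(self, fee, block_height):
        self.fee = fee
        self.block_height = block_height
        self.tx_ins = []
        self.tx_outs = []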

I cannot say I have 100% assurance of all the possibilities related to these annotations, and instance vs class variables, but I think I got a pretty good idea.. If you have any pointers to what I missed or misinterpreted, please leave a comment 🙂

DevOps – What does it mean?

These days people talk a lot about "DevOps". Well, in IT they do. But the definition of DevOps seems quite elusive to me. Originally I thought about it as just Development and Operations working closer together. As in DEVelopment and OPerationS. Of course, this is a quite vague definition, as I have noticed. Over time, I have seen many quite different definitions.

The most popular definition I seem to come across is one where people call a tools (or platform?) team a "DevOps" team. This is a team running the Continuous Integration (CI) systems, and providing some common development (and test) infrastructure. Or any other type of development platform service, such as software based infrastructure setup (containers, clouds, …).

I have also seen roles where people call themselves "DevOps Engineers", and they are really writing the product code, building it, testing it, operating it, … So pretty much doing everything you can do. Must be nice for the employer. Well, as long as you enjoy your job..

In some places there is the development team, and then there is a "DevOps" team working on infrastructure setup scripts using tools like Terraform. Which, BTW, is a nice tool like most from Hashicorp. At the same time, the dev team might be running all their code on whatever setup they have available, typically on their laptops using some combinations of Docker and whatnot.

This seems like a large disconnect to me. But on a second thought, maybe it makes sense (to some extent). You have to develop it somewhere, so you need some setup locally. This makes me think, it might be useful to have some model of a DevOps lifecycle, like what points in a project lifecycle do the Dev and Ops benefit most from collaboration, and what kind of collaboration is best suited at different times? Where do the QA and security best fit in?

Overall, it seems there is no single right definition for DevOps. Whatever works for you, I guess. Except that many could learn a lot from others and improve, if they had an open enough mindset to realize it. And if they supported people in initiatives to learn and improve related processes, tools, techniques, … (..)

Regarding such different approaches, and benefits and problems, I find the DevOps Topologies website has some good definitions for both "good" DevOps and anti-DevOps teams. Their definition of anti-DevOps seems to be largely about keeping the Dev and Ops separate but still calling it "DevOps". Because it’s trendy I guess. Somehow this does not surprise me..

The types listed as working better on DevOps Topologies on the other hand seem to be focused on building more overlap between the Dev and Ops. I guess it depends on the type of work and organization that is in question. There is something that seems related in Team Topologies but the website is a bit vague. Maybe I should get the book, but then I have an overly long reading list already. And somehow manage to distract myself from reading.. I used to have more chance for reading when I had some business trips, with less distractions on a plane, hotel, etc. But I digress.

I find I got the best idea of what DevOps is from reading the book The Phoenix Project. It is from 2013, so already about 7 years old at the time of this writing, but every description in it seems perfectly fine for today. Much like the 20+ years old Office Space movie, where you could just change monitors to be flatter, and software UI’s a bit flashier. The main part of office politics would not change, much like it seems to stay the same for DevOps related development processes. But I digress again.

The Phoenix Project felt like a long story to get started, but in the end it gave me a great perspective on what the term might originally have been intended to mean. I interpret it as development working closely with operations (and testing+security) to make sure they share the exact same infrastructure (Terraform etc today), as well as QA sharing it.

Overall, I also interpret it to indicate dev not creating their own setups and throwing stuff over the wall to Ops and QA. Rather all working closely together to figure out issues, build extensive monitoring and logging to make all their lives easier, improve everything overall. Make it all work better and more reliable for the customer and for yourself (less need to get up at 4am (for this..)). And so on.

For me the Phoenix Project story delivered the message much better than all the websites and powerpoints with their diagrams and abstract descriptions. I guess I prefer stories that make things concrete with realistic examples. And yet, as I discussed above, there still seem to be many quite different definitions as well. I guess with something becoming popular this happens, and maybe for different systems and organizations a different approach with the same higher level goal works. I am sure there are plenty expensive consultants for all this with better answers than me :).

So to summarize my brief lamentations here, a few points:

  • DevOps seems to vary quite a bit across organizations, both in how they do it, and what might be a suitable model for them.
  • There seem to be many ways to do DevOps "wrong". Which I guess means just not getting optimal benefit from it.
  • I would be interested to understand better how all this relates to the different software engineering lifecycle. Early development, adding new features, maintenance, …
  • Stories on how Dev, Ops, QA, Security, and anything else have successfully worked together in different companies and software projects would be great to hear.

That’s all for today.

Leave some comments now. Like what do you think DevOps is? And what did I say all wrong? 🙂

Testing Machine Learning Intensive Systems (or Self-Driving Cars) – A Look at the Uber Accident

Previously I looked at what it means to test machine learning systems, and how one might use machine learning in software testing. Most of the materials I found on testing machine learning systems were academic in nature, and as such a bit lacking in practical views. Various documents on the Uber incident (fatally hitting a pedestrian/cyclist) have been published, and I had a look at those documents to find a bit more insight into what it might mean to test overall systems that rely heavily on machine learning components. I call them machine-learning intensive systems. Because.

Accident Overview

There are several articles published on the accident, and the information released for it. I leave the further details and other views for those articles, while just trying to find some insights related to the testing (and development) related aspects here. However, a brief overview is always in order to set the context.

This accident was reported to have happened on the 18th of March, 2018. It involved the Uber test car (a modified Volvo XC90) hitting a person walking with a bicycle. The person died as a result of the impact. This was on a specific test route that Uber used for their self-driving car experiments. There was a vehicle operator (VO) in the Uber car, whose job was to oversee the autonomous car performance, mark any events of interest (e.g., road debris, accidents, interesting objects), label objects in the recorded data, and take over control of the vehicle in case of emergency or other need (system unable to handle some situation).

The records indicate there used to be two VO’s per car, one focusing more on the driving, and one more on recording events. They also indicate that before the accident, the number of VO’s had been reduced to just one. The roles were combined and an update of the car operating console was designed to address the single VO being able to perform both roles. The lone VO was then expected to keep a constant eye on the road, monitor everything, label the data, and perform any other tasks as needed. Use of mobile phone was prohibited during driving by the VO, but the documents for the accident indicate the VO had been eyeing the spot where their mobile device was located, inside a slot on the dashboard. The documents also indicate the VO had several video streaming applications installed, and records from the Hulu streaming service showed video streaming occurring on the VO account at the time of the accident.

The accident itself was a result of many combined factors, where the human VO seems to have put their attention elsewhere at just the wrong time, and the automation system failed to react properly to the pedestrian / cyclist. Some points on the automation system in relation to potential failures:

  • The system kept records of each moving object / actor and their movement history, using their previous movements and position as an aid to predict their future movements. The system was further designed to discard all previous movement (position) history information when it changed the classification of an object. So no history was available to predict movement of an object / actor, if its classification changed.
  • The classification of the pedestrian that was hit kept changing multiple times before the crash. As a result, the system constantly discarded all information related to them, severely inhibiting the system from predicting the pedestrian's movement.
  • The system had an expectation to not classify anything outside a road crossing as a pedestrian. As such, before the crash, the system continuously changed the pedestrian's classification between vehicle, bicycle, or other. This was the cause of losing the movement history. The system was not designed for the possibility of someone walking on the road outside a crossing area.
  • The system had safeguards in place to stop it from reacting too aggressively. A delay of 1 second was in place to delay braking when a likely issue was identified. This delayed automatic braking even at a point where a likely crash was identified (as in the accident). The reasoning was to avoid too aggressive reactions to false positives. I guess they expected the VO to react, and to log issues for improvement.
  • Even when the danger was identified and automated braking started, it was limited to reasonable force to avoid too much impact on the VO. If this maximum braking was calculated as insufficient to avoid impact, the system would brake even less and emit an audio signal for the VO to take over. So if maximum is not enough, slow down less(?). And the maximum emergency braking force was not set very high (as far as I understand..).

Before the crash, the system thus took an overly long time to identify the danger due to bad assumptions (no pedestrian outside crossing). It lost pedestrian movement history due to dropping data on classification change. It waited for 1 second from finally identifying the danger to do anything, and then initiated a slowdown rather than emergency braking. And the VO seemed to be distracted from observing the situation. After the accident, Uber has moved to address all these issues.

There are several other documents on various aspects of the VO, the automation system, and the environment available on the National Transportation Safety Board website for those interested. Including nice illustrations of all aspects.

This was a look at the accident and its possible causes to give some context. Next a look at the system architecture to also give some context of potential testing approaches.

Uber System Architecture

Looking at testing in any domain, understanding the system architecture is important. A look.

Software Modules

The Uber document on the topic lists the following main software modules:

  • Perception: Collects data from different sensors around the car
  • Localization: Combines detailed map data with sensor data for accurate positioning
  • Prediction: Takes Perception output as input, predicts actions for actors and objects in the environment.
  • Routing and Navigation: Uses map data, vehicle status, operational activity to determine long term routes for a given goal.
  • Motion Planning: Generates shorter term motion plans to control the vehicle in the now. Based on Perception and Prediction inputs.
  • Vehicle Control: Executes the motion plan using vehicle communication interfaces.

Hardware

The same Uber document also describes the self-driving car hardware.

The current components at the time of writing the document:

  • Light Detection and Ranging (LIDAR): Measuring distance to actors and objects, 100m+ range.
  • Cameras: multiple cameras for different distances, covering 360 degrees around the vehicle. Both near- and far-range. To identify people and objects.
  • Radar: Object detection, ranging, relative velocity of objects. Forward-, backward-, and side-facing.
  • Global Positioning System (GPS): Coarse position to support vehicle localization (positioning it), vehicle command (to use location / position for control), map data collection, satellite measurements.
  • Self-Driving Computer: A liquid-cooled local computer in the car to run all the SW modules (Perception, Prediction, Motion Planning, …)
  • Telematics: Communication with backend systems, cellular operator redundancy, etc.

Planned components (not installed back then, but in future plans..):

  • Ultrasonic Sensors: Uses echolocation to range objects. Front, back, and sides.
  • Vehicle Interface Module: Seems to be an independent backup module to safely control and stop the vehicle in case of autonomous system faults.

Functionality

Now that we established a list of the SW and HW components, a look at their functionality.

Mapping

The system is described as using very detailed maps, including:

  • Geometry of the road and curbs
  • Drivable surface boundaries and driveways
  • Lane boundaries, including paint lines of various types
  • Bike and bus lanes, parking regions, stop lines, crosswalks
  • Traffic control signals, light sets, and lane and conflict associations (whatever that is? :))
  • Railroad crossings and trolley or railcar tracks
  • Speed limits, constraint zones, restrictions, speed bumps
  • Traffic control signage

Combined with precise location information, the system uses these detailed maps to beforehand "predict" what type of environment lies ahead, even before the Perception module has observed it. This is used to prepare for the expected road changes, anticipate speed changes, and optimize for expected motion plans. For example, when anticipating a tight turn in the road.

Perception and Prediction

The main tasks of the Perception module are described as detecting the environment, actors, and objects. It uses sensor data to continuously estimate the speed, position, orientation, and other variables of the objects and actors, as a basis for making better predictions and plans about their future movement, velocity, and position.

An example given is the turn signals of other cars, which are used to predict their actions. At the same time, all the other data is also recorded and used to predict other, alternative courses for the same car, in case it does not turn even though using a turning signal.

While the Perception module observes the environment (collects sensor data), the Prediction component uses this, and other available, data as a basis for predicting the movement of the other actors, and changes in the environment.

The observed environment can have different types of objects and actors in it. Some are classified as fixed structures, and are expected not to move: buildings, ground, vegetation. Others are classified as more dynamic actors, and expected to move: vehicles, pedestrians, cyclists, animals.

The Prediction module makes predictions on where each of these objects is likely to move in the next 10 seconds. The predictions include multiple properties for each object and actor, such as movement, velocity, future position, and intended goal. The intended goal (or intention) is mentioned in the document, but I did not find a clear description of how this would be used. In any case, it seems plausible that the system would assign "intents" to objects, such as pedestrian crossing a street, a car turning, overtaking another car, going straight, and so on. At least these would seem useful abstractions and input to the next processing module (Motion Planning).

The Prediction module makes predictions multiple times a second to keep an updated representation available. The predictions are provided as input to the Route and Motion Planning module, including the "certainty" of those predictions. This (un)certainty is another factor that the Motion Planning module can use as input to apply more caution to any control actions.

Route and Motion Planning

Motion Planning (as far as I understand) refers to short-term movements, translating to concrete control instructions for the car. Route planning on the other hand refers to long term planning on where to go, and gives goals for the Motion Planning to guide the car to the planned route.

Motion Planning combines information from generated route (Route Planning), perceived objects and actors (Perception), and predicted movements (Prediction). Mapping data is used for the "rules of the road", as well as any active constraints. I guess also combined with sensor data for more up-to-date views in the local environment (the public docs are naturally not super-detailed on everything). Using these, it creates a motion plan for vehicle. Data from Perception and Prediction modules is also used as input, to define the anticipated movements of other objects and actors.

A spatial buffer is defined to be kept between the vehicle and other objects in the environment. My understanding is that this refers to keeping some amount of open space between the car and environmental elements. The size of this buffer varies with variables such as autonomous vehicle speed (and properties and labels of other objects and actors I assume). To preserve the required buffer, the system may take action such as changing lanes, brake, or stop and wait for situation to clear.

The system is also described as being able to identify and track occlusions in the environment. These would be environmental elements, such as buildings or other cars, blocking the view to certain other parts of the environment. These are constantly reasoned about, and the system becomes more conservative in its decisions when occlusions are observed. It aims to be able to avoid actors coming out of occlusions at reasonable speed.

Vehicle Control

The Vehicle Control module executes trajectories provided by the Motion Planning module. It controls the vehicle through communication interfaces. Example controls include steering, braking, turn signals, throttle, and switching gears.

It also tracks any limits set for the system (or environment?), and communicates back to the operation center as needed.

Data Collection and Test Scenarios

Since my point with this "article" was to look into what it might mean to test a machine learning intensive system, I find it important to also look at what type of data is used to train the machine learning systems, and how all the used data is collected. And how these are used as part of test cases (in the Uber documents they seem to call them test scenarios). Of course, such complex systems use this type of data for many different purposes besides just the machine learning part, so it is generally interesting as well.

The Uber document describes data uses including system performance analysis, quality assurance, machine teaching and testing, simulated environment creation and validation, software development, human operator training and assessment, and map building and validation.

Data Collection

Summarizing the various parts related to data collection and synthesis from the Uber descriptions: at the heart of all this is the real-world training data collected by the VOs driving around, the car and automated sensors collecting detailed data, and the VOs tagging the data. This tagging also helps further identify new scenarios, objects, and actors. The sensor data comes from the sensors I listed above in the HW section.

Additionally, the system is listed as recording:

  • telemetry (maybe refers to metrics about network? or just generally to transferring data?)
  • control signals (commands for vehicle control?)
  • Controller Area Network (CAN) messages
  • system health, such as
    • hard drive speeds
    • internal network performance
    • computer temperatures

The larger datasets are recorded in onboard (car) storage. Smaller amounts of data are transmitted in near real-time using over-the-air (OTA) interfaces over cellular networks to the Uber control center. These use multiple cellular networks for cybersecurity and resiliency purposes. The OTA data includes insights on how the vehicles are performing, where they are, and their current state.

Scenario Development

In the documents (Uber and another from the RAND corporation), the operational environment of the autonomous vehicle is referred to as the operational design domain (ODD). Defining the ODD is quite central to the development (as well as testing) of the system and training the ML algorithms, as well as the controlling logic based on those. It defines the world in which the car operates, and all the actors and objects, and their relations.

The Uber document describes using something called scenarios as test cases. Well, it mostly does not mention the word "test case", but for practical purposes this seems to be similar. Of course, this is quite a bit more complex than a traditional software test case with simple inputs and outputs, requiring descriptions of complex real-world environments as inputs, and boundaries and profiles of accepted behaviour as outputs, rather than specific data values. These complex real-world inputs and outputs also vary over time, unlike the typically static input values of traditional software tests. Thus, a time-series aspect is also relevant to the inputs and outputs.

Uber describes a unified schema being used to describe the scenarios and data. Besides the collected data and learned models, other data inputs are also used, such as operational policies. Various success criteria are defined for each scenario, such as speed, distance, and description of safe behaviour.
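
To make the scenario schema and success criteria ideas a bit more concrete, a single scenario could be imagined as something like the following structure. This is purely my own sketch of what such a schema might contain, not the actual Uber schema:

scenario = {
    "id": "pedestrian-crossing-001",
    "description": "Pedestrian crosses the street away from a crosswalk",
    "environment": {"map_area": "downtown_block_12", "weather": "clear", "time_of_day": "night"},
    "actors": [
        {"type": "pedestrian", "start": (10.0, 2.0),
         # time-series input rather than a single static value
         "trajectory": [(10.0, 2.0), (9.0, 2.5), (8.0, 3.0)]},
    ],
    "operational_policies": ["max_speed_40kmh"],
    # outputs are boundaries / profiles of accepted behaviour, not exact values
    "success_criteria": {
        "min_distance_to_pedestrian_m": 2.0,
        "max_speed_kmh": 40,
        "no_traffic_rule_violations": True,
    },
}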

When new actors, environmental elements, or other similar items are encountered, they are recorded and tagged for further training of the autonomous system. The resulting definitions and characterization of the ODD is then used as input to improve the test scenarios and create new ones. This includes improving the test simulations, and test tracks for coverage.

Events such as large deviations between consecutive planned trajectories are recorded and automatically tagged for investigation. Simulations are used to evaluate whether the identified issues are fixed, and the new scenarios are added to ML training datasets, or used as "hard test cases". This seems a bit similar to the Tesla "shadow mode" I discussed earlier, just a bit more limited.

Test Coverage

Besides a general overview of the scenario development, the Uber documents do not really discuss how they handle test coverage, or what types of tests they run. There are some minor references but nothing very concrete. It is more focused on describing the overall system, and some related processes. I tried to collect here some points that I figured seemed relevant.

A key difference to more traditional software systems seems to be how these types of systems do not have a clearly defined input or output space. The interaction interfaces (API/GUI) of traditional software systems naturally define some contract for what type of input is allowed and expected. With these it is possible to apply traditional techniques such as category partitioning, boundary analysis, etc. When your input space is the real world and everything that can happen in it, and your output space is all the possible actions in relation to all the possible environmental configurations, it gets a bit more complex. In a similar comment, Uber describes their system as requiring more testing with different variations.

Potential Test Scenarios from Uber Docs

These are just points I collected that I thought would illustrate something related to test scenarios and test coverage.

Uber describes evaluating their system performance in different common and rare scenarios, using measurements such as traffic rule violations and vehicle dynamic attributes. In practice, this means having very few crash and unsafe scenarios available, but a large number of safe scenarios. That is, when the scenarios are based on real-world use and data, there are commonly many more "safe" scenarios available than "unsafe" ones, due to the rarity of crashes, accidents, and other problem cases compared to normal operations.

With only this type of highly biased dataset available, I expect there is a need to synthesize more extensive test sets, or to use other methods to test and develop such systems more extensively. The definition of safety also does not seem to be a binary decision, but rather there can be different scales of "safe", depending on the safety attribute. For example, the safety margin of how much distance the autonomous vehicle should keep to other vehicles is a continuous variable, not a binary value. Some variables might of course have binary representations, such as avoiding hitting a pedestrian, or running a red light. But even the pedestrian metric may have similar distance measures, impact measures, etc. So I guess it's a bit more complicated than just safe or not safe.

Dataset augmentation and imbalanced datasets are common issues in developing and training ML models. However, those techniques are (to my understanding) based on a single clear goal such as classification of an object, not on complex output such as overall driving control and its relation to the real world. Thus, I would expect to need overall scenario augmentation types of approaches, more holistic than those for a simple classifier (which on its own might be part of the system).

Some properties I found in the Uber documents (as I discussed above), referring to potential examples of test requirements:

  • Movement of objects in relation to vehicle.

  • Inability of the system to classify a pedestrian correctly if not near a crosswalk.

  • Inability of the system to predict pedestrian path correctly when not classified as pedestrian.

  • Overly strict assumptions made, such as cyclist not moving across lanes.

  • Losing location history of tracked objects and actors if their classification changed.

  • Uber defines test coverage requirements based on collected map data and tags.

  • Map data predicting that the upcoming environment would be of a specific type (e.g., left curve), but it has changed and the observations differ.

  • Another car signals turning left but other predictors do not predict that, and the other car may not actually turn left.

  • Certainty of classifications.

  • Occlusions in the environment.

Abstracting

Looking at the above examples, trying to abstract some more generic concepts that would serve as a potentially useful basis:

  • Listing of known objects / actors

  • Listing of labels for different types of objects / actors

  • Assumptions made about specific types of objects / actors

  • Properties of objects / actors

  • Interaction constraints of objects and actors

  • Probabilities of classifications for different objects / actors and labels

  • Functionality when faced with unknown objects / actors

The above list may be lacking in general details that would cover the different types of systems, or even the Uber example, but I find it provides an insight into how this is heavily about probabilities, uncertainty, and preparing for, and handling, that uncertainty.

For different types of systems, the actual objects, actors, labels and properties would likely change. To illustrate these a bit more concretely with the autonomous car example (a small code sketch follows the list):

  • Objects / Actors, their Properties and Labels

    • Our car
      • Speed, Position, Orientation,
      • Accelerating, Slowing down,
      • Intended goal (turn left, drive forward, change lane, stop, …)
      • Predicted location in 1s, 2s, 5s, …
      • Distance to all other actors / objects
      • Right of way
    • Other car, moving
      • Same as "Our car"
    • Other car, parked
      • Probability of leaving parking mode
    • Pedestrian, moving or stopped (parked)
      • Same as "other car"
      • Crossing street
      • On pedestrian path
    • Cyclist
      • Same as "other car"
      • Crossing street
      • On bicycle path
    • Other object
      • Moving or static
      • Same as others above
    • Traffic light
      • Current light (green, yellow, red)
      • On / off / blinking
    • Traffic sign
      • Type / Meaning
        • Set speed, stop, yield, no parking, …
        • Long term / local effect
    • Building
      • Size, shape, location
    • Occlusion
      • Predicted time of object / actor coming from occlusion
    • Unknown object, moving or parked
      • Much like the other car etc but maybe with unknown goals
  • Interaction constraints

    • Safety margin (distance to our car and other actors) before triggering some action
    • Actions triggered in different constraint states / boundaries
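
Here is the small code sketch mentioned above: a minimal Python structure for listing actors, their labels and properties, and the interaction constraints. All the names are my own, just to illustrate how such a domain characterization could be written down for test generation:

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Actor:
    kind: str                                                 # "our_car", "other_car", "pedestrian", "cyclist", ...
    labels: List[str] = field(default_factory=list)           # e.g. ["moving", "crossing_street"]
    properties: Dict[str, float] = field(default_factory=dict)  # speed, position, right_of_way, ...
    classification_certainty: float = 1.0                     # probability of the classification

@dataclass
class InteractionConstraint:
    description: str                   # e.g. "safety margin to pedestrians"
    min_distance_m: float              # the buffer / margin before triggering an action
    triggered_action: str              # e.g. "brake", "change_lane", "stop_and_wait"

@dataclass
class DomainModel:
    actors: List[Actor]
    constraints: List[InteractionConstraint]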

Something that seems important is also the ability to reason about previously unknown objects and actors to the extent possible. For example, a moving object that does not seem to fit any known category, but has a known movement history, speed, and other variables. Perhaps there would be a more abstract category of a moving object, or some hierarchy of such categories. As well as any of these objects or actors changing their classifications and goals, and how their long-term history should be taken into account to make future predictions.

In a different "machine learning intensive" system (not autonomous cars), one might use a different set of properties, actors, objects, etc. But it seems similar considerations could be useful.

Possible Test Strategies

Once the domain (the "ODD") is properly defined, as above, it seems many traditional testing techniques could be applied. In the Uber documents, they describe performing architecture analysis to identify all potential failure points. They divided faults into three levels: faults in the self driving system on its own, faults in relation to the environment (e.g., at intersections), and faults related to the operational design domain, such as unknown obstacles the system does not recognize (or misclassifies?). This could be another way to categorize a more specific system, or inspiration for other similar systems.

Another part of this type of system could be related to the human aspect. This is somewhat discussed also in the Uber docs, in relation to operational situations for the system: a distracted operator, and a fatigued operator. They have some functionality in place (especially after the accident) to monitor operator alertness via in-car dashcam and attached analysis. However, I will not go into these here.

Testing ML Components

For testing the ML components, I discussed various techniques in a previous blog post. This includes approaches such as metamorphic testing, adversarial testing, and testing with reference inputs. In autonomous cars, this might be visual classifiers (e.g., convolutional networks), or path prediction models (recurrent neural nets etc.), or something else.

Testing ML Intensive Systems

As for the set of properties I listed above, it seems once these have been defined, using traditional testing techniques should be quite useful:

  • combinatorial testing: combine different objects / actors, with different properties, labels, etc., and observe the system behaviour in relation to the set constraints (e.g., safety limits). a small sketch follows this list.
  • boundary analysis: apply to the combinations and constraints from the previous bullet. for example, probabilities at different values. might require some work to define interesting sets of probability boundaries, or ways to explore the (combined) probability spaces. but not that different in the end from more traditional testing.
  • model-based testing: use the above type of variables to express the system state, use a test generator to build extensive test sets that can be used to cover combinations, but also transitions between states and their combinations over time.
  • fault-injection testing: the system likely uses data from multiple different data sources, including numerous different types of sensors. different types of faults in these may have different types of impact on the ML classifier outputs, overall system state, etc. fault-injection testing in all these elements can help surface such cases. think Boeing Max from recent history, where a single sensor failure caused multiple crashes with hundreds of lives lost.
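
Here is the small sketch mentioned in the combinatorial testing bullet: the domain model attributes can be fed to something like itertools.product to enumerate candidate scenario combinations. The attribute values are made up for illustration:

from itertools import product

actor_kinds = ["pedestrian", "cyclist", "other_car", "unknown_object"]
labels = ["moving", "stopped", "crossing_street"]
certainties = [0.3, 0.6, 0.9]          # boundary-ish values for classification certainty
occlusion = [True, False]

# every combination becomes a candidate test scenario to run against the safety constraints
for kind, label, certainty, occluded in product(actor_kinds, labels, certainties, occlusion):
    scenario = {
        "actor": {"kind": kind, "labels": [label], "certainty": certainty},
        "occluded": occluded,
    }
    # in a real setup this would feed a simulator or a model-based test generator
    print(scenario)

In practice a pairwise tool or a model-based test generator would be needed to keep the number of combinations manageable.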

The real trick may be in combining these into actual, complete, test scenarios for unit tests, integration tests, simulators, test tracks, and real-world tests.

Regarding the last bullet above (fault-injection testing), the Uber documents discuss this from the angle of fault-injection training – injecting faults into the system and seeing how the vehicle operator reacts to them. Training them how they should react. This sounds similar to fault-injection testing, and I would expect that they would have also applied the same scenarios more broadly. However, I could not find mention of this.

Regarding general failures, and when they happen in real use, the same fault models can also be used to prepare for and mitigate actual operational faults. The Uber docs also discuss this viewpoint: the system has a set of identified fault conditions and mitigations for when these happen. These are identified by redundant systems and overall monitoring across the system. Example faults (a small fault-injection sketch follows the list):

  • Primary compute power failure
  • Loss of primary compute or motion planning timeout
  • Sensor data delay
  • Door open during driving
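
To illustrate the fault-injection idea against faults such as sensor data delay, one simple approach is to wrap a sensor stream and randomly delay or drop readings, then observe how the downstream modules and mitigations behave. A minimal sketch of my own (not anything from the Uber docs):

import random
import time
from typing import Iterable, Iterator

def inject_sensor_faults(readings: Iterable[dict], delay_prob: float = 0.05,
                         drop_prob: float = 0.02, delay_s: float = 0.5) -> Iterator[dict]:
    """Yield sensor readings, randomly delaying or dropping some of them."""
    for reading in readings:
        if random.random() < drop_prob:
            continue                     # simulate a lost message
        if random.random() < delay_prob:
            time.sleep(delay_s)          # simulate delayed sensor data
        yield reading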

General Safety Procedures

Volvo Safety Features

Besides the Uber self-driving technology, the documents show the Volvo cars having safety features in themselves, an Advanced Driver Assistance System (ADAS), including an automated emergency braking system named "City Safety". It contains a forward collision warning system, alerting the driver about an imminent collision and automatically applying the brakes when it observes a potentially dangerous situation. This also includes pedestrian, cyclist, and large animal detection components. However, these were turned off during autonomous driving mode, and only active in manual mode. Simulation tests conducted by the Volvo Group showed how the ADAS features would have been able to avoid the collision (17 times out of 20) or significantly reduce the collision speed and impact (the remaining 3 times). As part of the post-crash changes, the ADAS system is now activated at all times (along with many other fixes to the issues discussed here).

Information Sharing and Other Domains

The documents on reviews and investigations after the accident include comparisons to safety cultures in many other (safety-critical) domains: Nuclear Power, Transportation (Rail), Aviation, Oil and Gas, and Maritime. While some are quite specific to the domains, and related to higher level process and cultural aspects, there seem to be many quite interesting points one could build on also for the autonomous driving domain. Or other similar ones. Safety has many higher level shared aspects across domains. Regarding my look for testing related aspects, in many cases replacing "safety" with "QA" would also seem to provide useful insights.

One practical example is how (at least) the aviation and transportation (rail) domains have processes in place to collect, analyze, and share information on unsafe conditions and events observed. This would also seem like a useful way to identify relevant test scenarios for testing products in the autonomous driving domain. Given how much effort is required for extensive collection of such data, and how expensive and dangerous it can be, the benefits of sharing seem quite obvious.

Related to this, Uber discusses shared metrics for evaluating the progress of their development. These include disengagements and self-driving miles travelled. While they have used these to signal progress both internally and externally, they also note that such metrics can easily lead to "gaming the system" at the expense of safety or a working system. For example, by becoming overly conservative to avoid disengagements, or by using inconsistent definitions of the metrics across developers / systems.

Uber discusses the need for work on creating more broadly usable safety performance metrics with academic and industry partners. They list how these metrics should be:

  • Specific to different development stages (development, testing, deployment)
  • Specific to different operational design domains, scenarios and capabilities
  • Have comparable metrics for human drivers
  • Applied in validation environments and scenarios for autonomous cars with other autonomous cars from different companies

The Uber safety approach document also refers to more general work towards an automotive safety framework by the RAND corporation. This includes topics such as building a shared taxonomy to form a basis for discussion and sharing across vendors. It also discusses safety metrics, their use across vendors, and the possible issues in the use and possible gaming of such metrics. And many other related aspects of a cross-vendor safety program. Interesting. Seems like lots of work to do there as well.

Conclusions

This was an overly long look at the documents from the Uber accident. I was thinking of just looking at the testing aspect briefly, but I guess it is hard to discuss them properly without setting the whole background and overall context. Overall, the summary is not that complicated. I just get carried away with writing too many details.

However, I found writing this down helped me reason better about the difference between more traditional software intensive systems and these types of new machine-learning intensive systems. I would summarize it as the need to consider everything in terms of probabilities (or uncertainty): the unknown elements in the inputs and outputs, constraints over everything, the complexity of identifying all the objects and actors and their possible intents, and all the relations between all possibilities. But once the domain analysis is done well, and the inputs and outputs are understood, I find that traditional testing techniques such as combinatorial testing, model-based testing, category partitioning, boundary analysis, and fault-injection testing would give a good basis. It might take a bit broader insight to be able to apply them efficiently, though.

As for the Uber approach, it is interesting. I previously discussed the Tesla approach of collecting data from fleets of deployed consumer vehicles. And features such as the Tesla shadow mode, continuously running in the background as the human drives, evaluating whether each decision the autonomous system would have made matches the action the human driver actually took, and how it differs. Not specifically trained VOs as in the Uber case, but regular consumer drivers (so Tesla customers at work helping to improve the product).

The Tesla approach seems much more scalable in general. It might also generalize better as opposed to Uber aiming for very specific routes and building super detailed maps of just those areas. Creating and maintaining such super-detailed maps seems like a challenging task. Perhaps if the companies have very good automated tools to take care of it, it can be easier to manage and scale. I don’t know if Tesla does some similar mapping with the help of their consumer fleet, but would be interesting to see similar documents and compare.

As for other types of machine learning (intensive) systems, there are many variations, such as those using IoT sensors and data to provide a service. Those are maybe not as open-worlded in all possible input spaces. However, it would seem to me that many of the considerations and approaches I discussed here could be applied. Probabilities, (un-)certainties, domain characterizations, relations, etc. Remains interesting to see, perhaps I will find a chance to try someday.. 🙂

Remote Execution in PyCharm

Editing and Running Python Code on a Remote Server in PyCharm

Recently I was looking at an option to run some code on a remote server, while editing it locally. This time on AWS, but generally the ability to do so on any remote server would be nice. I found that PyCharm has this nice option to use a Python SSH interpreter. Give it some SSH credentials, and point it to the Python interpreter on the remote machine, and you should be ready to go. Nice pic about it:

Overview

Sounds cool, and actually works really well. Even supports debugging. A related issue I ran into for pipenv also mentions profiling, pip package management, etc. Great. No, I haven’t tried all the advanced stuff yet, but at least the basics worked great.

Basic Remote Use

I made this simple program to test this feature:

print("hello world")
with open("bob.txt", "w") as bob:
    bob.write("hello.txt")

print("oops")

The point is to print text to the console and create a file. I am looking to see that running this remotely will show me the prints locally, and create the file remotely. This would confirm to me that the execution happens remotely, while I edit, control execution, and see the results locally.

Running this locally prints "hello world" followed by "oops", and a file named "bob.txt" appears (containing the text "hello.txt"). Great.

To try remotely, I need to set up a remote Python interpreter in PyCharm. This can be done via project preferences:

Add interpreter

Or by clicking the interpreter in the status bar:

Statusbar interpreter

On a local configuration this shows the Python interpreter (or pipenv etc.) on my computer. In remote configuration it asks for many options such as remote server IP and credentials. All the run/debugging traffic between local and remote machines is then automatically transferred over SSH tunnels by PyCharm. To start, select SSH interpreter as type when adding new interpreter:

SSH interpreter

Just enter the remote IP/URL address, and username. Click next to enter also password/keyfile. PyCharm will try to connect and see this all works. On the final page of the remote interpreter dialog, it asks for the interpreter path:

Remote Python config

This is referring to the python executable on the remote machine. A simple which python3 does the trick. This works to run the code using the system python on the remote machine.

To run this remote configuration, I just press the run button as usual in PyCharm. With this, PyCharm uploads my project files to the remote server over SSH, starts the interpreter there for the given configuration, and transports back to my local host the console output of the execution. For me it looks exactly the same as running it locally. This is the output of running the above configuration:

ssh://ec2-user@18.195.211.65:22/usr/bin/python3 -u /tmp/pycharm_project_411/hello_world.py
hello world
oops

The first line shows some useful information. It shows that it is using the SSH interpreter with the given IP and username, with the configured Python path. It also shows the directory where it has uploaded my project files. In this case it is "/tmp/pycharm_project_411". This is the path defined in the Project Interpreter settings, in the Path Mappings part, as illustrated in the image higher above in this post (the one with too many red arrows). OK, the attached image has a different number due to playing with different projects, but anyway. To see the files and output:

[ec2-user@ip-172-31-3-125 ~]$ cd /tmp/pycharm_project_411/
[ec2-user@ip-172-31-3-125 pycharm_project_411]$ ls
bob.txt  hello_world.py

This is the file listing from the remote server. PyCharm has uploaded the "hello_world.py" file, since this was the only file I had in my project (under project root as configured for synch in path mappings). There is a separate tab on PyCharm to see these uploads:

Remote synch

After syncing the files, PyCharm has executed the configuration on the remote host, which was defined to run the hello_world.py file. And this execution has created the file "bob.txt" as it should (on the remote host). The output files go in this remote target directory, as it is the working directory for the running Python program.

Another direction to synchronize is from the remote host to local. Since PyCharm provides intelligent coding assistance and navigation on the local system, it needs to know and install the libraries used by the executed code. For this reason, it installs all the packages installed in the remote host Python environment. Something to keep in mind. I suppose it must set up some type of a local virtual environment for this. Haven’t needed to look deeper into that yet.

Using a Remote Pipenv

The above discusses the usage of the standard Python run configuration and interpreter. Something I have found useful for Python environments is pipenv.

So can we also do a remote execution of a remote pipenv configuration? The issue I linked earlier contains solutions and discussion on this. Basically, the answer is yes, we can. We just have to find the pipenv files on the remote host and configure the right one as the remote interpreter.

For more complex environments, such as those set up with pipenv, a bit more is required. The issue I linked before had some actual instructions on how to do this:

Remote pipenv config

I made a directory "t" on the remote host, and initialized pipenv there. Installed a few dependencies. So:

  • mkdir t
  • cd t
  • pipenv install pandas

And there we have the basic pipenv setup on the remote host. To find the pipenv dir on remote host (t is the dir where pipenv was created above):

[ec2-user@ip-172-31-3-125 t]$ pipenv --venv
/home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c

To see what it contains:

[ec2-user@ip-172-31-3-125 t]$ ls /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c
bin  include  lib  lib64  src
[ec2-user@ip-172-31-3-125 t]$ ls /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin
activate       activate.ps1      chardetect        pip     python     python-config
activate.csh   activate_this.py  easy_install      pip3    python3    wheel
activate.fish  activate.xsh      easy_install-3.7  pip3.7  python3.7

To get python interpreter name:

[ec2-user@ip-172-31-3-125 t]$ pipenv --py
/home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin/python

This is just a link to python3:

[ec2-user@ip-172-31-3-125 t]$ ls -l /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin/python
lrwxrwxrwx 1 ec2-user ec2-user 7 Nov  7 20:55 /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin/python -> python3

Use that to configure this pipenv as remote executor, as shown above already:

Remote pipenv config

UPDATE:

Besides automated sync, I found the Pycharm IDE has features for manual upload to / download from the remote server. Seems quite useful.

First of all, the root of the remote deployment dir is defined in Deployment Configuration / Root Path. Under Deployment / Options, you can also disable the automated remote sync. Just set "Update changed files automatically to the default server" to "never". Here I have set the root dir to "/home/ec2-user". Which means the temp directory I discussed above actually is created under /home/ec2-user/tmp/pycharm_project_703/…

Deployment config

With the remote configuration defined, you can now view files on the remote server. First of all, enable View->Tool Windows->Remote Host. This opens up the Remote Host view on the right hand side of the IDE window. The following shows a screenshot of the PyCharm IDE with this window open. The popup window (as also shown) lets you download/upload files between the remote host and the localhost:

Deployment view

In a similar way, we can also upload local files to the remote host using the context menu for the files:

Upload to remote

One can also select entire folders for upload / download. The root path on the remote host used for all this is the one I discussed above (e.g., /home/ec2-user as defined higher above).

Conclusions

I haven’t used this feature on a large scale yet, but it seems very useful. The issue I keep linking discusses one option of using it to run data processing on a large desktop system from a laptop. I also find it interesting for just running experiments in parallel on a separate machine, or for using cloud infrastructure while developing.

The issue also has some discussion about potential pipenv management support coming to PyCharm in the 2020.1 or 2020.2 version. Just speculation, of course. But until then, one can set up the virtualenv using pipenv on the remote host and just use the interpreter path as above to set up the SSH Interpreter. This works to run the code inside the pipenv environment.

Some issues I ran into included PyCharm apparently only keeping a single state mapping in memory for remote and local file diffs. PyCharm synchronizes files very well, and identifies changes to upload new files. But if I change the remote host address, it seems to still think it has the same delta. Not a big issue, but something to keep in mind as always.

UPDATE: The manual sync I added a description for above is actually quite a nice way to bypass the issues with automated sync. Of course it is manual, and using it for uploading everything all the time in a big project is not useful. But for me and my projects it has been nice so far..

That’s all.

Robot Framework by Examples

Introduction

Robot Framework (RF) is a popular keyword driven test framework (at least in Finland it seems to be..). Recently had to look into it again for some potential work related opportunities. Have to say open source is great but the docs could use improvements..

I made a few examples for the next time I come looking:

Installing

To install RF itself, pip does the job in Python. The following installs RF, along with the Selenium keyword library, and Selenium WebDriver for those keywords:

pip3 install robotframework
pip3 install selenium
pip3 install robotframework-seleniumlibrary

Using Selenium WebDriver as an example here, a Selenium driver for the selected browser is needed. For Chrome, one can be downloaded from the Chrome website itself. Similarly for other browsers on their respective sites. The installed driver needs to be on the search path of the operating system. On macOS, this is as simple as adding it to the path. Assuming the driver is in the current directory:

PATH=.:$PATH

So just the dot, which works as long as the driver file is in the working directory when running the tests.

In PyCharm, the PATH can also be similarly added to run configuration environment variables.

General RF Script Structure

RF script elements are separated by a minimum of two spaces of indentation. This applies both to indenting test steps under a test, and to separating keywords from their parameters. There is also the pipe-separated format, which might look a bit fancier, if you like. Sections are identified by three stars *** and a pre-defined name for the section.

The following examples illustrate.

Examples

Built-in Keywords / Logging to console

The built-in keywords are available without needing to import a specific library. Rather, they are part of the built-in library. Simple example of logging some statement to console:

The .robot script (hello.robot in this case):

*** Test Cases ***
Say Hello
    Log To Console    Hello Donkey
    No Operation
    Comment           Hello ${bob}

The built-in keyword "Log To Console" writes the given parameter to the console. A hello world equivalent. To run the test, we can either write code to invoke the RF runner from Python or use the RF command line tools. Python runner example:

from robot.running import TestSuiteBuilder
from robot.api import ResultWriter

#https://robot-framework.readthedocs.io/en/3.0/autodoc/robot.running.html
suite = TestSuiteBuilder().build("./hello.robot")
result = suite.run(output="test_output.xml")
#ResultWriter(result).write_results(report='report.html', log="log.html")
ResultWriter("test_output.xml").write_results(report='report.html', log="log.html")

The "hello.robot" in above is the name of the test script file listed above also.

The strangest thing (for me) here is the writing of the log file. The docs suggest to use the first approach I commented out above. The ResultWriter with the results object as a parameter. This generates the report.html and the log.html.

The problem is, the log.html is lacking all the prints, keywords, and test execution logs. Later on the same docs state that to get the actual logs, you have to pass in the name of the XML file that was created by the suite.run() method. This is the uncommented approach in the above code. Since the results object is also generated from this call, why does it not give the proper log? Oh dear. I don’t understand.

Commandline runner example:

robot hello.robot

This seems to automatically generate an appropriate log file (including execution and keyword trace). There are also a number of command line options available, for all the properties I discuss next using the Python API. Maybe the general / preferred approach? But somehow I always end up needing to do my own executors to customize and integrate with everything, so..

Finally on logging, Robot Framework actually captures the whole stdout and stderr, so statements like print() get written to the RF log and not to the actual console. I found this to be quite annoying, resulting in overly verbose logs with all the RF boilerplate/overhead. There is a StackOverflow answer on how to circumvent this though, from the RF author himself. I guess I could likely write my own keyword to use that if I needed more log customization, but it seems a bit complicated.

Tags and Critical Tests

RF tags are something that can be used to filter and group tests. One use is to define some tests as "critical". If a critical test fails, the suite is considered failed.

Example of non-critical test filtering. First, defining two tests:

*** Test Cases ***
Say Hello Critical
	[Tags]            crit
    Log To Console    Hello Critical Donkey
    No Operation
    Comment           Hello ${bob}

Say Hello Non-Critical
	[Tags]            non-crit
    Log To Console    Hello Nice Donkey
    No Operation
    Comment           Hello ${bob}

Running them, while filtering with wildcard:

from robot.running import TestSuiteBuilder
from robot.api import ResultWriter

#https://robot-framework.readthedocs.io/en/3.0/autodoc/robot.running.html
suite = TestSuiteBuilder().build("./noncritical.robot")
result = suite.run(output="test_output.xml", noncritical="*crit")
ResultWriter("test_output.xml").write_results(report='report.html', log="log.html")

The above classifies all tests that have tags matching the pattern "*crit" as non-critical. In this case, it matches both the tags "crit" and "non-crit", which is likely a bit wrong. So the report for this actually shows 2 non-critical tests.

The same execution with a non-existent non-critical tag:

from robot.running import TestSuiteBuilder
from robot.api import ResultWriter

#https://robot-framework.readthedocs.io/en/3.0/autodoc/robot.running.html
suite = TestSuiteBuilder().build("./noncritical.robot")
#this tag does not exist in the given suite, so no critical tests should be listed in report
result = suite.run(noncritical="non")
ResultWriter(result).write_results(report='report.html', log="log.html")

This runs all tests as critical, since no test has a tag of "non". To finally fix it, the filter should be exactly "non-crit". This would not match "crit" but would match exactly "non-crit".

Filtering / Selecting Tests

There are also the options include and exclude, to include or exclude (surprise) tests with matching tags from execution.

A couple of tests with two different tags (as before):

*** Test Cases ***
Say Hello Critical
	[Tags]            crit
    Log To Console    Hello Critical Donkey
    No Operation
    Comment           Hello ${bob}

Say Hello Non-Critical
	[Tags]            non-crit
    Log To Console    Hello Nice Donkey
    No Operation
    Comment           Hello ${bob}

Run tests, include with wildcard:

from robot.running import TestSuiteBuilder
from robot.api import ResultWriter
from io import StringIO

#https://robot-framework.readthedocs.io/en/3.0/autodoc/robot.running.html
suite = TestSuiteBuilder().build("./include.robot")
stdout = StringIO()
result = suite.run(include="*crit", stdout=stdout)
ResultWriter(result).write_results(report='report.html', log="log.html")
output = stdout.getvalue()
print(output)

This includes both of the two tests defined above, since the tags match. If the filter was "non", nothing would match, and an error is produced since there are no tests to run.

Creating new Keywords from Existing Keywords

Besides using somebody else's keywords, custom keywords can be composed from existing keywords. Example test file:

*** Settings ***
Resource    simple_keywords.robot

*** Test Cases ***
Run A Google Search
    Search for      chrome    emoji wars
    Sleep           10s
    Close All Browsers

The included (by the Resource keyword above) file simple_keywords.robot:

*** Settings ***
Library  SeleniumLibrary

*** Keywords ***
Search for
    [Arguments]    ${browser_type}    ${search_string}
    Open browser    http://google.com/   ${browser_type}
    Press Keys      name:q    ${search_string}+ENTER

So the keyword is defined above in a separate file, with arguments defined using the [Arguments] notation. Followed by the argument names. Which are then referenced in following keywords, Open Browser and Press Keys, imported from SeleniumLibrary. Simple enough.

Selenium Basics on RF

Due to the popularity of Selenium WebDriver and testing of web applications, there is a specific RF library with keywords built for it. This was installed way up in the Installing section.

Basic example:

*** Settings ***
Library  SeleniumLibrary

*** Test Cases ***
Run A Google Search
    Open browser    http://google.com/   Chrome
    Press Keys      name:q    emoji wars+ENTER
    Sleep           10s
    Close All Browsers

Run it as always before:

from robot.running import TestSuiteBuilder
import robot

#https://robot-framework.readthedocs.io/en/3.0/autodoc/robot.running.html
suite = TestSuiteBuilder().build("./google_search.robot")
result = suite.run()

This should open up the Chrome browser, load Google on it, do a basic search, and close the browser windows. Assuming it finds the Chrome driver also listed in the Installing section.

Creating New Keywords in Python

Besides building keywords as composites of existing ones, building new ones with Python code is an option.

Example test file:

*** Settings ***
Library         google_search_lib.py    chrome

*** Test Cases ***
Run A Google Search
    Search for      emoji wars
    Sleep           10s
    Close

The above references google_search_lib.py, where the implementation is:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class google_search_lib(object):
    driver = None

    @classmethod
    def get_driver(cls, browser):
        if cls.driver is not None:
            return cls.driver
        if (browser.lower()) == "chrome":
            cls.driver = webdriver.Chrome("../chromedriver")
        return cls.driver

    def __init__(self, browser):
        print("creating..")
        driver = google_search_lib.get_driver(browser)
        self.driver = driver
        self.wait = WebDriverWait(driver, 10)

    def search_for(self, term):
        print("open")
        self.driver.get("http://google.com/")
        search_box = self.driver.find_element_by_name("q")
        search_box.send_keys(term)
        search_box.send_keys(Keys.RETURN)

    def close(self):
        self.driver.quit()

Defining the library import name is a bit tricky. If the module and the class have the same name (as here), just the one name is needed.

Again, running it as before:

from robot.running import TestSuiteBuilder

suite = TestSuiteBuilder().build("./google_search.robot")
result = suite.run()

If you think about this for a moment, there is some strange magic here. Why is the classmethod there? How is state managed within tests / suites? I borrowed the initial code for this example from this fine tutorial. It does not discuss the use of this annotation, but it seems to me that it is used to share the driver object during test execution.

Mapping Python Functions to Keywords

The mapping simply takes the Python function name and replaces underscores with spaces. So in the above google_search_lib.py example, the Search For keyword maps to the search_for() function, and the Close keyword maps to the close() function. Much complex, eh?

Test Setup and Teardown

Test setup and teardown are basic test framework functionality. In RF they are supported by specific settings in the Settings section.

Example test file:

*** Settings ***
Library         google_search_lib.py    chrome
Test Setup      Log To Console    Starting a test...
Test Teardown   Close

*** Test Cases ***
Run A Google Search
    Search for      emoji wars
    Sleep           10s

The referenced google_search_lib.py file is the same as above. This includes defining the close function / keyword used in Test Teardown.

Run it as usual:

from robot.running import TestSuiteBuilder

suite = TestSuiteBuilder().build("./google_search.robot")
result = suite.run()

You can only define a single keyword for the setup, and a single one for the teardown. The RF docs suggest writing your own custom keyword, composing multiple actions as needed.

The way the library class is defined and created also impacts the scope of the library. It seems to get a bit tricky to manage the resources, since the instances may be different in the setup, teardown, and individual tests, or shared across all tests. I think this is one of the reasons for using the classmethod annotation in the tutorial example I cited.

There would be much more to cover, such as variables in tests. And RF also supports the BDD (Gherkin) syntax in addition to the keyword style I showed here. But the underlying framework is quite the same in both cases.

Anyway, that’s all I am writing about today. I find RF quite straightforward once you get the idea, and not too complex to use even with the docs not being so straightforward. Overall, a very simple concept, and I guess one that the author(s) have managed to build a reasonable community around. Which I guess is what makes it useful and potentially successful.

I personally prefer writing software over putting keywords after one another, but for writing tests I guess this is one useful method. And maybe there is an art in itself to writing good, suitably abstracted, reusable yet concrete keywords?

That’s all, folks,…

A Look into AWS Elastic Container Service

Intro

Recently, I got myself the AWS Certified Solutions Architect Associate certificate. To start getting more familiar with it, I did the excellent Cloud Guru class on the certificate preparation on Udemy. One part that was completely missing from that preparatory course was ECS. Yet questions related to it came up in the exam. Questions on ECS and Fargate, among others. I thought maybe Fargate is something from Star Trek. Enter the Q continuum? But no.

Later on, I also went through the Backspace preparatory course on Udemy, which briefly touches on ECS, but does not really give any in-depth understanding. Maybe the certificate does not require it, but I wanted to learn it to understand the practical options on working with AWS. So I went on to explore.. and here it is.

Elastic Container Service (ECS) is an AWS service for hosting and running Docker images and containers.

ECS Architecture

The following image illustrates the high-level architecture, components, and their relations in ECS (as I see it):

ECS High-Level Architecture

The main components in this:

  • Elastic Container Service (ECS): The overarching service name that is composed of the other (following) elements.
  • Elastic Container Registry (ECR): basically handles the role of private Docker Hub. Hosts Docker images (=templates for what a container runs).
  • Docker Hub: The general Docker Hub on the internet. You can of course use standard Docker images and templates on AWS ECS as well.
  • Docker/task runner: The hosts running the Docker containers. Fargate or EC2 runner.
  • Docker image builder: Docker images are built from specifications given in a Dockerfile. The images can then be run in a Docker container. So if you want to use your own images, you need to build them first, using either AWS EC2 instances or your own computers. Upload the built images to ECR or Docker Hub. I call the machine used to do the build here "Docker Image Builder", even if it is not an official term.
  • Event Sources: Triggers to start some task running in an ECS Docker container. ELB, Cloudwatch and S3 are just some examples here, I have not gone too deep into all the possibilities.
  • Elastic Load Balancer (ELB): To route incoming traffic to different container instances/tasks in your ECS configuration. So while ELB can "start tasks", it can also direct traffic to running tasks.
  • Scheduled tasks: Besides CloudWatch events, ECS tasks may be manually started or scheduled over time.

Above is of course a simplified description. But it should capture the high level idea.

Fargate: Serverless ECS

Fargate is the "serverless" ECS version. This just means the Docker containers are deployed on hosts fully managed by AWS. It reduces the maintenance overhead on the developer/AWS customer, as the EC2 management for the containers is automated. The main difference is that there is no need to define the exact EC2 (host) instance types to run the container(s). This seems like simply a positive thing to me. Otherwise I would need to try to calculate my task resource definitions vs allocated containers, etc. So without Fargate, I need to manage the allocated vs required resources for the Docker containers manually. Seems complicated.

Elastic Container Registry / ECR

ECR is the AWS integrated, hosted, and managed container registry for Docker images. You build your images, upload them to ECR, and these are then available to ECS. Of course, you can also use Docker Hub or any other Docker registry (that you can connect to), but if you run your service on AWS and want to use private container images, ECR just makes sense.

When a new Docker container is needed to perform a task, the AWS ECS infrastructure can then pull the associated "container images" from this registry and deploy them in ECS host instances. The hosts being EC2 instances with the ECS-agent running. The EC2 instances managed either by you (EC2 ECS host type) or by AWS (Fargate).

Since hosting custom images with your own code likely includes some IPR you don’t want to share with everyone, ECR is encrypted, as well as all communication with it. There are also ECR VPC Endpoints available to further secure the access to the ECR and to reduce the communication latencies, removing public Internet roundtrips.

As for availability and reliability, I did not directly find good comments on this, except that the container images and ECR instances are region-specific. While AWS advertises ECR as reliable and scalable and all that, I guess this means they must simply be replicated within the region.

Besides being region-specific, there are also some limitations on the ECR service. But these are in the order of a max of 10000 repositories per region, each with a max of 10000 images. And up to 20 docker pull type requests per second, bursting up to 200 per second. I don’t see myself going over those limits, pretty much ever. With some proper architecting, I do not see these limits generally becoming a problem. But I am not running Netflix on it, so maybe someone else has it bigger.

ECS Docker Hosting Components

The following image, inspired by a Medium post (thanks!), illustrates the actual Docker related components in ECS:

ECS Docker Components

  • Cluster: A group of ECS container instances (for EC2 mode), or a "logical grouping of tasks" (Fargate).
  • Container instance: An EC2 instance running the ECS-agent (a Go program, similar to Docker daemon/agent).
  • Service: This defines what your Docker tasks are supposed to do. It defines the configuration, such as the Task Definition to run the service, the number of task instances to create from the definition, and the scheduling policy. I see this as a service per task, but defining also how multiple instances of the tasks work together to implement a "service", and their related overall configuration.
  • Task Definition: Defines the docker image, resources (CPU, memory), instance type (micro, nano, macro, …), IAM roles, image boot command, …
  • Task Instance: An instantiation of a task definition. Like docker run on your own host, but for ECS (a minimal boto3 sketch follows this list).
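
Here is the minimal boto3 sketch mentioned above, registering a Fargate task definition and starting one task instance from it. All the names, ARNs, subnets, and security groups are placeholders, and pulling private images from ECR in practice also requires an execution role:

import boto3

ecs = boto3.client("ecs", region_name="eu-central-1")

# Task Definition: image, resources, network mode, ...
ecs.register_task_definition(
    family="hello-task",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[{
        "name": "hello",
        "image": "123456789012.dkr.ecr.eu-central-1.amazonaws.com/hello:latest",  # placeholder
        "essential": True,
    }],
)

# Task Instance: run one task from the definition on Fargate
ecs.run_task(
    cluster="my-cluster",                      # placeholder cluster name
    taskDefinition="hello-task",
    launchType="FARGATE",
    count=1,
    networkConfiguration={"awsvpcConfiguration": {
        "subnets": ["subnet-0123456789abcdef0"],          # placeholder
        "securityGroups": ["sg-0123456789abcdef0"],       # placeholder
        "assignPublicIp": "ENABLED",
    }},
)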

Elastic Load Balancer / ELB with ECS

The basic function of a load balancer is to spread the load for an ECS service across its multiple tasks running on different host instances. Similar to "traditional" EC2 scaling based on monitored ELB target health and status metrics, scaling can also be triggered on ECS. It is simply based on ECS tasks instead of pure EC2 instances as in a traditional setting.

As noted higher above, an Elastic Load Balancer (ELB) can be used to manage the "dynamics" of the containers coming and going. Unlike in a traditional AWS load balancer setting, with ECS, I do not register the containers to the ELB as targets myself. Instead, the ECS system registers the deployed containers as targets to the ELB target group as the container instances are created. The following image illustrates the process:

ELB with ECS

The following points illustrate this process (a minimal boto3 sketch of attaching the ELB in the service definition follows the list):

  • ELB performs healthchecks on the containers with a given configuration (e.g., HTTP request on a path). If the health check fails (e.g., HTTP server does not respond), it terminates the associated ECS task and starts another one (according to the defined ECS scaling policy)
  • Additionally there are also ECS internal healthchecks for similar purposes, but configured directly on the (ECS) containers.
  • Metrics such as Cloudwatch monitoring ECS service/task CPU loads can be used to trigger autoscaling, to deploy new tasks for a service (up-scaling) or remove excess tasks (down-scaling).
  • As requests come in, they are forwarded to the associated ECS tasks, and the set of tasks may be scaled according to the defined service scaling policy.
  • When a new task / container instance is spawned, it registers itself to the ELB target group. The ELB configuration is given in the service definition to enable this.
  • Additionally, there can be other tasks not associated to the ELB, such as scheduled tasks, constantly running tasks, tasks triggered by Cloudwatch events or other sources (e.g., your own code on AWS), …
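
Here is the minimal boto3 sketch mentioned above: the ELB attachment is given as part of the service definition, and new task instances then get registered to the target group. The target group ARN, names, and network settings are placeholders:

import boto3

ecs = boto3.client("ecs", region_name="eu-central-1")

ecs.create_service(
    cluster="my-cluster",                       # placeholder
    serviceName="hello-service",
    taskDefinition="hello-task",
    desiredCount=2,
    launchType="FARGATE",
    # new task instances get registered as targets of this target group
    loadBalancers=[{
        "targetGroupArn": "arn:aws:elasticloadbalancing:eu-central-1:123456789012:targetgroup/hello-tg/abc123",  # placeholder
        "containerName": "hello",
        "containerPort": 80,
    }],
    networkConfiguration={"awsvpcConfiguration": {
        "subnets": ["subnet-0123456789abcdef0"],       # placeholder
        "securityGroups": ["sg-0123456789abcdef0"],    # placeholder
    }},
)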

Few points that are still unclear for me:

  • An ELB target group can be set to either instance or IP target type. I experimented with simple configurations but had the instance type set. Yet the documentation states that with the awsvpc network type I should use the IP based ELB configuration. But it still seemed to work when I used the instance type. Perhaps I would see more effect with larger configurations..
  • How the ECS tasks, container instances, and ELBs actually relate to each other. Does the ELB actually monitor the tasks or the container instances? Does the ELB instance vs IP type impact this? Should it monitor tasks but I set it to monitor instances, and it worked simply because I was just running a single task on a single instance? No idea..

Security Groups

As with the other services, such as Lambda in my previous post, to be able to route the traffic from the ELB to the actual Docker containers running your code, the security groups need to be configured to allow this. This would look something like this:

ELB ECS SG

Here, the ELB is allowed to accept connections from the internet, and to make connections to the security group for the containers. The security groups are:

  • SG1: Assigned to the ELB. Allows traffic in from the internet. Because it is a security group (not a network access control list), traffic is also allowed back out if allowed in.
  • SG2: Assigned to the ECS EC2 instances. Allows traffic in from SG1. And back out, as usual..

Final Thoughts

I found ECS to be reasonably simple, providing useful services to simplify the management of Docker images and containers. However, Lambda functions seem a lot simpler still, and I would generally use those (as the trend seems to be..). Still, I guess there are still plenty of use cases for ECS as well: for those with investments in, or otherwise preferences for, containers, and for longer-running tasks, or tasks otherwise less suited for short-invocation Lambdas.

As the AWS ECS-agent is just an open-source program written in Golang and hosted on Github, it seems to me that it should be possible to host ECS agents anywhere I like. Just as long as they could connect to the ECS services. How well that would work from outside the core AWS infrastructure, I am not sure. But why not? Have not tried it, but perhaps..

Looking at ECS and Docker in general, Lambda functions seem like a clear evolution path from this. Docker images are composed from Dockerfiles, which build the images from [stacked layers](https://blog.risingstack.com/operating-system-containers-vs-application-containers/), which are sort of build commands. The final layer is the built "product" of those layered commands. Each layer in a Dockerfile builds on top of the previous layer.

Lambda functions similarly have a feature called Lambda Layers, which can be used to provide a base stack for the Lambda function to execute on. They seem a bit different in defining sets of underlying libraries and how those stack on top of each other. But the core concept seems very similar to me. Finally, the core of a Lambda function is the function that executes when a triggering event arrives, similar to the docker run command in ECS. Much similar, such function.

The main difference for Lambda vs ECS perhaps seems to be in requiring even less infrastructure management from me when using Lambda (vs ECS). The use of Lambda was illustrated in my earlier post.

S3 Policies and Access Control

Learning about S3 policies.

AWS S3 bucket access can be controlled by S3 Bucket Policies and by IAM policies. If using combinations of both, there is an AWS blog post showing how the permissions get evaluated (a small sketch of this logic follows the list):

  • If there is an explicit deny in Bucket Policies or IAM Policies, access is denied. Even if another rule would allow it, a single deny trumps any allows that would also match.
  • If there is an explicit allow in Bucket Policies or IAM Policies, and no deny, access is allowed.
  • If there is no explicit rule matching the request/requested, the access is denied.
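
Here is the small sketch of this evaluation logic mentioned above, using a heavily simplified statement format of my own (the real policy JSON has the Principal/Action/Resource structure shown below):

from fnmatch import fnmatch

def matches(statement, request):
    # Very simplified matching: exact principal, glob-style action and resource patterns.
    return (request["principal"] in statement["Principal"]
            and any(fnmatch(request["action"], a) for a in statement["Action"])
            and any(fnmatch(request["resource"], r) for r in statement["Resource"]))

def evaluate(statements, request):
    decision = "Deny"                      # 3. no matching rule -> implicit deny
    for s in statements:
        if matches(s, request):
            if s["Effect"] == "Deny":
                return "Deny"              # 1. an explicit deny always wins
            decision = "Allow"             # 2. explicit allow, unless some deny also matches
    return decision

statements = [{
    "Effect": "Allow",
    "Principal": ["arn:aws:iam::111122223333:user/Alice"],
    "Action": ["s3:*"],
    "Resource": ["arn:aws:s3:::my_bucket", "arn:aws:s3:::my_bucket/*"],
}]
print(evaluate(statements, {
    "principal": "arn:aws:iam::111122223333:user/Alice",
    "action": "s3:GetObject",
    "resource": "arn:aws:s3:::my_bucket/file.txt",
}))  # -> Allow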

An example JSON bucket policy is given as:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": ["arn:aws:iam::111122223333:user/Alice",
                "arn:aws:iam::111122223333:root"]
      },
      "Action": "s3:*",
      "Resource": ["arn:aws:s3:::my_bucket",
                   "arn:aws:s3:::my_bucket/*"]
    }
  ]
}

Policy Elements

The elements of the policy language, as used in above example:

  • Version is the date when a version was defined:

    • "2012-10-17" is practically version 2 of the policy language.
    • "2008-10-17" is version 1, and the default if no version is defined.
    • The newer version allows use of policy variables and likely some other minor features.
    • AWS recommends to always use the latest version definition.
  • Statement is cunningly named, since (as far as I understand), there can only be one such element in the policy. But it contains "multiple statements", each defining some permission rule.

  • Effect: Deny or Allow

  • Principal: Who is affected by this statement.

  • NotPrincipal: Can be used to define who should NOT be affected by this rule. For example, apply to all but the root account by using this to exclude root.

  • Action: The set of operations (actions) that this affects. For example "s3:*" means the rule should affect (deny/allow) all S3 operations.

  • NotAction: Opposite of Action, so "s3:*" here would mean the rule applies to all actions except those related to S3.

  • Resource: The resources the rule should apply to. Here it is the given bucket and all files inside the bucket.

  • NotResource: The resources the rule should not apply to. Giving a specific bucket here would mean the rule applies to all resources except that one. Or perhaps to all resources in general (not just S3, but anything else AWS considers a resource)..?

Principal

The Principal can be defined as single user, a wildcard, or a list of users. Examples:

List:

      "Principal": {
        "AWS": ["arn:aws:iam::111122223333:user/Alice",
                "arn:aws:iam::111122223333:root"]
      },

In above, user/Alice simply refers to an IAM user named Alice. This is the user name given in the IAM console when creating/editing the user. Root refers to the account root user.

Single user:

    "Principal": { "AWS": "arn:aws:iam::123456789012:root" }

The Principal docs also refer to something with a shorter notation:

    "Principal": { "AWS": "123456789012" }

I can confirm that the above style gives no errors when defining the policy, which always seems to be verified immediately by AWS. But I could not figure out what exactly this would be used for. I expected it to be a rule affecting every user of the account, but that did not seem to be the case for my account. I posted a question about it, and will get back to it if I get an answer..

Example Policies

Here follows a few example policies for practical illustration. Using three different users:

  • root user,
  • user "bob" with S3 admin role, and
  • "randomuser" with no S3 access defined.

These use an example account with ID 123456789876, and a bucket named "policy-testing-bucket".

First misconception

I was thinking this one was needed to give Bob and the root account access to the bucket and all files inside it, including all the S3 operations:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::123456789876:user/bob",
                    "arn:aws:iam::123456789876:root"
                ]
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::policy-testing-bucket",
                "arn:aws:s3:::policy-testing-bucket/*"
            ]
        }
    ]
}

The above seemed to work in allowing Bob access. But not really, as it turns out.

Only allow root by bucket policy

The following policy should stop Bob from accessing the bucket and files, since by default all access should be denied, and this one does not explicitly allow Bob. And yet Bob can still access all the files.. More on that follows. The policy in question, which does not explicitly allow Bob:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::123456789876:root"
                ]
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::policy-testing-bucket",
                "arn:aws:s3:::policy-testing-bucket/*"
            ]
        }
    ]
}

Denying Bob

Since the above did not stop Bob from accessing the files, how about the following to explicitly deny him:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::123456789876:root"
                ]
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::policy-testing-bucket",
                "arn:aws:s3:::policy-testing-bucket/*"
            ]
        },
        {
            "Effect": "Deny",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::123456789876:user/bob"
                ]
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::policy-testing-bucket",
                "arn:aws:s3:::policy-testing-bucket/*"
            ]
        }
    ]
}

The above works to stop Bob from accessing the bucket and files, as a Deny rule always trumps an Allow rule, regardless of their ordering.

IAM roles and their interplay with bucket policies

I realized the above is all because I put Bob in a "developer" group I had created, and this group has a general policy as:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}

This allows anyone in this group (including Bob) to access any resource with any action, including resources in S3 and all S3 actions. So the explicit Allow rule higher above is not needed; Bob already has access via his IAM group policy. This is why his access works even if I remove him from the bucket policy. And if an explicit Deny is defined, as also higher above, then the Deny trumps the IAM group Allow and blocks the access.

User with no IAM permissions

Allow listing bucket content and opening files

Since Bob was in the developer group and thus had access granted via his IAM group policy, I also tried with RandomUser, who has no IAM permissions defined and should be directly impacted by the S3 bucket policy. This policy gives RandomUser access to list the bucket contents and open files in it, but not to see the list of buckets, or this bucket in that list:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789876:user/arandomuser"
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::policy-testing-bucket",
                "arn:aws:s3:::policy-testing-bucket/*"
            ]
        }
    ]
}

Block filelist view, allow file access

This stops RandomUser from seeing the file list of the bucket, but allows opening specific files in it:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789876:user/arandomuser"
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::policy-testing-bucket/*"
            ]
        }
    ]
}

Allow filelist, block file opening

This lets RandomUser see the bucket filelist but prevents opening files in it:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789876:user/arandomuser"
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::policy-testing-bucket"
            ]
        }
    ]
}

If not allowed by IAM or Bucket Policy, denied by default

If RandomUser is not explicitly allowed in the bucket policy, he gets access denied:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789876:root"
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::policy-testing-bucket"
            ]
        }
    ]
}
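To double-check this kind of behaviour outside the console, a small boto3 script run with the user's own credentials also works. A sketch, assuming a local credentials profile named "arandomuser" for that IAM user and a made-up object key:

import boto3
from botocore.exceptions import ClientError

# Sketch: check what RandomUser can actually do against the test bucket.
# The "arandomuser" profile is a local credentials profile for that IAM user;
# the object key is made up.
session = boto3.Session(profile_name="arandomuser")
s3 = session.client("s3")
bucket = "policy-testing-bucket"

def attempt(description, call):
    try:
        call()
        print(f"{description}: allowed")
    except ClientError as e:
        print(f"{description}: {e.response['Error']['Code']}")

# Listing the bucket needs an Allow on the bucket ARN itself.
attempt("list bucket", lambda: s3.list_objects_v2(Bucket=bucket))
# Opening an object needs an Allow on the bucket/* resource.
attempt("get object", lambda: s3.get_object(Bucket=bucket, Key="some-file.csv"))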

Adventures in AWS: Scraping with Lambdas

Amazon Web Services (AWS) are increasingly popular in the IT business and job market. To learn more about them, I took one of the AWS certification courses from A Cloud Guru on Udemy for 10 euros. Much expensive, but investment. There are no certification exam locations where I live, but maybe some day I will have a chance to try it. In the meantime, learning is a possibility regardless. The course was very nice, including some practice labs. I could probably pass the test with just that, but doing a small experiment/exercise of my own tends to make things a bit more clear and gives some concrete experience.

So I built myself a small "service" using AWS. It scrapes chat logs from a public internet chat service, collects them into a database, and would eventually provide some service based on a machine-learning model trained on the collected data. I created a Discord test server and tried scraping that to see that it works, and requested a Twitter development account to try that as well later. In this post I describe the initial data collection part, which is plenty enough for one post. Will see about the machine learning and API service on top of it later. Maybe an interactive Twitter bot or something.

Architecture

The high-level architecture and the different AWS services I used is shown in the following figure:

high-level architecture

The components in this figure are:

  • VPC: The main container for providing a "virtual private cloud" for me to do my stuff inside AWS.
  • AZ: A VPC is hosted in a region (e.g., eu-central, also known as Frankfurt in AWS), which has multiple Availability Zones (AZ). These are "distinct locations engineered to be isolated from failures in other AZs". Think fire burning a data center down or something.
  • Subnets split a VPC into separate parts for more fine-grained control. Typically one subnet is in one AZ for better resiliency and resource distribution.
  • Private vs public subnet: A public subnet has an internet gateway attached, so you can give instances in it public IP addresses, access the internet from within it, and allow incoming connections. A private subnet has none of that.
  • RDS: MariaDB in this case. RDS is the Relational Database Service, a relational database provided by AWS as a managed service.
  • S3 Endpoint: Provides direct link from the subnet to S3. Otherwise S3 access would be routed through internet. S3 is Simple Storage Service, AWS file object store.
  • Internet gateway: provides a route to the internet. Otherwise nothing in the subnet can access the internet outside VPC.
  • EC2 instance: Plain virtual machine. I used it to access the RDS with MariaDB command line tools, from inside the VPC.
  • Lambda Functions: AWS "serverless" compute components. You upload the code to AWS, which deploys, runs, and scales it based on given trigger events as needed.
  • Scraper Lambda: Does the actual scraping. Runs in the public subnet to be able to access the internet. Inserts the scraped data into S3 as a single file object once per day (or defined interval).
  • Timestamp Lambda: Reads the timestamps of latest scraped comments per server and chat channel, so Scraper Lambda knows what to scrape.
  • DB Insert Lambda: Reads the scraper results from S3, inserts them into the RDS.
  • S3 chat logs: S3 bucket to store the scraped chat logs. As CSV files (objects in S3 terms).

In the above architecture I have the Scraper Lambda outside the VPC, and the other two Lambda inside the VPC. A Lambda inside the VPC can be configured to have access to the resources within the VPC, such as the RDS database. But if I want to access the Internet from an in-VPC Lambda, I need to add a NAT-Gateway. A Lambda outside the VPC, such as the Scraper Lambda here, has access to the Internet, so it needs no specific configuration for that. But being outside the VPC, it does not have access to the in-VPC RDS, so it needs to communicate with the in-VPC Lambda functions for that.

The dashed arrows simply show possible communications. The private subnets have no route to the internet but can communicate with the other subnets. This can be further constrained by various security configurations that I look at later.

NAT-Gateway

Another option would be to use a NAT-GW (NAT-Gateway) to put the Scraper Lambda also inside the VPC, as illustrated by this architecture:

NAT-GW architecture

A NAT-GW provides access to the internet from within a private subnet by using Network Address Translation (NAT). So it routes traffic from private subnets/private network interfaces through the Internet Gateway (via a public subnet). It does not provide a way to access the private subnet from the outside, but that would not be required here. This is illustrated by the internet connection arrows in the figure, where the private subnets would pass through the NAT gateway, which would pass the traffic out through the internet gateway. But there is no arrow in the other direction, as there is no way to provide a connection interface to the private subnets from the internet in this.

A NAT-GW here could either complicate or simplify things. With a NAT-GW, I could combine the Timestamp Lambda with the Scraper Lambda, and just have the Scraper Lambda read the timestamps directly from the RDS itself. This is illustrated in the architecture diagram above.

In detail, there are two ways a Lambda can be invoked by another Lambda: synchronously and asynchronously. In a synchronous call, the calling Lambda waits for the result before proceeding. An asynchronous call just starts the other Lambda in parallel, with no further connection between the two. The Scraper->Timestamp call is synchronous, as the Scraper requires the timestamp information to proceed. This is the only use for the Timestamp Lambda in this architecture, so, if possible, the two could be combined.
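In boto3 terms, the difference between the two is just the InvocationType parameter. A sketch, assuming the Lambdas are deployed under these made-up function names:

import json
import boto3

# Sketch of the two invocation styles. The function names are made-up names
# for the Timestamp and DB Insert Lambdas from the architecture above.
lam = boto3.client("lambda")

# Synchronous: wait for the Timestamp Lambda's result before scraping.
response = lam.invoke(
    FunctionName="timestamp-lambda",
    InvocationType="RequestResponse",  # synchronous
    Payload=json.dumps({}),
)
timestamps = json.loads(response["Payload"].read())

# Asynchronous: fire-and-forget the DB Insert Lambda after scraping is done.
lam.invoke(
    FunctionName="db-insert-lambda",
    InvocationType="Event",  # asynchronous
    Payload=json.dumps({}),
)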

In this option, I would still keep the DB Insert Lambda separate, as it can run asynchronously on its own, reading the latest data from S3 without any direct link to anything else. In this way, I find the use of Lambdas can also help keep the independent functions separate. For me, this commonly leads to better software design.

However, a NAT-GW is a billable service; it costs money and is not included in the free tier. OK, it starts from about 1 euro per day, depending on the bandwidth used. For a real company use case this would likely be a rather negligible cost, but for poor me and my little experiments.. And the current architecture let me try some different configurations, so it is sort of a win-win.

Service Endpoints

The following two figures illustrate the difference of data transfer when using the S3 endpoint vs not:

With S3 endpoint:

with S3 endpoint

Without the S3 endpoint:

without S3 endpoint

So how do these work? As far as I understand, both interface and gateway endpoints are based on routing and DNS tricks in the associated subnets. Again, getting into further details gets a bit complicated. For a gateway endpoint, such as the S3 endpoint, the endpoint must be added to the subnet route table to work. But what happens if you do not add it, but do have an internet connection? My guess is that it will still be possible to connect to S3, but the traffic will be routed through the internet. How AWS handles the DNS requests internally, and whether you have any visibility into the actual routes taken during operation, I don’t know.
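For reference, a sketch of creating the S3 gateway endpoint and associating it with a route table in code; the VPC ID, route table ID, and region are made-up placeholders:

import boto3

# Sketch: create an S3 gateway endpoint and associate it with a route table.
# The VPC ID, route table ID, and region are placeholders.
ec2 = boto3.client("ec2", region_name="eu-central-1")

ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.eu-central-1.s3",
    # Without this association the endpoint exists, but the subnet's S3
    # traffic is not routed through it.
    RouteTableIds=["rtb-0123456789abcdef0"],
)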

In any case, as long as using the DNS-style names to access the services, the AWS infrastructure should do the optimal routing via endpoints if available. For interface endpoints, the documents mention something called Private DNS, which seems to do a similar thing. Except it does not seem to use similar route table mappings as Gateway endpoints. I guess the approach for making use of endpoints when possible would be to use the service DNS-style names, and consistently review all route tables and other configs. As this seems like a possibly common and general problem, perhaps there are some nice tools for this but I don’t know.

It seems to me it makes a lot more sense to use such endpoint services to connect directly to the AWS services, since we are already running within AWS. In fact, it seems strange that the communication would otherwise by default take a detour through the internet (and I guess before the S3 endpoint existed, that was the only option).

Use of endpoints seems much more effective in terms of performance and bandwidth use, but also in terms of cost. Traffic routed through the internet gets billed separately in AWS, whereas gateway endpoint traffic stays within AWS and is thus not separately billed. Meaning, in my understanding, that the S3 endpoint is "free". So why would you ever not use it? No idea.

This is just the S3 endpoint. AWS has similar endpoints for most (if not all) its services. The S3 endpoint is a "gateway endpoint", and one other service that currently supports this is DynamoDB. Other services have what is called an "interface endpoint". Which seems to be a part of something called PrivateLink. Whatever that means.. These cost money, both for hourly use and bandwidth.

With a quick Internet search, I found the Cloudonaut page to be a bit more clear on the pricing. But I guess you never know with 3rd party sites if they are up to date to latest changes. Would be nice if Amazon would provide some nice and simple way to see pricing for all, now I find it a bit confusing to figure out.

Lambda Triggers

The AWS Lambda functions can be triggered from multiple sources. I have used the following triggers here:

  • Scheduled time trigger from CloudWatch. Triggers the Scraper Lambda once a day (see the sketch after this list).
  • Lambda triggering another Lambda synchronously. The Scraper Lambda invokes the Timestamp Lambda to define which days to scrape (since previous timestamp). Synchronous simply means waiting for the result before progressing.
  • Lambda triggering another lambda asynchronously. Once the Scraper Lambda finishes its scraping task, it invokes the DB Insert Lambda to check S3 for new data, and insert into the RDS DB.
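A sketch of how the scheduled trigger could be wired up in code (I set mine up in the console); the rule name, function name, and ARNs are placeholders:

import boto3

# Sketch: a CloudWatch Events rule that triggers the Scraper Lambda once a
# day. All names and ARNs here are placeholders.
events = boto3.client("events")
lam = boto3.client("lambda")

rule = events.put_rule(Name="daily-scrape", ScheduleExpression="rate(1 day)")

# Allow the rule to invoke the Lambda...
lam.add_permission(
    FunctionName="scraper-lambda",
    StatementId="allow-daily-scrape",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# ...and point the rule at the Lambda.
events.put_targets(
    Rule="daily-scrape",
    Targets=[{
        "Id": "scraper",
        "Arn": "arn:aws:lambda:eu-central-1:123456789876:function:scraper-lambda",
    }],
)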

It seems a bit challenging to find a concrete list of all the possible Lambda triggers. The best way I found is to start creating a Lambda, hit the "triggers" button to add a new trigger for the new Lambda, and then just scroll the list of options. Some main examples I noted:

  • API Gateway (API-GW) events.
  • AWS IoT: AWS Button events and custom events (whatever that means..). Never tried the AWS IoT stuff.
  • Application load balancer events
  • Cloudwatch logs, events when new logs are received to a configured log group
  • Code commit: AWS provides some form of version control system support. This triggers events from Git actions (e.g., push, create branch, …).
  • Cognito: This is the AWS authentication service. This is a "sync" trigger, so I guess it gets triggered when authentication data is synced.
  • DynamoDB: DynamoDB is the AWS NoSQL database. Events can be triggered from database updates, in batches if desired. Again, I have not used it, just my interpretation of the documentation.
  • Kinesis: Kinesis is the AWS service for processing real-time timeseries type data. This seems to be able to trigger on the data stream updates, and data consumer updates.
  • S3: Events on create (includes update I guess) and delete of objects, events on restoring data from Glacier.
  • RRS object loss. RRS is reduced redundancy storage, with more likely chance that something is lost than on standard S3.
  • SNS: Triggers on events in the Simple Notification Service (SNS).
  • SQS: Updates on an event stream in simple queue service (SQS). Can also be batched.

That’s all interesting.

Security Groups and Service Policies

To make all my service instances connect, I need to define all my VPC network, service, and security configurations, etc. A security group is a way to configure security attributes (AWS describes it as a "virtual firewall"). Up to five security groups can be assigned to an instance. Each security group then defines a set of rules for what traffic to allow (as far as I understand, security group rules can only allow traffic, never explicitly deny it).

The following figure illustrates the security groups in this (my) experiment:

security groups

There are 3 security groups here:

  • SG1: The RDS group, allowing incoming connections to port 3306 from SG2. 3306 is the standard MariaDB port.
  • SG2: The RDS client group. This group can query the RDS Maria DB using SQL. Any regular MariaDB client works.
  • SG3: Public SSH access group. Instances in this group allow connections to port 22 from the internet.

This nicely illustrates the concept of the "group" in a security group. The 3 instances in SG2 all share the same rules, and are allowed to connect to the RDS instance. Or, more specifically, to instances in the SG1 group. If I add more instances, or if the IP addresses of these 3 instances change, as long as they are in the security group, the rules will match fine.

Similarly, if I add more RDS instances, I can put them in the same SG1 group, and they will share the same properties. The SG2 instances can connect to them. Finally, if I want to add more instances accessible over SSH, I just set them up and add them to the security group SG3. As shown by the EC2 instance in the figure above, a single instance can also be a member of multiple security groups at once.
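A sketch of how SG1 and SG2 could be defined in code, with the port 3306 rule referencing the client group instead of any IP addresses; the VPC ID and group names are made up:

import boto3

# Sketch: create the RDS group (SG1) and the RDS client group (SG2), and
# allow port 3306 into SG1 only from members of SG2. The VPC ID is a placeholder.
ec2 = boto3.client("ec2")
vpc_id = "vpc-0123456789abcdef0"

sg2 = ec2.create_security_group(
    GroupName="rds-clients", Description="RDS client group (SG2)", VpcId=vpc_id)
sg1 = ec2.create_security_group(
    GroupName="rds", Description="RDS group (SG1)", VpcId=vpc_id)

ec2.authorize_security_group_ingress(
    GroupId=sg1["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 3306,
        "ToPort": 3306,
        # Referencing the group instead of IP addresses is what makes the
        # rule keep matching as instances come and go.
        "UserIdGroupPairs": [{"GroupId": sg2["GroupId"]}],
    }],
)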

This seems like a nice and flexible way to manage such connections and permissions in an "elastic" cloud, which I guess is the point.

Lambda Policies

There are also two instances in my architecture figure that are not in any security group as they are not in a VPC. To belong to a security group, an instance has to be inside a VPC. The Scraper Lambda and the S3 Chat Logs bucket are outside the VPC. The connection from inside the VPC to S3 I already described earlier in this post (S3 endpoints). For the Scraper Lambda, Lambda policies are defined.

In fact, all Lambdas have such access policies defined in relation to the services they need to access, including the ones inside the VPC. The in-VPC ones just need to have the associated VPC mechanisms (security groups) enabled as well, since they also fall inside the scope of the VPC. There are some default policies, such as execution permissions for the Lambda itself, but also policies for the resources it needs to access.

These are the policies I used for each of the Lambda here:

  • Scraper Lambda:

    • Lambda Invoke: Allows this Lambda to invoke the Timestamp and DB Insert Lambdas.
    • CloudWatch Logs: Every Lambda writes their logs to AWS CloudWatch.
    • S3 put objects: Allows this Lambda to write the scraping results to the S3 Chat Logs bucket.
    • S3 list objects: Just to check the bucket so it does not overwrite existing logs if somehow run multiple times per day.
  • DB Insert Lambda:

    • CloudWatch Logs: Logging as above
    • S3 List and Get Objects: For reading new log files created by the Scraper Lambda.
    • EC2 ENI interface create, list, delete: In-VPC Lambdas work by creating Elastic Network Interfaces within the VPC so they can communicate with other in-VPC (and ext-VPC) instances. This enables that.
  • Timestamp Lambda:

    • CloudWatch Logs: Logging as above
    • EC2 ENI interfaces for in-VPC Lambda, as above.

As these show, the permissions can be defined at a very granular level or at a higher level. For example, full access to S3 and any bucket, or read access to specific files in a specific bucket, or anything in between.
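As an example of the more granular end, a sketch of an inline policy for the Scraper Lambda's execution role, limited to writing and listing the one bucket; the role, policy, and bucket names are made-up placeholders:

import json
import boto3

# Sketch: attach an inline policy to the Scraper Lambda's execution role,
# limited to put/list on the one chat-log bucket. The role, policy, and
# bucket names are placeholders.
iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:PutObject"],
         "Resource": "arn:aws:s3:::chat-logs-bucket/*"},
        {"Effect": "Allow",
         "Action": ["s3:ListBucket"],
         "Resource": "arn:aws:s3:::chat-logs-bucket"},
    ],
}

iam.put_role_policy(
    RoleName="scraper-lambda-role",
    PolicyName="scraper-s3-access",
    PolicyDocument=json.dumps(policy),
)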

Backups and Data Retention

One thing with databases is always backups. With AWS RDS, there are a few options. One is the standard backups offered by Amazon. Your RDS gets snapshotted to S3 daily. How this actually works sounds very simple but gets a bit complicated if you really try to understand it. What doesn’t…

So the documentation says "The first snapshot of a DB instance contains the data for the full DB instance" and further "Subsequent snapshots of the same DB instance are incremental, which means that only the data that has changed after your most recent snapshot is saved.".

Sounds great, doesn’t it? But think about it for a minute. You can set your backup retention period to between 0 and 35 days (currently, anyway), the default being 7. Now imagine you live all the way to the 8th day, when your first-day backup expires. Consider that only day 0 was a full backup snapshot, and day 0 expires. Does everything else then build incremental snapshots on top of a missing baseline?

Luckily, as usual, someone thought about this already and StackOverflow comes to the rescue. An RDS "instance", as referred to in the AWS documentation, must be referencing the EC2-style VM instance that hosts the RDS. So the backup is not just the data but the whole instance. And the instance is stored on Elastic Block Store (EBS). I interpret this to mean you are not really backing up the database but the EBS volume the whole RDS system is on. And then you can go read up on how AWS manages the EBS backups, which mostly confirms the StackOverflow post.

Regarding costs, if it is an "instance snapshot", does the whole instance size count towards the cost? I guess not, as you get the same amount of "free" backup storage as you allocate for your RDS. In the free tier you get up to 20GB of RDS storage included, and by definition also up to 20GB of free RDS backup snapshot space; the free backup size always matches the storage size. If this included the whole instance in the calculation, the operating system and database software would likely take many GB already. But what do I know. As for where the snapshots are stored, and can I go have a look? Again, StackOverflow to the rescue: they are in an S3 bucket, but that bucket is not in my control and I have no way to see into it.

In any case, you get your RDS size worth of backup space included, whatever the size here means (instance vs data). And if you use the default 7-day period, it means you have to fit all 7 incremental snapshots in that space if you do not wish to pay extra. The snapshots are stored as "blocks", and when an old (or the original) snapshot expires, only the blocks not referenced by any remaining snapshot get deleted. So expiring day 0 does not cause the later incremental snapshots to break; it just deletes the blocks that are no longer referenced.

Still, there is more. The AWS documentation on backup restoration mentions you can do point-in-time restoration typically up to 5 minutes before current time. If automated snapshots are only taken once per day, what is this based on? It is because AWS also uploads the RDS transaction logs to S3 every 5 minutes.

Beyond regular backups, there is the multi-AZ deployment of RDS, and read replicas. Both of those links contain a nice comparison table between the two. The multi-AZ is mainly for disaster recovery (DR), doing automated failover to another availability zone in case of issues. A read replica allows scaling reads via multiple asynchronously synced copies that can be deployed across regions and availability zones. A read-replica can also be manually promoted to master status. It all seems to get complicated, deeper than I wanted to go here. I guess you need an expert on all this to really understand what gets copied where, when, and how secure the failover is, or how secure the data is from loss in all cases.

After these hundred lines of digressing from my experiment, what did I actually use? Well, I used a single-AZ deployment of my RDS and disabled all backups completely. Nuts, eh? Not really, considering that my architecture is built to collect all the scraped data into S3, from where it is inserted into RDS by another Lambda function. So all the data is effectively already backed up in S3, in the form of the scraped files. If needed, I can just re-run the import from the beginning over all the scraped files to rebuild the RDS.
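For reference, the retention period is just one property of the DB instance, and disabling automated backups simply means setting it to zero. A sketch, with a made-up instance identifier:

import boto3

# Sketch: the backup retention period is a property of the DB instance.
# 0 disables automated backups; 1-35 keeps daily snapshots (plus the
# 5-minute transaction logs) for that many days. The identifier is made up.
rds = boto3.client("rds")

rds.modify_db_instance(
    DBInstanceIdentifier="chatlog-mariadb",
    BackupRetentionPeriod=0,  # what I did in this experiment
    ApplyImmediately=True,
)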

Given how the RDS backups and my implementation all depend so heavily on S3, it seems very relevant to understand the storage and replication, reliability, etc. of S3.

S3 Reliability and Lifecycle Costs

The expected durability for S3 objects is given as 99.999999999%. AWS claims about 1 in 10 million objects would be lost during 10000 years. Not sure how you might test this. However, this is defined to hold for all the S3 tiers, which are:

  • standard: low latency and high throughput, no minimum storage time or size, survives AZ destruction, 99.99% availability
  • standard infrequent access (IA): Same throughput and latency as standard, lower storage cost but higher retrieval cost. 99.9% availability target, 30 days minimum billing.
  • one zone IA: same as standard IA, but only in one AZ, slightly cheaper storage, same retrieval cost, 99.5% availability target, 30 days minimum billing
  • intelligent tiering: for data with changing access patterns. Automatically moves data to IA tier when not accessed for 30 days, and back to standard when accessed. 99.9% availability target, 30 days minimum billing.
  • glacier: very low storage price, higher retrieval prices (tiered on how fast you want it), 99.99% availability, 90 days minimum,
  • glacier deep archive: like glacier but slower and cheaper, 99.9% availability target, 180 days minimum.

Naturally some of these different tiers make more sense at different times of the object lifecycle. So AWS allows you to define automated transitions between them. These are called Lifecycle Rules. AWS examples are one good way to explain them.

The free tier does not seem to include much beyond some use of the S3 standard tier, but just to try this out in a bit more realistic fashion, I defined a simple lifecycle pipeline for my log files, as illustrated here:

s3 lifecycle

I did not actually implement the final Glacier transition, as it has such a long minimum storage time and I want to be able to terminate my experiment in a shorter duration.

It is also possible to define prefix filters and tag filters to select the objects the defined rules apply to. A prefix filter can be something like "logs1/" to match all objects placed under the "logs1" folder. S3 does not actually have a real hierarchical folder structure, but naming objects like this makes it treat them as "virtual folders". So I defined such a prefix filter, just because it is nice to experiment and learn. Besides prefix filters, one can also define tag filters in the form of key/value pairs.

So my defined S3 lifecycle rules in this case, and reasoning for them:

  • Transition from standard to standard-IA after 30 days. This lets me play with the data for a few days if the import has issues; after that it should be in the RDS, and I just keep it around in the cheaper tier "just in case". Well, that and 30 days was the minimum AWS allowed me to set. A sketch of setting this up in code follows the list.
  • Filter by the prefix "logs1/", as that was the path I used. I used the path simply to give me some granular control over time, as it allows simple time-based filtering in API queries if I switch to "logs2/" after a year or so. I would need to update this transition rule then, or simply set it to the "log" prefix at that time.
  • I did not define a data expiration time. The idea would be to use this type of data for training machine learning systems, where you would want to maximize the data and enable experiments later. Not that I expect to build such real systems on all this, but in a real scenario I think this would also make sense, so I am trying to stay close to that.
  • Transition to Glacier maybe after 2 months? Just a possible thought, no real idea. But some discussions online led me to understand there is a minimum time interval before one can "Glacier" objects, similar to the 30-day minimum I hit on the S3 standard -> S3 IA transition. If it is also 30 days for Glacier, that would make it 30+30 = about 2 months.
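And here is a sketch of the above lifecycle rule in code, with a made-up bucket name and the possible Glacier transition left commented out:

import boto3

# Sketch of the lifecycle rule described above: objects under the "logs1/"
# prefix move to STANDARD_IA after 30 days, with no expiration. The bucket
# name is a placeholder.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="chat-logs-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "logs1-to-ia",
            "Filter": {"Prefix": "logs1/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                # {"Days": 60, "StorageClass": "GLACIER"},  # the "maybe later" idea
            ],
        }],
    },
)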

Random Thoughts

Initially I thought security groups were a bit weird and unnecessarily complex, compared to fiddling with your private computers and networks. But with all this, I realized the "elastic" nature of AWS actually fits this quite well. It allows the security definitions to live with the dynamic cloud via the group associations.

Related to this, Network Access Control Lists (NACL) would allow more fine-grained traffic rules at the subnet level. This is still a bit fuzzy to me, as I did not need to go into such details in my limited experiment. But in skilled hands it seems quite useful; maybe more for a security/network specialist.

Lambda functions that are not explicitly associated with a VPC are still associated with one. It is simply some kind of "special" AWS-controlled VPC. Makes me wonder a bit what all the properties of this "special" VPC are, but I guess I just have to "trust AWS" again.

Looking at the architecture I used, the only thing in the public subnet is the EC2 instance I used to run my manual SQL queries against my RDS instance. Could I just drop the EC2 instance and as a result delete the whole public subnet and the Internet Gateway? The "natural" way that comes to mind is having access through an internet-connected Lambda function. An internet search for "AWS lambda shell" gives some interesting hits.

Someone used a Lambda shell to get access into the Lambda runtime environment, and download the AWS runtime code. Another similar experiment is hosted on Github, providing full Cloudformation templates to make it run as well. Finally, someone set up a Lambda shell in a browser, providing a bounty for anyone who manages to hack their infrastructure. Interesting..

On a related note, a NAT GW always requires an IGW. So if I wanted a private subnet to have access to the internet, I would still need a public subnet, even if it was otherwise empty besides the NAT-gateway. And while the NAT-GW is advertised as autoscaling and all that, I still need a NAT GW in each AZ used. But just one per AZ, of course.

Something I got very confused about is the EC2 instance storage types. There is the instance store, which is "ephemeral", meaning it disappears when the VM is stopped or terminated. I always thought of this as a form of local hard disk, similar to how my laptop has a hard disk (or SSD…) inside. And the AWS docs actually describe it similarly, as "this storage is located on disks that are physically attached to the host computer". Not too complicated?

But what is the other main option, Elastic Block Store (EBS)? The terms "block" and "storage" or "disk" bring to my mind the traditional definition of hard disks, with sectors hosting blocks of data. But it makes no sense, as EBS is described as virtual, replicated, highly available, etc.. A basic internet search also brings up similar, rather ambiguous definitions.

Some searching later, I concluded this is referring to networked disk terminology, where block storage has its own definition: racks of disks, connected via Storage Area Network (SAN) technologies. As AWS advertises this with all kinds of fancy terminology, I guess it must be quite highly optimized; otherwise networked disks spreading data across physical hosts would seem slow to me. Probably just taking some of the well-refined products in the space and turning them into an integrated part of AWS, as an "Elastic" branded service. But such is progress, it’s not the 80s disks as it used to be, Teemu. This definition makes sense considering what it is supposed to be as well. While the details are somewhat lacking, a Quora post provided some interesting insights.

I used RDS in this experiment to store what is essentially natural language text data. This is not necessarily the best option for this type of data. Rather something like Elasticsearch and its related AWS hosted service would be a better match. I simply picked RDS here to give myself a chance to try it out as a service, and since I don’t expect to store gigabytes of text in my experiment, or run complex searches over it. SQL is quite a simple and well tried out language, so it was easy enough for me to play with it and see everything was working. However, for a more real use case I would transition to Elasticsearch.

Despite all the fancy stuff, "Elastic" services and all, some things on the AWS platform are still surprisingly rigid. I could not find a way to rename or change the descriptions of Lambda functions or Security Groups. It seems I am not the only one who has thought about either of these over time. No wonder, as it seems like rather basic functionality.

Overall, with my experience setting all this up, I used to think DevOps meant DEVelopment and OPerationS working closely together. Looking at this as well as all the trendy "infrastructure as code" lately, I am leaning more towards the side of DevOps referring to making the developers do the job of operations in addition to developing the system/service/program running on the infrastructure.. That’s just great…

Next I will look into making use of this type of collected data to build some service on AWS. Probably an API-Gateway and Lambda based service exposing a machine learning based model trained on the collected training data. In this case, I think I will look into using existing Twitter datasets to avoid considerations in all the aspects of actual data collection and use. But that would be another post, and I hope to also look into another one for how one might set up a dev/test environment for AWS style services. Later…

The code for this post is available on my related Github project. More specifically, the Lambda functions are at: