Investigating ROC/AUC

Executive Summary

ROC and AUC are terms that often come up in machine learning, in relation to evaluating models. In this post, I try to examine what ROC curves actually are, how they are calculated, what the threshold in a ROC curve means, and how changing it impacts the classification. The results show that low threshold values lead to classifying more objects into the positive category, while high thresholds lead to more negative classifications. The Titanic dataset is used here as an example to study the threshold effect. A classifier gives probabilities for people to survive.

By default, the classifier seems to use 50% probability as a threshold to classify one as a survivor (positive class) or non-survivor (negative class). But this threshold can be varied. For example, using a threshold of 1% leads to giving a survivor label to anyone who gets a 1% or higher probability to survive by the classifier. Similarly, a threshold of 95% requires people to get at least 95% probability from the classifier to be given a survivor label. This is the threshold that is varied, and the ROC curve visualizes how the change in this probability threshold impacts classification in terms of getting true positives (actual survivors predicted as survivors) vs getting false positives (non-survivors predicted as survivors).

For the loooong story, read on. I dare you. 🙂

Introduction

Area under the curve (AUC) and receiver operating characteristic (ROC) are two terms that seem to come up a lot when learning about machine learning. This is my attempt to get it. Please do let me know where it goes wrong..

The ROC is typically drawn as a curve, with the AUC being the area under it, using some figure like this (from Wikipedia):

FIGURE 1. Example ROC/AUC curve.

It uses true positive rate (TPR) and false positive rate (FPR) as the two measures to compare. A true positive (TP) is a correct positive prediction and a false positive (FP) is a wrong positive prediction.
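To make the two rates concrete, here is a minimal sketch of how they are computed from the raw counts (the function and variable names are mine, not from the notebook used later):

def rates(tp, fp, tn, fn):
    # TPR (recall): share of actual positives that were predicted positive
    tpr = tp / (tp + fn)
    # FPR: share of actual negatives that were (wrongly) predicted positive
    fpr = fp / (fp + tn)
    return tpr, fpr

# example with the Titanic counts that appear later in this post
print(rates(tp=62, fp=19, tn=156, fn=25))  # -> roughly (0.713, 0.109)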

There is an excellent introduction to the topic in the often cited An Introduction to ROC Analysis article by Tom Fawcett. It is actually a very understandable article (for the most part), unlike most academic articles I read. Brilliant. But still, I wondered, so let's see..

To see ROC/AUC in practice, instead of just reading about it all over the Internet, I decided to implement a ROC calculation myself and see if I could match it with the existing implementations. If so, that should at least give me some confidence that my understanding of the topic is correct.

Training a classifier to test ROC/AUC

To do this, I needed a dataset and a fitting of some basic classifier on that data to run my experiments. I have recently been going through some PySpark lectures on Udemy, so I could maybe learn some more big data stuff and get an interesting big data job someday. Woohoo. Anyway. The course was using the Titanic dataset, so I picked that up and wrote a simple classifier for it in Pandas/Scikit. Being a popular dataset, there is also a related Kaggle competition for it, which is always nice, allowing me to put everything together a bit faster by using the references and to focus on the ROC investigation.

The classifier I used is Logistic Regression, and the notebook is available on my GitHub (GitHub sometimes seems to have issues rendering it, but it's there; I will look into putting it on Kaggle later too). Assuming some knowledge of Python, Pandas, sklearn, and the usual stuff, the training of the model itself is really the same as usual:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=2)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

The above is of course just a cut from the middle of the notebook, where I have already pre-processed the data to build a set of features into the features variable, and taken the prediction target (survived) into the target variable. But that is quite basic ML stuff you find in every article on the topic, and in many of the “kernels” on the Kaggle page I linked. For more details, if all else fails, check my notebook linked above.

The actual training of the classifier (after the preprocessing..) is as simple as the few lines above. That's it. I then have a Logistic Regression classifier I can use to try and play with ROC/AUC. Since ROC seems to be about playing with thresholds over predicted class probabilities, I pull out the predicted classes and their probabilities for this classifier on the Titanic test set:

predictions = logreg.predict(X_test)
predicted_probabilities = logreg.predict_proba(X_test)

Now, predicted_probabilities holds the classifier's predicted probability values for each row in the test set: 0 for never going to survive, 1 for 100% sure to survive. Just predictions, of course, not actual facts. That's the point of training a classifier, to be able to predict the result for instances where you don't know the answer in advance, as you know.. 🙂

Just to see a few things about the data:

predicted_probabilities.shape

>(262, 2)

X_test.shape

>(262, 9)

This shows the test set has 262 rows (data items), there are 9 feature variables I am using, and the prediction probabilities are given in 2 columns. The classifier is a binary classifier, giving predictions for a given data instance as belonging to one of two classes: in this case, survived and not survived. A true prediction equals survival, a false prediction equals non-survival. The predicted_probabilities variable contains probabilities for false in column 0 and true in column 1. Since the probability of no survival (false) is just 1 minus the probability of survival (true), we really only need to keep one of the two columns. If it is not true, it has to be false. Right?
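A quick sanity check on that claim, assuming predicted_probabilities is the array from above:

import numpy as np

# each row of the predict_proba output should sum to 1: P(not survived) + P(survived)
print(np.allclose(predicted_probabilities.sum(axis=1), 1.0))  # expect True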

Cutting the true/false predictions to only include the true prediction, or the probability of survival (column 1, all rows):

pred_probs = predicted_probabilities[:, 1]

Drawing some ROC curves

With this, getting the ROC curve from sklearn is as simple as:

[fpr, tpr, thr] = roc_curve(y_test, pred_probs)
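For reference, this assumes the usual imports are already in scope in the notebook, roughly like this (these are the sklearn metrics and plotting imports used in the snippets below):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (roc_curve, auc, accuracy_score, log_loss,
                             confusion_matrix, precision_score)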

The Kaggle notebook I used as a reference visualizes this, and also draws two fancy-looking dashed blue lines to illustrate a point on the ROC curve where 95% of the surviving passengers are identified. This point is identified by:

# index of the first threshold for which the TPR > 0.95
idx = np.min(np.where(tpr > 0.95))

So it is looking for the minimum index in the TPR list, where the TPR is above 95%. This results in a ROC curve drawing as:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr, tpr, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot([0, fpr[idx]], [tpr[idx], tpr[idx]], 'b--')
plt.plot([fpr[idx], fpr[idx]], [0, tpr[idx]], 'b--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()

The resulting figure is:

FIGURE 2. The ROC curve I nicked from Kaggle.

How to read this? In the bottom left we have zero FPR and zero TPR. This means everything is predicted as false (no survivors). This corresponds to a threshold of 100% (1.0), meaning you classify everything as false because no-one gets over 100% probability to survive. In the top right corner both TPR and FPR are 1.0, meaning you classify everyone as a survivor and no-one as a non-survivor. So false positives are maximized, as are true positives, since everyone is predicted to survive. This is the result of a threshold of 0%, as everyone gets a probability higher than 0% to survive. The blue lines indicate a point where over 95% of true positives are identified (the TPR value), which simultaneously leads to getting about 66% FPR. Of course, the probability threshold is not directly visible in this figure but has to be looked up separately, as we shall see in all the text I wrote below.
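The threshold at that blue-line point can be looked up from the third value returned by roc_curve (the thr array from above), using the same idx:

# probability threshold, FPR and TPR at the point marked by the blue dashed lines
print(thr[idx], fpr[idx], tpr[idx])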

To see the accuracy and AUC calculated for the base (logistic regression) classifier, and the AUC calculated for the ROC in the figure above:

print(logreg.__class__.__name__+" accuracy is %2.3f" % accuracy_score(y_test, predictions))
print(logreg.__class__.__name__+" log_loss is %2.3f" % log_loss(y_test, pred_probs))
print(logreg.__class__.__name__+" auc is %2.3f" % auc(fpr, tpr))

>LogisticRegression accuracy is 0.832
>LogisticRegression log_loss is 0.419
>LogisticRegression auc is 0.877

And some more statistics from sklearn to compare against when trying to understand it all:

confusion_matrix(y_test, predictions)

>array([[156,  19],
>       [ 25,  62]])

precision_score(y_test, predictions)

>0.7654320987654321

accuracy_score(y_test, predictions)

>0.8320610687022901

In the confusion matrix, there are 156 correct non-survivor predictions (true negatives) and 19 non-survivors incorrectly predicted as survivors (false positives), plus 62 correct survivor predictions (true positives) and 25 survivors incorrectly predicted as non-survivors (false negatives). The precision of the classifier is given as 0.7654320987654321 and the accuracy as 0.8320610687022901.
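To map the sklearn confusion matrix layout onto these four counts explicitly (rows are actual classes, columns are predicted classes), something like this should work:

# sklearn flattens the 2x2 confusion matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print(tn, fp, fn, tp)  # expect 156, 19, 25, 62 for this split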

To evaluate my ROC/AUC calculations, I first need to try to understand how the algorithms calculate the metrics, and what are the parameters being passed around. I assume the classification algorithms would use 0.5 (or 50%) as a default threshold to classify a prediction with a probability of 0.5 or more as 1 (predicted true/survives) and anything less than 0.5 as 0 (predicted non-survivor/false). So let’s try to verify that. First, to get the predictions with probability values >= 0.5 (true/survive):

#ss is my survivor-set
ss1 = np.where(pred_probs >= 0.5)
print(ss1)

>(array([  0,   1,   3,   4,   8,  13,  16,  26,  27,  30,  32,  42,  43,
>         46,  49,  50,  57,  59,  66,  67,  76,  77,  79,  83,  84,  85,
>         86,  92,  93,  98, 101, 109, 111, 113, 117, 121, 122, 125, 129,
>        130, 133, 137, 140, 143, 145, 147, 148, 149, 150, 152, 158, 170,
>        175, 176, 177, 180, 182, 183, 184, 186, 201, 202, 204, 205, 206,
>        211, 213, 215, 218, 227, 230, 234, 240, 241, 242, 246, 248, 249,
>        250, 252, 256]),)

len(ss1[0])

>81

The code above filters the predicted probabilities to get a list of values where the probability is higher than or equal to 0.5. At first look, I don't know what the numbers in the array are. However, my guess is that they are indices into the X_test array that was passed as the features to predict from. So the data points at X_test indices 0, 1, 3, 4, 8, … are predicted as true (survivors). And here we have 81 such predictions. Those are the survivor predictions; how about the non-survivor predictions?

Using an opposite filter:

ss2 = np.where(pred_probs < 0.5)
len(ss2[0])

>181

Overall, 81 predicted survivors, 181 predicted casualties. Since y_test here has known labels (to test against), we can check how many real survivors there are, and what the overall population is:

y_test.value_counts()

>0    175
>1     87

So the number of actual survivors in the test set is 87, vs 175 non-survivors. To see how many real survivors (true positives) there are among the 81 predicted survivors:

sum(y_test.values[i] for i in ss1[0])

>62

The above is just a fancy Python generator expression (or whatever that is called) summing up the y_test labels at the predicted survivor indices; since survivors are labeled 1, the sum is the count of true positives. Basically it says that out of the 81 predicted survivors, 62 were actual survivors. This matches the confusion matrix from above, so it seems I got that correct. Same for non-survivors:

sum(1 for i in ss2[0] if y_test.values[i] == 0)

>156

So, 156 out of the 181 predicted non-survivors actually did not make it. Again, matches the confusion matrix.

Now, my assumption was that the classifier uses the threshold of 0.5 by default. How can I use these results to check if this is true? To do this, I try to match the sklearn accuracy score calculations using the above metrics. Total correct classifications from the above (true positives+true negatives) is 156+62. Total number of items to predict is equal to the test set size, so 262. The accuracy from this is:

(156+62)/262

>0.8320610687022901

This is a perfect match with the accuracy_score calculated above, so I conclude that I understood the default behaviour of the classifier correctly. Now to use that understanding to see if I got the ROC curve right. In FIGURE 2 above, I showed the sklearn-generated ROC curve for this dataset and this classifier. Now I need to build my own from scratch to see if I get the same result.

Thresholding 1-99% in 1% increments

To build a ROC curve myself using my own threshold calculations:

tpr2 = []
fpr2 = []
#do the calculations for thresholds of 1-99 percent, in increments of 1 percentage point
for threshold in range(1, 100, 1):
    ss = np.where(pred_probs >= (threshold/100))
    count_tp = 0
    count_fp = 0
    for i in ss[0]:
        if y_test.values[i] == 1:
            count_tp += 1
        else:
            count_fp += 1
    tpr2.append(count_tp/87.0)   #87 = number of actual survivors in y_test
    fpr2.append(count_fp/175.0)  #175 = number of actual non-survivors in y_test
    print("threshold="+str(threshold)+", count_tp="+str(count_tp)+", count_fp="+str(count_fp))

The above code takes as input the probabilities (pred_probs) given by the Logistic Regression classifier for survivors. It then tries the threshold values from 1 to 99% in 1% increments. This should produce the ROC curve points, as the ROC curve should describe the prediction TPR and FPR with different probability thresholds. The code gives a result of:

threshold=1, count_tp=87, count_fp=175
threshold=2, count_tp=87, count_fp=174
threshold=3, count_tp=87, count_fp=174
threshold=4, count_tp=87, count_fp=173
threshold=5, count_tp=87, count_fp=171
threshold=6, count_tp=87, count_fp=169
...
threshold=94, count_tp=3, count_fp=1
threshold=95, count_tp=1, count_fp=0
threshold=96, count_tp=0, count_fp=0
threshold=97, count_tp=0, count_fp=0
threshold=98, count_tp=0, count_fp=0
threshold=99, count_tp=0, count_fp=0

At a 1% threshold, everyone that the classifier gave a probability of 1% or higher to survive is classified as a likely survivor. In this dataset, everyone is given 1% or more survival probability, so everyone is classified as a likely survivor. Since everyone is classified as a survivor, this gives true positives for all 87 real survivors, but also false positives for all 175 non-survivors. At a 6% threshold, the FPR has gone down a bit, with 169 non-survivors still falsely predicted as survivors.

At the high end of the threshold range, the situation reverses. At a threshold of 94%, only 3 true survivors get classified as likely survivors, meaning only 3 actual survivors scored a probability of 94% or more from the classifier. There is also one false positive at 94%, so one person who was predicted to have a really high probability to survive (94%) did not. At 95% there is only one predicted survivor, which is a true positive. After that, no-one scores 96% or more. Such high-confidence outliers would probably make interesting points to look at in more detail, but I won't go there in this post.

Using the true positive rates and false positive rates from the code above (variables tpr2 and fpr2), we can make a ROC curve for these calculations as:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr2, tpr2, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr2, tpr2))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC test')
plt.legend(loc="lower right")
plt.show()

FIGURE 3. ROC curve for thresholds of 1-99% in 1% steps.

Comparing this with the ROC figure from sklearn (FIGURE 2), it is a near-perfect match. We can further visualize this by plotting the two on top of each other:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr2, tpr2, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr2, tpr2))
plt.plot(fpr, tpr, color='blue', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC test')
plt.legend(loc="lower right")
plt.show()

FIGURE 4. My 1-99% ROC vs sklearn ROC.

There are some very small differences: the sklearn version has more “stepped” lines, whereas the one I generated from thresholds of 1-99% draws the lines a bit straighter (“smoother”). Both end up with the exact same AUC value (0.877). So I would call this ROC curve calculation solved, and claim that the actual calculations match what I made here. Solved as in I finally seem to have understood what it exactly means.

What is the difference?

Still, loose ends are not nice. So what is causing the small difference in the ROC lines from the sklearn implementation vs my code?

The FPR and TPR arrays/lists form the x and y coordinates of the graphs. So looking at those might help understand a bit:

fpr.shape

>(95,)

len(fpr2)

>99

The above shows that the sklearn FPR array has 95 elements, while the one I created has 99 elements. Initially I thought maybe increasing the detail in the FPR/TPR arrays I generate would help match the two. But it seems I already generated more points than are in the sklearn implementation. Maybe looking into the actual values helps:

fpr

array([0.        , 0.        , 0.00571429, 0.00571429, 0.01142857,
       0.01142857, 0.01714286, 0.01714286, 0.02285714, 0.02285714,
       0.02857143, 0.02857143, 0.03428571, 0.03428571, 0.04      ,
       0.04      , 0.04571429, 0.04571429, 0.05142857, 0.05142857,
       0.05714286, 0.05714286, 0.06285714, 0.06285714, 0.07428571,
       0.07428571, 0.07428571, 0.08      , 0.08      , 0.09142857,
       0.09142857, 0.09714286, 0.09714286, 0.10857143, 0.10857143,
       0.12      , 0.12      , 0.14857143, 0.14857143, 0.16      ,
       0.16      , 0.17142857, 0.17142857, 0.17142857, 0.18857143,
       0.18857143, 0.21142857, 0.21142857, 0.22285714, 0.22857143,
       0.23428571, 0.23428571, 0.24      , 0.24      , 0.26285714,
       0.28571429, 0.32571429, 0.33714286, 0.34857143, 0.34857143,
       0.37714286, 0.37714286, 0.38285714, 0.38285714, 0.38857143,
       0.38857143, 0.4       , 0.4       , 0.43428571, 0.45714286,
       0.46857143, 0.62857143, 0.62857143, 0.63428571, 0.63428571,
       0.64      , 0.64      , 0.64571429, 0.65714286, 0.66285714,
       0.66285714, 0.68      , 0.69142857, 0.72      , 0.73142857,
       0.73714286, 0.74857143, 0.79428571, 0.82285714, 0.82285714,
       0.85142857, 0.86285714, 0.88      , 0.89142857, 1.        ])

np.array(fpr2)

array([1.        , 0.99428571, 0.99428571, 0.98857143, 0.97714286,
       0.96571429, 0.96571429, 0.95428571, 0.92      , 0.89714286,
       0.84571429, 0.78857143, 0.66285714, 0.63428571, 0.58857143,
       0.52      , 0.49142857, 0.48      , 0.40571429, 0.39428571,
       0.38285714, 0.37714286, 0.35428571, 0.33714286, 0.32      ,
       0.31428571, 0.29714286, 0.26285714, 0.25142857, 0.24571429,
       0.23428571, 0.22857143, 0.21142857, 0.21142857, 0.20571429,
       0.20571429, 0.2       , 0.19428571, 0.18857143, 0.18857143,
       0.17714286, 0.17714286, 0.17142857, 0.16      , 0.14857143,
       0.13714286, 0.13142857, 0.12      , 0.10857143, 0.10857143,
       0.10857143, 0.09714286, 0.09714286, 0.09714286, 0.08571429,
       0.08571429, 0.08      , 0.07428571, 0.07428571, 0.06857143,
       0.06285714, 0.05714286, 0.05714286, 0.05142857, 0.05142857,
       0.04571429, 0.04      , 0.04      , 0.04      , 0.04      ,
       0.03428571, 0.02857143, 0.02857143, 0.02857143, 0.02857143,
       0.02285714, 0.02285714, 0.02285714, 0.01714286, 0.01714286,
       0.01142857, 0.01142857, 0.01142857, 0.01142857, 0.01142857,
       0.01142857, 0.00571429, 0.00571429, 0.00571429, 0.00571429,
       0.00571429, 0.00571429, 0.00571429, 0.00571429, 0.        ,
       0.        , 0.        , 0.        , 0.        ])

In the above, I print two arrays/lists. The first one (fpr) is the sklearn array, the second one (fpr2) is the one I generated myself. The fpr2 list contains many duplicate numbers one after the other, whereas fpr has far more unique values. My guess is that the (fpr, tpr) combinations from sklearn might contain only unique points, whereas the (fpr2, tpr2) points from my code repeat the same point multiple times.
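A quick way to check that guess is to count the distinct (FPR, TPR) coordinate pairs in both versions; a small sketch using the arrays defined above:

# count distinct (fpr, tpr) coordinate pairs in each version
sk_points = np.unique(np.column_stack([fpr, tpr]), axis=0)
my_points = np.unique(np.column_stack([fpr2, tpr2]), axis=0)
print(len(sk_points), len(my_points))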

What causes this? Looking at the sklearn roc_curve method, it actually returns 3 values, and so far I have only used 2 of those. The return values are the variables fpr, tpr, and thr. The thr one is not yet used, and is actually named thresholds in the sklearn docs. What are these thresholds?

thr

array([0.95088415, 0.94523744, 0.94433151, 0.88735906, 0.86820046,
       0.81286851, 0.80847662, 0.79241033, 0.78157711, 0.76377304,
       0.75724659, 0.72956179, 0.71977863, 0.70401182, 0.70160521,
       0.66863069, 0.66341944, 0.65174523, 0.65024897, 0.6342464 ,
       0.63001635, 0.61868955, 0.61479088, 0.60172207, 0.59225844,
       0.58177374, 0.58027421, 0.57885387, 0.56673186, 0.54986331,
       0.54937079, 0.54882292, 0.52469167, 0.51584338, 0.50426613,
       0.48014018, 0.47789982, 0.45232349, 0.44519259, 0.44203986,
       0.43892674, 0.43378224, 0.42655293, 0.42502826, 0.40283515,
       0.39572852, 0.34370988, 0.34049813, 0.32382773, 0.32099169,
       0.31275499, 0.31017684, 0.30953277, 0.30348576, 0.28008661,
       0.27736946, 0.24317994, 0.24314769, 0.23412169, 0.23370882,
       0.22219555, 0.22039886, 0.21542313, 0.21186501, 0.20940184,
       0.20734276, 0.19973878, 0.19716815, 0.18398295, 0.18369904,
       0.18369725, 0.14078944, 0.14070302, 0.14070218, 0.13874243,
       0.13799297, 0.13715783, 0.13399095, 0.13372175, 0.13345383,
       0.13057344, 0.1279474 , 0.1275107 , 0.1270535 , 0.12700496,
       0.12689617, 0.12680186, 0.11906236, 0.11834502, 0.11440003,
       0.10933941, 0.10860771, 0.10552423, 0.10468277, 0.0165529 ])

This array shows that the thresholds are quite different, and there is no fixed step size used to vary them. Unlike my attempt at doing it in 1% increments, this list has both much bigger and much smaller changes in it. Let's try my generator code with this same set of threshold values:

tpr3 = []
fpr3 = []
#do the calculations for sklearn thresholds
for threshold in thr:
    ss = np.where(pred_probs >= threshold)
    count_tp = 0
    count_fp = 0
    for i in ss[0]:
        if y_test.values[i] == 1:
            count_tp += 1
        else:
            count_fp += 1
    tpr3.append(count_tp/87.0)
    fpr3.append(count_fp/175.0)
    print("threshold="+str(threshold)+", count_tp="+str(count_tp)+", count_fp="+str(count_fp))

And with the TPR and FPR lists calculated for these thresholds, we can visualize them as well and compare against the sklearn coordinates:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr3, tpr3, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr3, tpr3))
plt.plot(fpr, tpr, color='blue', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC test')
plt.legend(loc="lower right")
plt.show()

Giving the figure:

FIGURE 5. Overlapping ROC curves with shared thresholds.

This time only one line is visible, since the two fully overlap. So I would conclude that now I have also got my ROC curve understanding validated. My guess is that sklearn does some calculations in the background for the ROC curve coordinates, to identify the threshold points where there are visible changes and to provide only those. While I would then use the ROC function that sklearn or whatever library I use (e.g., Spark) provides, the understanding I managed to build here should help me make better use of their results.
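For what it is worth, sklearn's roc_curve also has a drop_intermediate parameter (True by default) that drops thresholds which would not change the shape of the plotted curve, which would explain the smaller number of points. A quick check, which I did not run for this post, would be:

# keep all distinct thresholds instead of the pruned set
fpr_all, tpr_all, thr_all = roc_curve(y_test, pred_probs, drop_intermediate=False)
print(len(thr_all), len(thr))  # the first should be larger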

Varying threshold vs accuracy

OK, just because this is an overly long post already, and I still cannot stop, one final piece of exploration. What happens to the accuracy as the threshold is varied? As 0.5 seems to be the default threshold, is it always the best? In cases where we want to minimize false positives or maximize true positives, the value of threshold optimization is perhaps most obvious. But when just looking at accuracy, how does it change? To see, I collected the accuracy for every threshold value suggested by sklearn roc_curve(), and also for the exact 0.5 threshold (which was not in the roc_curve list, but which I assume is the classifier default):

tpr3 = []
fpr3 = []
accuracies = []
accs_ths = []

#do the calculations for sklearn thresholds
for threshold in thr:
    ss = np.where(pred_probs >= threshold)
    count_tp = sum(y_test.values[i] for i in ss[0])
    count_fp = len(ss[0]) - count_tp
    my_tpr = count_tp/87.0
    my_fpr = count_fp/175.0

    #below the threshold -> predicted non-survivors; count_tf counts the true negatives among them
    ssf = np.where(pred_probs < threshold)
    count_tf = sum(1 for i in ssf[0] if y_test.values[i] == 0)

    tpr3.append(my_tpr)
    fpr3.append(my_fpr)
    acc = (count_tp+count_tf)/y_test.count()
    accuracies.append(acc)
    accs_ths.append((acc, threshold))
    print("threshold="+str(threshold)+", tp="+str(count_tp)+", fp="+str(count_fp), ", tf="+str(count_tf), ", acc="+str(acc))

When the accuracy is graphed against the ROC curve, this looks like:

FIGURE 6. ROC thresholds vs Accuracy.

Eyeballing this, at the very left of FIGURE 6 the accuracy is equal to predicting all non-survivors correct (175/262 = 66.8%) but all survivors wrong. At the very right the accuracy is equal to predicting all survivors correct but all non-survivors wrong (87/262 = 33.2%). The sweet spot is somewhere around 83% accuracy, with about 65% true positive rate and maybe 8% false positive rate. I am just eyeballing this from the graph, so don't take it too literally. The code used to plot the accuracy on top of the ROC curve above:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr3, tpr3, color='blue', label='ROC curve (area = %0.3f)' % auc(fpr3, tpr3))
plt.plot(fpr3, accuracies, color='coral', label='Accuracy')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC test')
plt.legend(loc="lower right")
plt.show()

And to get the sorted list of accuracies by threshold:

accs_ths.sort(reverse=True)
accs_ths

results in:

  accuracy,           threshold,          tp, tn,  total correct
 (0.8396946564885496, 0.5802742149381693, 58, 162, 220),
 (0.8396946564885496, 0.5667318576914661, 59, 161, 220),
 (0.8358778625954199, 0.5788538683111758, 58, 161, 219),
 (0.8358778625954199, 0.5493707899389861, 60, 159, 219),
 (0.8358778625954199, 0.5246916651561455, 61, 158, 219),
 (0.8320610687022901, 0.6342464019774986, 52, 166, 218),
 (0.8320610687022901, 0.6186895526474085, 53, 165, 218),
 (0.8320610687022901, 0.6017220679887019, 54, 164, 218),
 (0.8320610687022901, 0.5817737394188721, 56, 162, 218),
 (0.8320610687022901, 0.549863313475955, 59, 159, 218),
 (0.8320610687022901, 0.548822920002867, 60, 158, 218),
 (0.8320610687022901, 0.5042661255726298, 62, 156, 218),
 (0.8320610687022901, 0.5000000000000000, 62, 156, 218),
 (0.8282442748091603, 0.6300163468361882, 52, 165, 217),
 (0.8282442748091603, 0.6147908791071057, 53, 164, 217),
 (0.8282442748091603, 0.5158433833082536, 61, 156, 217),
 (0.8282442748091603, 0.4778998233729275, 63, 154, 217),
 (0.8244274809160306, 0.5922584447936917, 54, 162, 216),
 (0.8244274809160306, 0.4801401821434857, 62, 154, 216),
 ...

Here a number of thresholds actually give a higher score than the 0.5 one I used as the reference default. The highest scoring thresholds are 0.5802 (58.02%) and 0.5667 (56.67%). The 0.5 threshold gets 218 predictions correct, with an accuracy of 0.8320 (matching the accuracy_score() from the beginning of this post). Looking at the thresholds and accuracies more generally, the ones slightly above 0.5 generally seem to do a bit better. Thresholds over 0.5 but under 0.6 seem to score best here. But there is always an exception, of course (0.515 scores lower). Overall, the differences between the values are small but interesting. Am I overfitting to the test data when I tune the threshold and evaluate against the same test data? If so, does the same consideration apply to optimizing for precision/recall using the ROC curve? Is there some other reason why the classifier would use a 0.5 threshold that is not optimal? My guess is it might be a minor artefact of over/underfitting. No idea, really.

Area Under the Curve (AUC)

Before I forget: the title of the post was ROC/AUC, so what is the area under the curve (AUC)? The curve is the ROC curve, and the AUC is the area under that ROC curve. See FIGURE 7, where I used the epic Aseprite to paint the area under the FIGURE 5 ROC curve. Brilliant. AUC refers to the area covered by the colored part under the ROC curve. The AUC value is calculated as the fraction of the overall area: consider the whole box of FIGURE 7 as summing up to 1 (TPR range x FPR range = 1 x 1 = 1), and in this case the AUC is the part marked in the figure as area = 0.877. So AUC is simply the size of the area under the curve relative to the full box (of size 1).

FIGURE 7. Area Under the Curve.
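Numerically this is just an integral of TPR over FPR, which can be approximated with the trapezoidal rule; a quick sketch using the sklearn-generated coordinates from above:

# trapezoidal integration of TPR over FPR (fpr from sklearn is sorted in increasing order)
print(np.trapz(tpr, fpr))  # should match the 0.877 given by auc(fpr, tpr) above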

I believe the rationale is: the more of the box the colored area covers, or the bigger the AUC value, the better overall performance one could expect from the classifier. As the area grows, the better the classifier is able to separate true positives from false positives across the different threshold values.

Uses for AUC/ROC

To me, AUC seems most useful in evaluating and comparing different machine learning algorithms. Which is also what I recall seeing it being used for. In such cases, the higher the AUC, the better overall performance you would get from the algorithm. You can then boast in your paper about having a higher overall performance metric than some other random choice.

ROC I see as mostly useful for providing extra information and an overview to help evaluate the options for TPR vs FPR at different threshold configurations of a classifier. The choice and interpretation depend on the use case, of course. The usual example in this domain is a test for cancer: you want to maximize the TPR so you miss fewer people who have cancer. You then look for the point on the ROC curve you can afford to sit on, weighing the cost against possibly missed cases. So you would want a high TPR there, as far as you can afford, I guess. You might get a higher FPR with it, but such is the tradeoff. In this case, the threshold would likely be lower rather than higher.

It seems harder to find examples of optimizing for a low FPR, with the tradeoff of a lower TPR as well. Perhaps one could look into the Kaggle competitions and pick, for example, the topic of targeted advertising. For a lower FPR, you would set a higher threshold rather than a lower one. But your use case could be pretty much anything, I guess.

Like I said, recently I have been looking into Spark, and some algorithms seem to only give out the evaluation as an AUC metric. Which is a bit odd, but I guess there is some clever reason for that. I have not looked too deep into that aspect of Spark yet; probably I am just missing the right magical invocations to get all the other scores.

Some special cases are also discussed, for example, in the Fawcett paper. One classifier might reach a higher point in ROC space at some threshold even while having a lower overall AUC value, so for some thresholds one classifier of the pair is better, while for other thresholds the other one is. Similarly, a classifier can have a higher AUC overall while the other one is still better for a specific use case. Sounds a bit theoretical, but interesting.

Why is it called Receiver Operating Characteristic (ROC)?

You can easily find information on the ROC curve's history referencing its origins in World War 2 and radar signal detection theory. The Wikipedia ROC article has some of this history, but it is quite short. As usual, Stack Exchange gives a good reference. The 1953 article seems paywalled, and I do not have access. This short description describes it as being used to measure the ability of a radio receiver to produce quality readings, enabling the operator to distinguish between false positives and true positives.

Elsewhere I read that it originated from the Pearl Harbor attack during WW2, where they tried to analyze why the radar operators failed to see the incoming attack aircraft. What do I know. The internet is full of one-liner descriptions on this topic, all circling around the same definitions but never going into sufficient detail.

Conclusions

Well, the conclusion is that this was way too long a post on a simple term. And reading it back, it is not even that clearly written. But that is what I got. It helped me think about the topic, and to clarify and verify what it really is. Good luck.


PoW vs PoS. Whut?

Recently I have tried to understand blockchains and how they actually work. It seems the internet is full of high-level descriptions that always stop before actually explaining how the different blockchain variants really work. What is proof-of-work in practice, and what does proof-of-stake actually mean? What actually happens when you deploy a blockchain node in the different variants? How does the blockchain “network” reach “consensus”? What is the blockchain “network”? What is “consensus”? What is “byzantine fault tolerance”, and how does it relate to blockchains?

I started by looking into Bitcoin, since it is the most popular and has the most information available. There is even a nice book about it, with reasonable amounts of implementation-level details as well: “Mastering Bitcoin: Programming the Open Blockchain, 2nd Edition” (naming it here just in case the Amazon link breaks). I am sure there are many other great books and resources, but I really liked that one. The Naivecoin tutorial I discussed before is also good stuff. So generally, the resources are there for understanding Bitcoin-style PoW blockchains, also at a more detailed level. But let's see what I think, and you can then correct me. I start with PoW and then give some PoS assumptions.

Proof of Work (PoW)

Proof of Work simply refers to running a specific hash function for long enough to find a set of values that produces a hash value matching some given criterion. An example using a generic block structure, mostly imitating Bitcoin block structure:

  • prev_hash: Previous block hash
  • merkle_root: Hash for data in this block (Merkle root)
  • timestamp: Current timestamp
  • diff_target: Difficulty target
  • nonce: A value that is changed to try to find different hashes
  • transaction_list: List of transactions included in this block

I use the above properties as an example for the hashing. Additionally, there would be a number of more generic metadata fields, such as a magic number, block size, and version. Such metadata is used when parsing to find the expected data size of a block (e.g., bytes to read from a socket), and to verify that the block data is valid (correct magic number, version number, ..). For more details on the actual Bitcoin block elements, see e.g., here.

To build a hash for PoW, the above list of data can all be hashed together. So it becomes something like this:

block_hash = SHA256(prev_hash+merkle_root+timestamp+diff_target+nonce+transaction_list)

This is slightly simplified, but serves the purpose. Here SHA256 is a specific hash function; various other hash functions are often used as well. The resulting block_hash can be used to verify the contents of the block, since to get the same hash value you need to have the exact same block contents as input to the hash function. This protects against anyone changing a processed block (e.g., modifying existing transactions already in the blockchain), since the attacker would have to re-calculate a new hash for this block, as well as for all blocks coming after it (so the longer/deeper the block is in the chain, the “safer” it is from modification, as it is buried deeper under new blocks).
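As a concrete (and heavily simplified) sketch of that formula in Python, with made-up example values; real implementations hash a precisely defined binary serialization, not a string join like this:

import hashlib

def block_hash(prev_hash, merkle_root, timestamp, diff_target, nonce, transaction_list):
    # concatenate the block fields and hash them with SHA256
    data = f"{prev_hash}{merkle_root}{timestamp}{diff_target}{nonce}{''.join(transaction_list)}"
    return hashlib.sha256(data.encode("utf-8")).hexdigest()

# hypothetical example values, just for illustration
print(block_hash("00ab...", "3f2c...", 1652345678, 1000000, 42, ["tx1", "tx2"]))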

The blockchain itself is just a long list of blocks, each referencing the previous block (via prev_hash). The term chain refers to each block linking to the previous one this way. This connection is enforced by including the prev_hash in the next block, along with the block_hash of the block itself. So you cannot just change the order of the blocks in the chain, as the linked prev_hash values would no longer match.

Look, I made a pretty picture:

FIGURE. Blocks linked into a chain via prev_hash references.

Here, the block itself contains the block_hash, prev_hash, and block_data. The block_hash is calculated over all the rest of the data in the block (including prev_hash). Prev_hash links the block to the previous block in the chain, as it is the block_hash of the previous block. Block_data contains all the rest of the data, and is of course included in the block_hash, meaning the block_hash verifies the integrity of all this data. Changing block_data (or prev_hash) would change block_hash, and thus invalidate the whole chain (the links would no longer be valid, as prev_hash would differ from the block_hash of the previous block). Cute.

Anyway. What is proof of work? Anyone can calculate a block hash using the above hashing formula, but not just any hash will be accepted by the blockchain network. The acceptance criterion is defined in the software designed to run the blockchain network. The most common one is a difficulty-based criterion. The Naivecoin tutorial has a good and practical description of one. The difficulty is calculated using an algorithm with several parameters. A typical example:

  • current_difficulty: Current mining (PoW) difficulty.
  • adjustment_interval: Number of blocks to produce until difficulty should be adjusted.
  • target_interval: The target time that should be spent between generating blocks.

The blockchain always aims to produce blocks with a time of target_interval between them. For example, Bitcoin uses a target_interval of 10 minutes and an adjustment_interval of 2016 blocks. With a good difficulty value, 2016 blocks should take 2 weeks to generate: 2016*10/60/24 = 14 days, where 2016 is the number of blocks in the adjustment interval, 10 is minutes per block, 60 is minutes in an hour, and 24 is hours in a day. If mining 2016 blocks takes less than 14 days, the difficulty is adjusted higher. If it takes more than 14 days, the difficulty is adjusted lower. This way the blockchain network adjusts its parameters over time to keep the rate of coins mined close to the specified rate. Many blockchains use a shorter target_interval than 10 minutes and a shorter adjustment_interval than 14 days. But generally, the idea is the same. This is the basic idea, which can be built on with other difficulty adjustment algorithms such as LWMA, which seems to be quite popular with many altcoins currently.
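A minimal sketch of that adjustment rule, in the spirit of the Naivecoin tutorial (the exact scaling rules differ per chain, so treat the numbers as illustration only):

def adjust_difficulty(current_difficulty, interval_start_time, interval_end_time,
                      adjustment_interval=2016, target_interval=10 * 60):
    # how long the last adjustment_interval blocks should have taken, in seconds
    expected = adjustment_interval * target_interval  # 2016 * 10 min = 14 days
    actual = interval_end_time - interval_start_time
    if actual < expected / 2:
        return current_difficulty + 1  # blocks came too fast -> make mining harder
    if actual > expected * 2:
        return current_difficulty - 1  # blocks came too slowly -> make mining easier
    return current_difficulty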

The process of mining in PoW systems is to try hashing different block configurations (and thus different hash inputs) until you get one that passes the difficulty check. The nonce parameter is there specifically for this purpose, to let miners try different values to find a hash that passes the difficulty check. Additionally, one can play to some extent with the choice of transactions to include, as well as the exact timestamp used. These all change the input to the hash function, and thus the produced hash value. But the nonce is the intended main value to change. Additional tricks can also be applied, such as an extra_nonce etc., but the main concept is the same.

What is the difficulty exactly? It is simply a reference to a hash value that needs to be found. With higher difficulty, the target hash value is set to be harder to find. With lower difficulty, it is set to be easier to find. This is based on the fact that a hash value is just a very large number, even if usually represented as a hex-string. Examples:

  • Difficulty target set to 100k: You need to find a hash value that is less than 100k.
  • Difficulty target set to 1M: You need to find a hash value that is less than 1M.

Out of these two, the 100k difficulty target is higher in difficulty, because there are far fewer accepted hash values (100000 of them). 1M is a much “easier” difficulty target, since there are 10x as many accepted hash output values (100000 vs 1 million). Thus, a higher difficulty target means a lower difficulty (i.e., it is easier to find a suitable hash and thus create a valid block). Most confusing, I am sure. There can be variants of this, such as requiring a number of leading zeroes in the hash output instead of a value below a specific number, but the concept is still the same.

The hashing formula presented far above (block_hash=…) produces a single value, and we cannot control what value it produces, since hash functions are one-way and no-one has figured out a way to reverse them (to directly generate an input that gives a specific hash output). So the only way to find a hash matching the target difficulty is to keep trying different input combinations until we hit one that produces a suitable hash value (below the difficulty target). The “Proof of Work” is then just proof that you have generated huge numbers of these hashes, enough to find one matching the difficulty criterion. The hashing is the work, and the proof is producing a hash that matches the difficulty target.
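Putting the pieces together, a toy mining loop could look roughly like this (same made-up block fields as in the earlier sketch; real miners work on binary data, at a vastly larger scale, and interpret the difficulty target differently per chain):

import hashlib

def mine(prev_hash, merkle_root, timestamp, diff_target, transaction_list):
    # keep incrementing the nonce until the hash, read as a number, is below the target
    nonce = 0
    while True:
        data = f"{prev_hash}{merkle_root}{timestamp}{diff_target}{nonce}{''.join(transaction_list)}"
        h = hashlib.sha256(data.encode("utf-8")).hexdigest()
        if int(h, 16) < diff_target:
            return nonce, h  # proof of work found
        nonce += 1

# a very "easy" (large) difficulty target so this toy example finishes in a few tries
nonce, h = mine("00ab...", "3f2c...", 1652345678, 2**252, ["tx1", "tx2"])
print(nonce, h)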

Peer to Peer (P2P) consensus

Once someone finds a block (produces an accepted hash..), the block needs to be added to the blockchain. This happens when the finding node propagates the new block with the valid block hash to its peers, which propagate it further to their peers, and so on across the whole network. The peers can all check that the data in the block produces a valid hash in relation to the difficulty target, and that the data in the block is otherwise verified (prev_hash matches the block_hash of the previous block in the chain, the transactions in transaction_list are valid, etc.). This of course requires all nodes to be connected in a peer-to-peer fashion.

The peer-to-peer network is established and continuously updates itself over time. When a blockchain node boots up, it typically establishes a connection to the blockchain network via a pre-configured set of seed nodes. These can be static IP addresses stored in the client, or a set of pre-defined DNS server addresses. Once the node has connected this way to its first set of peers, these peers provide it with additional peer nodes to connect to. The node can request more peers from these peers, and so on. Connections to nodes may drop over time, and the node can use the peer lists it receives from others to find new ones.

The initial synchronization of the blockchain state happens by the nodes sending their confirmed blocks, transactions, and other blockchain state information to each other and to any new peer joining the network. Once the nodes are synchronized, newly mined blocks are added to the blockchain as soon as some node finds them. New “money” is created by the mining node adding a “coinbase” transaction to any new block it finds (gets an accepted hash for), giving itself the number of coins agreed in the network specification for finding a new block.

So this is how distributed consensus is achieved in PoW networks. But how about PoS networks? Not too different (or so I think), but the way of creating and assigning new blocks has to be different. That is the whole point, of course..

Proof of Stake (PoS)

The above information on Proof of Work is actually relatively simple to find on the internet and in books. The Bitcoin book I linked above is one great source for such information. This is maybe because Bitcoin is such a well-known and long-standing project; it has a huge market for information, and so there is a lot of information about it. For Proof of Stake, there is much less information, and I cannot name a single PoS coin that would be such a “canonical” project. Ethereum is talking about making the switch from PoW to PoS, and some proposals for this in the form of the “Casper” proposals are available. However, these are written in a very theoretical manner, and digress all over the place. Not so great for someone like me, who just wants a good understanding at a bit more detailed level.

So what do I think Proof of Stake is? At a high level it is always described as distributing new coins based on the amount of coins you already have. So each round (staking interval) gives a set of new coins to existing owners, based on their “stake” in the network. The stake is the number of coins you already have and are willing to “stake”. Quite often this is described as requiring the user's wallet to be online to be “staking”, that is, to be able to receive these rewards. So one might expect that the wallet is connected to other peers in a peer-to-peer network similar to PoW, but that the network somehow agrees to give new coins periodically to someone “staking” instead of running a PoW hashing contest. But how does this staking actually happen? How does a distributed and presumably decentralized network ever agree on who gets the coins?

I found some Reddit post (which I have lost the link for) and a couple of StackExchange posts (post1, post2). And the Blackcoin and Peercoin whitepapers. These all seem to describe the same thing, even if each with a bit of a different twist. In any case, what they describe makes sense, so I will go with that. There are also the Ouroboros whitepaper, the Tendermint whitepaper, and the Ethereum PoS FAQ. But these drift off in a way too academic direction for me.

The algorithm I get from these is very similar to PoW hashing. The main difference is that there is no field in the hashed block data that can be varied to anywhere near the extent possible in the PoW implementations. Instead, only a selected set of core values in the block is given as input to the hash function. This set of core values is called the “kernel”. Since the data selected for this input has only a limited number of possible values, and thus hash outputs, it makes no sense for the “miner” to spend a lot of effort and energy trying different values, mainly because there are only a few values to try at each point in time. This is how you try to escape the hashing contest of PoW coins, and save the planet by reducing energy consumption.

So in this scenario, I would assume each “staker” takes their set of kernel values, along with a timestamp, hashes them, and if they get a value that fits the difficulty target of the network, they can be selected to “mint” new coins. The timestamp is masked to remove the lowest bits, in order to make it infeasible to try a large number of close-to-each-other timestamps. The number of coins owned is used to weight the hash outcome, so the likelihood of “minting” the new coins on each round grows with the number of coins staked. A number of previous block hashes is also included in the equation, to make it more difficult to try to come up with different hash values for future blocks.

From the posts I linked above, I get a more sensible explanation of how the node that creates a block is selected. The hash you end up with from your kernel data is compared against the network difficulty (similar to PoW), and the difficulty is also adjusted at similar intervals as with PoW. Some of the early coins (e.g., the Peercoin whitepaper) also discuss adding coin age into the equation, to favor coins that have not been selected recently. Later, the Blackcoin whitepaper sees this as a potential problem (even a possible attack vector) and removes age from the equation. And from there I guess it has diverged into all sorts of academic exercises that are simple but made to look complex.
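To summarize my (possibly wrong) understanding as a Python sketch; every name and constant here is my own illustration, not any real coin's protocol:

import hashlib

def stake_check(prev_block_hashes, kernel, timestamp, difficulty_target, coins_staked):
    # mask out the lowest bits of the timestamp so there are only a few values to try
    masked_time = timestamp & ~0xF
    data = f"{''.join(prev_block_hashes)}{kernel}{masked_time}"
    kernel_hash = int(hashlib.sha256(data.encode("utf-8")).hexdigest(), 16)
    # the more coins staked, the larger the target to hit -> higher chance to mint the block
    return kernel_hash < difficulty_target * coins_staked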

In any case, I guess there are other ways to do different parts of PoS. But the process in general, as I described above, seems like it would make sense. Please do correct me and tell me how it really works.. 🙂

Something I wonder after all this is: why is it always said that you need to have your wallet online to “stake”? I don't see the need, other than having something around to calculate your kernel hash. You could just grab the kernel input data from the wallet, and keep a separate program online to calculate the PoS hashes. This way there would be much less risk of the wallet being compromised, which is often cited as a risk of PoS staking. I must be missing something again?

 

Predicting issue categories on Github

Practical examples of applying machine learning seem to be a bit difficult to find, so I tried to create one for a presentation I was doing on testing and data analytics. I made a review of works in the area, and just chose one to illustrate. This one tries to predict a target category to assign to an issue report. I used ARM mBed OS as a test target, since it has its issues available on Github and some people who work with it were attending the presentation.

This demo “service” I created works by first training a predictive model based on a set of previous issue reports, which I downloaded from the issue repository. The amount of data available there was small enough that I just downloaded the issues manually using the Github API, which lets me download the data for 100 issues at a time. Automating the download should be pretty easy if needed. The amount of data is small and there is a large number of categories to predict, so it is not the best setup for results, but it serves as an example to illustrate the concept.

And no, there is no deep learning involved here, so it is not quite that trendy. I don't think it is all that necessary for this purpose or this size of data. But it could work better, of course; if you do try, post the code so we can play with it as well.

The Github issues API allows me to download the issues in batches. For example, to download page 12 of closed issues, with 100 issues per page, the URL to request is https://api.github.com/repos/ARMmbed/mbed-os/issues?state=closed&page=12&per_page=100. The API seems to cap the page size at 100 even if I ask for more. Or I just didn't quite use it right, whichever. The API docs describe the parameters quite clearly. I downloaded open and closed issues separately, even if I did not use the separation in any meaningful way in the end.
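If you did want to automate the download, a small sketch using the requests library could look like this; the file naming is just my assumption, chosen to match the glob patterns used further below:

import json
import requests

def download_issues(state, pages, per_page=100):
    # fetch pages of issues from the Github API and store each page as a JSON file
    for page in range(1, pages + 1):
        url = ("https://api.github.com/repos/ARMmbed/mbed-os/issues"
               f"?state={state}&page={page}&per_page={per_page}")
        issues = requests.get(url).json()
        with open(f"data/mbed-{state}-{page}.json", "w", encoding="utf-8") as f:
            json.dump(issues, f)

download_issues("closed", pages=12)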

The code here is all in Python. The final classifier/prediction services code is available on my Github repository.

First build a set of stopwords to do some cleaning on the issue descriptions:

from nltk.corpus import stopwords
from string import punctuation

stop_words = set(stopwords.words('english'))
stop_words = stop_words.union(set(punctuation))
stop_words.update(["''", "--", "**"])

The above code uses the common NLTK stopwords, a set of punctuation symbols, and a few commonly occurring symbol combinations I found in the data. Since later on I clean it up with another regular expression, probably just the NLTK stopwords would suffice here as well..

To preprocess the issue descriptions before applying the machine learning algorithms:

import re
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def preprocess_report(body_o):
	#clean issue body text. leave only alphabetical and numerical characters and some specials such as +.,:/\
	body = re.sub('[^A-Za-z0-9 /\\\_+.,:\n]+', '', body_o)
	# replace URL separators with space so the parts of the url become separate words
	body = re.sub('[/\\\]', ' ', body)
	# finally lemmatize all words for the analysis
	lemmatizer = WordNetLemmatizer()
	# text tokens are basis for the features
	text_tokens = [lemmatizer.lemmatize(word) for word in word_tokenize(body.lower()) if word not in stop_words]
	return text_tokens

The above code is intended to remove all but standard alphanumeric characters (plus a few specials such as +.,:/\) from the text, remove stop words, and tokenize the remaining text into separate words. It also splits URLs into parts, so that the parts become separate words. The lemmatization changes known words into their base forms (e.g., “cars” becomes “car”). This just makes it easier for the machine learning algorithm to match words together. Another option is stemming, but lemmatization produces more human-friendly words, so I use that.

I stored the downloaded issues as JSON files (as the Github API provides them) in the data directory. To read all the filenames for processing:

import glob

#names of files containing closed and open issues (at time of download)
closed_files = glob.glob("data/*-closed-*")
#open issues, assuming the files follow a matching naming pattern
open_files = glob.glob("data/*-open-*")

To process those files, I need to pick only the issues with an assigned “component” value, since that is the training target label. The algorithm is trained to predict this “component” value from the issue description, so without this label, a piece of data is not useful for training.

#global collectors filled in by process_files()
all_text_tokens = []
component_labels = []
total = 0

def process_files(files):
	'''
	process the given set of files by collecting issue body text and labels.
	also cleans and lemmatizes the body text

	:param files: names of files to process
	:return: nothing
	'''
	global total

	for filename in files:
		with open(filename, encoding="utf-8") as json_data:
			print(filename)
			all_issues = json.load(json_data)
			for issue in all_issues:
				labels = issue["labels"]
				for label in labels:
					if label["name"].startswith("component:"):
						name = label["name"]
						body_o = issue["body"]
						text_tokens = preprocess_report(body_o)
						all_text_tokens.append((text_tokens))
						#component_labels are prediction targets
						component_labels.append(name)
						total += 1

print("total: ", total)
						

There is a limited number of such labeled data items, as many of the downloaded issues do not have this label assigned. The print at the end of the above code shows the total number of items with the “component” label given, and the number in this dataset is 1078.

Besides removing stop-words and otherwise cleaning up the documents for NLP, combining words sometimes makes sense. Pairs, triplets, and so on can be meaningful. A typical example is the words “new” and “york” in a document, versus “new york”. The latter is an example of a bi-gram, combining two words into “new_york”. To do this, I use the gensim package:

import gensim

#https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
# Build the bigram and trigram models
bigram = gensim.models.Phrases(all_text_tokens, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[all_text_tokens], threshold=100)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

#just to see it works
print(trigram_mod[bigram_mod[all_text_tokens[4]]])

#transform identified word pairs and triplets into bigrams and trigrams
text_tokens = [trigram_mod[bigram_mod[text]] for text in all_text_tokens]

#build whole documents from text tokens. some algorithms work on documents not tokens.
texts = [" ".join(tokens) for tokens in text_tokens]

The above code uses thresholds and minimum co-occurrence counts to avoid combining every possible word with every other possible word. So only word pairs and triplets that are commonly found to occur together are used (replaced) in the documents.

Use the Python data processing library Pandas to turn it into a suitable format for the machine learning algorithms:

from pandas import DataFrame

df = DataFrame()

df["texts"] = texts
df["text_tokens"] = text_tokens
df["component"] = component_labels

print(df.head(2))

First to have a look at the data:

#how many issues are there in our data for all the target labels, assigned component counts
value_counts = df["component"].value_counts()
#print how many times each component/target label exists in the training data
print(value_counts)
#remove all targets for which we have less than 10 training samples.
#K-fold validation with 5 splits requires min 5 to have 1 in each split. This makes it 2, which is still tiny but at least it sorta runs
indices = df["component"].isin(value_counts[value_counts > 9].index)
#this slices the dataframe with the boolean mask, keeping only rows whose component has at least 10 samples
df = df.loc[indices, :]

The above code actually already does a bit more: it also filters the dataset to remove the rows with component values that have fewer than 10 items. This is the unfiltered list:

component: tools              162
component: hal                128
component: export             124
component: networking         118
component: drivers            110
component: rtos                88
component: filesystem          80
component: tls                 78
component: docs                60
component: usb                 54
component: ble                 38
component: events              14
component: cmsis               10
component: stdlib               4
component: rpc                  4
component: uvisor               2
component: greentea-client      2
component: compiler             2

And after filtering, the last four rows will have been removed. So in the end, the dataset will not have any rows with the labels “rpc”, “uvisor”, “greentea-client”, or “compiler”. This is because I will later use stratified 5-fold cross-validation, which requires a minimum of 5 items of each label. Filtering with a minimum of 10 instances per label makes it at least possible to have 2 of the least common “component” in each fold.

In a more realistic case, much more data would be needed to cover all categories, and I would also look at possibly combining some of the different categories. I would also rebuild the model every now and then, depending on how much effort it takes, how much new data comes in, etc.

To use the “component” values as target labels for machine learning, they need to be numerical (integers). This does the transformation:

from sklearn.preprocessing import LabelEncoder

# encode class values as integers
encoder = LabelEncoder()
encoded_label = encoder.fit_transform(df.component)

Just to see how the mapping of integer id’s to labels after label encoding looks:

unique, counts = np.unique(encoded_label, return_counts=True)
print(unique) #the set of unique encoded labels
print(counts) #the number of instances for each label

The result (first line = id, second line = number of items):

[ 0  1  2  3  4  5  6  7  8  9 10 11 12]
[ 38  10  60 110  14 124  80 128 118  88  78 162  54]

Mapping the labels to integers:

#which textual label/component name matches which integer label
le_name_mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
print(le_name_mapping)

#which integer matches which textual label/component name
le_id_mapping = dict(zip(encoder.transform(encoder.classes_), encoder.classes_))
print(le_id_mapping)

So the first is to print “label: id” pairs, and the second to print “id: label” pairs. The first one looks like this:

'component: ble': 0, 
'component: cmsis': 1, 
'component: docs': 2, 
'component: drivers': 3, 
'component: events': 4, 
'component: export': 5, 
'component: filesystem': 6, 
'component: hal': 7, 
'component: networking': 8, 
'component: rtos': 9, 
'component: tls': 10, 
'component: tools': 11, 
'component: usb': 12

Now, to turn the text into suitable input for a machine learning algorithm, I transform the documents into their TF-IDF representation. Well, if you go all deep learning with LSTM and the like, this may not be necessary. But don’t take my word for it, I am still trying to figure some of that out.

TF-IDF stands for term frequency (TF) – inverse document frequency (IDF). For example, if the word “bob” appears often in a document, it has a high term frequency for that document. Generally, one might consider such a word to describe that document well (or the concepts in the document). However, if the same word also appears commonly in all the documents (in the “corpus”), it is not really specific to that document, and not very representative of that document vs all other documents in the corpus. So IDF is used to modify the TF so that words that appear often in a document but less often in others in the corpus get a higher weight. And if the word appears often across many documents, it gets a lower weight. This is TF-IDF.
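
To make this a bit more concrete, here is a minimal sketch (not from the original notebook) that runs scikit-learn’s TfidfVectorizer on a tiny toy corpus and prints the scores for the first document:

from sklearn.feature_extraction.text import TfidfVectorizer

#toy corpus: "bob" appears in every document, "crash" only in the first one
docs = [
    "bob reports a crash in the driver",
    "bob asks about the documentation",
    "bob opens an issue about networking",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  #sparse matrix of shape (documents, unique words)

#scores for the first document: a word common to all documents ("bob") gets a
#lower weight than a word specific to this document ("crash"); words not in the
#document score 0
for word, idx in sorted(vectorizer.vocabulary_.items()):
    print("%-15s %.3f" % (word, tfidf[0, idx]))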

Traditional machine learning approaches also require a fixed-size set of input features. Since documents are of varying length, this can be a bit of an issue. Well, I believe some deep learning models also require this (e.g., CNN), while others less so (e.g., sequential models such as LSTM). Digressing. TF-IDF also (as far as I understand) results in a fixed-length feature vector for all documents. Or read this on Stack Overflow and make your own mind up.

Anyway, to my understanding, the feature space (set of all features) after TF-IDF processing becomes the set of all unique words across all documents. Each of these is given a TF-IDF score for each document. For the words that do not exist in a document, the score is 0. And most documents don’t have all words in them, so this results in a very “sparse matrix”, where the zeroes are not really stored. That’s how you can actually process a reasonably sized set of documents in memory.

So, in any case, to convert the documents to their TF-IDF representation:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5)

#transform all documents into TF-IDF vectors.
#TF-IDF vectors are all same length, same word at same index, value is its TFIDF for that word in that document
features_transformed = vectorizer.fit_transform(features)

The above code fits the vectorizer to the corpus and then transforms all the documents into their TF-IDF representations. To my understanding (from SO), the fit part counts the word occurrences in the corpus, and the transform part uses these overall counts to transform each document into TF-IDF.

It is also possible to print out all the words the vectorizer found in the corpus:

#the TFIDF feature names is a long list of all unique words found
print(vectorizer.get_feature_names())
feature_names = np.array(vectorizer.get_feature_names())
print(len(feature_names))

Now to train a classifier to predict the component based on a given document:

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

kfold = StratifiedKFold(n_splits=5) #5-fold cross validation

#the classifier to use, the parameters are selected based on a set i tried before
clf = RandomForestClassifier(n_estimators=50, min_samples_leaf=1, min_samples_split=5)

results = cross_val_score(clf, features_transformed, encoded_label, cv=kfold)

print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

#fit the classifier on the TFIDF transformed word features, train it to predict the component
clf.fit(features_transformed, encoded_label)
probabilities = clf.predict_proba(features_transformed[0])
print(probabilities)

In the above I am using a RandomForest classifier, with a set of parameters previously tuned. I am also using 5-fold cross-validation, meaning the data is split into 5 different parts. The parts are “stratified”, meaning each fold has about the same percentage of each target label as the original set. This is why I removed the labels with fewer than 10 instances in the beginning, to have at least 2 for each class in each fold. Which is still super-tiny, but that’s what this example is about.

The last part of the code above also runs a prediction on one of the transformed documents just to try it out.

Now, to run predictions on previously unseen documents:

import json
import requests

def predict_component(issue):
	'''
	use this to get a set of predictions for a given issue.

	:param issue: issue id from github.
	:return: list of tuples (component name, probability)
	'''
	#first load text for the given issue from github
	url = "https://api.github.com/repos/ARMmbed/mbed-os/issues/" + str(issue)
	r = requests.get(url)
	url_json = json.loads(r.content)
	print(url_json)
	#process the loaded issue data to format matching what the classifier is trained on
	issue_tokens = preprocess_report(url_json["body"])
	issue_tokens = trigram_mod[bigram_mod[issue_tokens]]
	issue_text = " ".join(issue_tokens)
	features_transformed = vectorizer.transform([issue_text])
	#and predict the probability of each component type
	probabilities = clf.predict_proba(features_transformed)
	result = []
	for idx in range(probabilities.shape[1]):
		name = le_id_mapping[idx]
		prob = (probabilities[0, idx]*100)
		prob_str = "%.2f%%" % prob
		print(name, ":", prob_str)
		result.append((name, prob_str))
	return result

This code takes as a parameter an issue number for the ARM mBed GitHub repo, downloads the issue data, and preprocesses it in the same way as the training data (clean, tokenize, lemmatize, TF-IDF). The result is then used as a set of features to predict the component, based on the model trained earlier.

The “predict_component” method/function can then be called from elsewhere. In my case, I made a simple web page to call it. As noted in the beginning of this post, you can find that webserver code, as well as all the code above on my Github repository.

That’s pretty much it. Not very complicated to put some Python lines one after another, but knowing which lines and in which order is perhaps what takes the time to learn. Having someone else around to do it for you if you are a domain expert (e.g., testing, software engineering or similar in this case) is handy, but it can also be useful to have some idea of what happens, or how the algorithms in general work.

Something I left out in all the above was the code to try out different classifiers and their parameters. So I will just put it below for reference.

First a few helper methods:

import numpy as np
import pandas as pd

def top_tfidf_feats(row, features, top_n=25):
	''' Get top n tfidf values in row and return them with their corresponding feature names.'''
	topn_ids = np.argsort(row)[::-1][:top_n]
	top_feats = [(features[i], row[i]) for i in topn_ids]
	df = pd.DataFrame(top_feats)
	df.columns = ['feature', 'tfidf']
	return df

#this prints it for the first document in the set
arr = features_test_transformed[0].toarray()
top_tfidf_feats(arr[0], feature_names)

def show_most_informative_features(vectorizer, clf, n=20):
	feature_names = vectorizer.get_feature_names()
	coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
	top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
	for (coef_1, fn_1), (coef_2, fn_2) in top:
		print ("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))

In the above code, “top_tfidf_feats” prints the top words with the highest TF-IDF scores for a document. So in a sense, it prints the words that TF-IDF has determined to most uniquely represent that document.

The “show_most_informative_features” function prints the top features that a given classifier has determined to be most descriptive/informative for distinguishing the target labels. This only works for certain classifiers, which have such simple coefficients (feature weights), such as multinomial naive Bayes (MultinomialNB below).

Here is the code to actually try out the classifiers then:

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(features_train_transformed, labels_train)

from sklearn.metrics import accuracy_score

y_pred = clf.predict(features_test_transformed)
y_true = labels_test
acc_score = accuracy_score(y_true, y_pred)
print("MNB accuracy:"+str(acc_score))

show_most_informative_features(vectorizer, clf)

#try it out on a single document
probabilities = clf.predict_proba(features_test_transformed[0])
print(probabilities)

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

#set of parameters to try
estimators = [10, 20, 30, 40, 50]
min_splits = [5, 10, 20, 30, 40, 50]
min_leafs = [1, 2, 5, 10, 20, 30]

kfold = StratifiedKFold(n_splits=5) #5-fold cross validation

best_acc = 0.0
best_rf = None
for estimator in estimators:
	for min_split in min_splits:
		for min_leaf in min_leafs:
			print("estimators=", estimator, "min_split=", min_split, " min_leaf=", min_leaf)

			clf = RandomForestClassifier(n_estimators=estimator, min_samples_leaf=min_leaf, min_samples_split=min_split)

			results = cross_val_score(clf, features_transformed, encoded_label, cv=kfold)

			print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

			if results.mean() > best_acc:
				best_acc = results.mean()
				best_rf = clf
				print("found better:", best_acc, ", ", best_rf)

print("best classifier:")
print(best_rf)

best_acc = 0
best_rf = None
for estimator in estimators:
	for min_split in min_splits:
		for min_leaf in min_leafs:
			print("estimators=", estimator, "min_split=", min_split, " min_leaf=", min_leaf)

			clf = RandomForestClassifier(n_estimators=estimator, min_samples_leaf=min_leaf, min_samples_split=min_split)
			clf.fit(features_train_transformed, labels_train)

			pred = clf.predict(features_test_transformed)

			accuracy = accuracy_score(labels_test, pred)

			print(accuracy)

			if accuracy > best_acc:
				best_acc = accuracy
				best_rf = clf
				print("found better:", best_acc, ", ", best_rf)
	

In the code above, I use loops to run through the parameters. There is also GridSearchCV in the scikit-learn library, as well as RandomizedSearchCV (for cases where trying all combinations is too expensive). But I prefer the ability to control the loops, print out whatever I like, and all that.
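
For reference, this is roughly what the same search could look like with scikit-learn’s GridSearchCV (a sketch, not part of the original notebook):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

#same parameter ranges as in the loops above
param_grid = {
    "n_estimators": [10, 20, 30, 40, 50],
    "min_samples_split": [5, 10, 20, 30, 40, 50],
    "min_samples_leaf": [1, 2, 5, 10, 20, 30],
}

#tries every combination with 5-fold stratified cross-validation
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=StratifiedKFold(n_splits=5))
search.fit(features_transformed, encoded_label)

print("best score:", search.best_score_)
print("best parameters:", search.best_params_)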

The loop-based code above also shows two ways I tried to train/evaluate the RandomForest parameters: first with k-fold cross-validation, then with a single train/test split. I picked MultinomialNB and RandomForest because some internet searching gave me the impression they might work reasonably well for unbalanced class sets such as this one. Of course, the final idea is always to try and see what works. This seemed to work quite fine for me. Then again, machine learning always seems to be about iterating, learning, and updating as you go. More data could change all of this, as could finding a mistake, gaining more domain or analytics knowledge, noticing mismatching results, or anything really.

What “unbalanced” refers to is the number of instances of the different components in this dataset: some components have many bug reports, while others have far fewer. For many learning algorithms this seems to be an issue. Some searches indicated RandomForest should be fairly robust for this type of data, which is another reason I used it.

Running the above code to experiment with the parameters also produced some slightly concerning results. The accuracy for the classifier ranged from 30% to 95% with smallish parameter changes. I would guess that also speaks for the small dataset causing potential issues. Re-running the same code would also give different classifications for new (unseen) instances, which is what you might expect when I am not setting the randomization seed. But then I would also expect the accuracy to vary somewhat, which it didn’t. So just don’t take this as more than an example of how you might apply ML for some SW testing related tasks. Take it to highlight the need to always learn more, try more things, get a decent amount of data, evolve models constantly, etc. And post some comments on all the things you think are wrong in this post/code so we can verify the approach of learning and updating all the time :).
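
If repeatable runs are wanted, one option is to fix the random seeds. A minimal sketch of the idea (not from the original notebook), reusing the same parameter values as above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

#fixing random_state makes the forest construction repeatable between runs (42 is an arbitrary choice)
clf = RandomForestClassifier(n_estimators=50, min_samples_leaf=1, min_samples_split=5, random_state=42)

kfold = StratifiedKFold(n_splits=5)
results = cross_val_score(clf, features_transformed, encoded_label, cv=kfold)
print("accuracy: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))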

In any case, I hope the example is useful for giving an idea of one way machine learning could be applied to software testing related tasks. Now write me some nice LSTM or whatever is the latest trend in deep learning models, figure out any issues in my code, or whatever, and post some comments. Cheers.

Porting Naivecoin tutorial to Golang (first parts)

Recently I got interested in blockchains, and how they work in practice. I found a nice tutorial called Naivecoin: a tutorial for building a cryptocurrency. There is also a nice book on the topic called Mastering Bitcoin: Programming the Open Blockchain. Had to get that too, didn’t I.

So to get a better understanding of the topic, I tried implementing the tutorial in Go. Using a different programming environment requires me to look up the actual implementation and try to understand it, as opposed to just copy-pasting all the code. I implemented the parts up to part 3 and set my code up on GitHub; hopefully I will update it with the remaining parts of the Naivecoin tutorial, and other related topics, sometime later.

The Naivecoin first part focuses on building the basic blockchain. To start, a block structure is needed to describe the blockchain:

type Block struct {
	Index        int  //the block index in the chain
	Hash         string //hash for this block
	PreviousHash string //hash for previous block
	Timestamp    time.Time //time when this block was created
	Data         string //the data in this block. could be anything. not really needed since real data is transaction but for fun..
	Transactions []Transaction  //the transactions in this block
	Difficulty	 int //block difficulty when created
	Nonce		 int //nonce used to find the hash for this block
}

A blockchain is a long list (or a chain) of blocks. The Block structure above has various fields to describe its position in the chain, its contents, and metadata used to provide assurance over the chain validity.

  • Index: The number of the block on the blockchain. So the 1st block is 1, the 10th block is 10.
  • Hash: Cryptographic hash of all the fields in the block together. Verifies the block has not been modified since creation (e.g., by changing a transaction address).
  • PreviousHash: Should match the Hash value of the previous block in the blockchain. Useful to verify chain integrity I am sure.
  • Timestamp: Time when this block was created. Used for hash variation, and to keep the creation of blocks within a defined time limit (to make some attacks harder).
  • Data: The blockchain is supposed to store data in an “immutable ledger”. I guess the typical blockchain application is still cryptocurrencies, where the data is the Transactions part. I put this extra data field in just to play with other data if I feel like it.
  • Transactions: List of transactions included in this block. Moving coins from one person to another.
  • Difficulty: The mining difficulty when this block was created. Useful to check the hash validity etc.
  • Nonce: A value varied while trying to find a hash that matches the set difficulty. With higher difficulty, many nonce values may need to be tried. But for this tutorial implementation I guess this is fine.

The blockchain is composed of these blocks, each referencing the previous one, in a chain. Thus the blockchain, woohoo. So, first how to calculate the hash for a block:

//calculate hash string for the given block
func hash(block *Block) string {
	indexStr := strconv.Itoa(block.Index)
	timeStr := strconv.FormatUint(uint64(block.Timestamp.Unix()), 16) //base 16 output
	nonceStr := strconv.Itoa(block.Nonce)
	diffStr := strconv.Itoa(block.Difficulty)
	txBytes, _ := json.Marshal(block.Transactions)
	txStr := string(txBytes)
	//this joins all the block elements to one long string with all elements appended after another, to produce the hash
	blockStr := strings.Join([]string{indexStr, block.PreviousHash, timeStr, diffStr, block.Data, txStr, nonceStr}, " ")
	bytes := []byte(blockStr)
	hash := sha256.Sum256(bytes)
	s := hex.EncodeToString(hash[:]) //encode the Hash as a hex-string. the [:] is slicing to match datatypes in args
	return s
}

I think Bitcoin uses something like a Merkle tree to combine elements in a block. But in the above, all the elements in a block are simply turned into strings, concatenated into a single long string, and this string is hashed. So again, we could maybe do better but it works well for a tutorial such as Naivecoin.

Now, to create the blocks. As visible in the structure above, each block has to reference the previous ones to create the chain. And to provide assurance over the overall chain, hashes of the block and of the previous block are provided. But the chain has to start somewhere. The first block in the chain is called the genesis block. It has no previous hash, as it has no previous block. So we create it specifically first:

//create genesis block, the first one on the chain to bootstrap the chain
func createGenesisBlock(addToChain bool) Block {
	genesisTime, _ := time.Parse("Jan 2 15:04 2006", "Mar 15 19:00 2018")
	block := Block{1, "", "0", genesisTime, "Teemu oli täällä", nil,1, 1}
	hash := hash(&block)
	block.Hash = hash
	if addToChain {
		globalChain = append(globalChain, block)
	}
	return block
}

More generally, to create non-genesis blocks:

//create a block from the given parameters, and find a nonce to produce a hash matching the difficulty
//finally, append new block to current chain
func createBlock(txs []Transaction, blockData string, difficulty int) Block {
	chainLength := len(globalChain)
	previous := globalChain[chainLength-1]
	index := previous.Index + 1
	timestamp := time.Now().UTC()
	nonce := 0
	newBlock := Block{index, "", previous.Hash, timestamp, blockData, txs, difficulty, nonce}
	for {
		hash := hash(&newBlock)
		newBlock.Hash = hash
		if verifyHashVsDifficulty(hash, difficulty) {
			addBlock(newBlock)
			return newBlock
		}
		nonce++
		newBlock.Nonce = nonce
	}
}

The above code takes the block transactions, data and the current difficulty of the chain as parameters. The rest of the information required to create a block is taken from the previous block (block index, previous block hash) or system clock (current timestamp). Finally, it loops the different nonce values until it finds a hash matching the current difficulty level. Like I mentioned before, it might be a bit simplified, but this is certainly good enough for a tutorial such as the Naivecoin.

So, in this case, to verify the block difficulty:

func verifyHashVsDifficulty(hash string, difficulty int) bool {
	prefix := strings.Repeat("0", difficulty)
	return strings.HasPrefix(hash, prefix)
}

This is quite simple: it just checks that the given hash string starts with a given number of zeroes. In this case the measure of difficulty is the number of zeroes the hash should start with. In a more general case, I believe the difficulty can be a number under which the hash value should fall (2nd top answer at the time of writing this). That gives you much more granularity to define the difficulty.. But, again, no need to go overly complex on things in a Naivecoin tutorial.

Similarly, adding a block to the blockchain is quite simple:

//add a new block to the existing chain
func addBlock(block Block) {
	chainLength := len(globalChain)
	previousBlock := globalChain[chainLength-1]
	block.PreviousHash = previousBlock.Hash
	globalChain = append(globalChain, block)
	for _, tx := range block.Transactions {
		addTransaction(tx)
	}
	//todo: check block hash matches difficulty
}

So as I commented above, I should also check that the new block matches the required difficulty. Would not be too hard but I will leave that update for the next version.

Adding the block also requires adding all the transactions within that block to the list of active transactions:

func addTransaction(tx Transaction) {
	oldTx := findUnspentTransaction(tx.Sender, tx.Id)
	if oldTx >= 0 {
		print("transaction already exists, not adding: ", tx.Id)
		return
	}
	allTransactions = append(allTransactions, tx)
	for _, txIn := range tx.TxIns {
		deleteUnspentTransaction(tx.Sender, txIn.TxId)
	}
	for idx, txOut := range tx.TxOuts {
		utx := UnspentTxOut{tx.Id, idx, txOut.Address, txOut.Amount}
		unspentTxOuts = append(unspentTxOuts, utx)
	}
}

So, as the Naivecoin tutorial so nicely explains, each transaction has two types of “sub-transactions” as I call them here. Not an official term or anything. But these are:

  • TxIn: Transaction inputs.
  • TxOut: Transaction outputs.

I found the explanations related to how these work very confusing to start with. Eventually I got it, but it is not the most intuitive thing, especially with the terminology of Transaction (Tx), TxIn, TxOut, … Especially since a TxIn and a TxOut are mostly the same thing, just expressed in a different way. Well, that is my understanding anyway. Please do correct me.

A TxOut (would it be correct to call it a transaction output?) is what creates “coins” to spend. Sending coins to someone creates a TxOut. Since a single transaction can contain multiple TxOut instances, it just means you can send coins to multiple people in a single transaction. Or multiple wallets that is. The most common seems to be to send coins to someone else and back to yourself. Why is this?

To create TxOut’s you need a matching amount of TxIn’s; that is, the amount of coins referenced in each should match. This check should actually be a part of the addTransaction function above, to verify that the transaction is valid by checking that the input and output amounts add up to the same sums. But you can only add existing TxOut instances to a transaction as TxIn’s. So if you got a TxOut giving you 100 coins, you can only add that whole TxOut to your next transaction as a TxIn if you want to send someone coins. So the TxIn has to be 100 coins in this case. What if you want to send only 50 coins? Then you put in the single TxIn with 100 coins, but create 2 TxOut’s for that transaction: one to send 50 coins to the recipient, and another to send the remaining 50 coins back to yourself. This way the balances match again. Confusing? Yes. Check the pretty pictures some way down this post.

Of course, if you send coins to an address that is not used, you will “burn” those coins. No-one can ever access them, because no-one knows the private key to match the public key used to sign that TxOut. You could maybe make a lot of money if you could find a key to some of the bigger coin-burn addresses. But I digress.

How do the coins initially come into existence if you can only send coins by referencing an existing TxOut? Where does the first TxOut come from? It has to all be bootstrapped somehow, right? This is what is called a Coinbase transaction. No, not according to the cryptocurrency selling company. They took the name from the term, not the other way around. A coinbase transaction might look something like this:

var COINBASE_AMOUNT = 1000

func createCoinbaseTx(address string) Transaction {
	var cbTx Transaction

	var txIn TxIn
	txIn.TxIdx = len(globalChain)
	cbTx.TxIns = append(cbTx.TxIns, txIn)

	var txOut TxOut
	txOut.Amount = COINBASE_AMOUNT
	txOut.Address = address
	cbTx.TxOuts = append(cbTx.TxOuts, txOut)

	cbTx.Id = calculateTxId(cbTx)
	cbTx.Signature = "coinbase"

	return cbTx
}

The above code creates the special transaction called the coinbase transaction. In proof-of-work type solutions, such as in the Naivecoin tutorial, this is added to each mined block. So when a miner finds a hash for a new block (thus “creating” it), they get the reward for the block. This reward is paid as the coinbase transaction. Each block can have one, and the miner can create it in the block. I would expect the miner to then put their own address in the transaction as the receiver, i.e., as the TxOut address. The TxIn in this case comes from nowhere, from thin air, from a made-up input. This is how the money is made here..

To my understanding, the way coinbase transactions are verified is simply by each node in the network accepting only a single coinbase transaction in a block, with specific parameters. These should match the number of coins to be rewarded according to the coin specification, and the defined coinbase signature. I believe in Bitcoin you can set many of the other fields as you like. For example, people can hide messages in the TxIn address or some other fields of the coinbase transaction. Because they do not need to match anything real. Once the coinbase amount is issued to a miner, they can then send coins using the coinbase TxOut as part of their new transaction.

The related data structures:

type TxOut struct {
	Address string	//receiving public key
	Amount int		//amount of coin units to send/receive
}

type TxIn struct {
	TxId      string	//id of the transaction inside which this TxIn should be found (in the list of TxOut)
	TxIdx     int		//index of TxOut this refers to inside the transaction
}

type UnspentTxOut struct {
	TxId    string	//transaction id
	TxIdx   int		//index of txout in transaction
	Address string  //public key of owner
	Amount  int		//amount coin units that was sent/received
}

//transaction is from a single person/entity
//inputs from that entity use funds, outputs show how much and where
//all the time there should be some list kept and updated based on this
type Transaction struct {
	Id string
	Sender string //the address/public key of the sender
	Signature string	//signature including all txin and txout. in this case we sign Transaction.Id since that already has all the TxIn and TxOut
	TxIns []TxIn
	TxOuts []TxOut
}

Each block contains a set of transactions, which contain a set of TxIn and TxOut.

  • The TxOut defines the address (an encoding of the recipient’s public key) of the recipient, and the amount of coins to send.
  • The TxIn defines a reference to a previous TxOut. So an identifier to find the transaction which contains it, and the index of the TxOut within that transaction.
  • The transaction itself contains the address (public key) of the sender. Since each TxIn and TxOut is contained in the transaction, and references that transaction, the owner can be checked against the address of the sender in the transaction. So the recipient in the “spent” TxOut should match the sender of this new transaction.
  • A prespecified way of calculating the transaction id is defined in the coin specification. This is then used to create and encode a hash value for the transaction id. I guess most real coins would use some form of Merkle-trees as linked high above. In this case the naivecoin simply adds the information on the sender and contained TxIn and TxOut instances into a long string and hashes that.
  • The signature is used to sign the transaction id (already containing all the transaction data in the hash) with the sender private key, using ECDSA signatures, as described in my previous post. The authenticity of the transaction and all its contents can then be verified by using the public key of the sender as embedded in the sender address to verify the signature. Brilliant stuff.

In pictures this looks something like this:

tx2

In the above example (in the figure above), Bob is sending 550 coins to Alice. To do this, Bob needs TxIn’s that sum up to minimum of 550. He has a set that sums up to 600. So these are used as the TxIn’s in this transaction. As 600 is more than 550, there are two TxOut’s created. One assigns the 550 coins to Alice, and the other 50 coins back to Bob. The blockchain tracks coin amounts using the TxOut instances stored within transactions, so it can only “spend” a full TxOut. This is why the “change” (50 coins in this example) generates a completely new TxOut.

The wording in the figure above for “deleting” inputs simply refers to the TxIn’s being marked as used, so they cannot be used again. The previous TxOut they refer to (from previous transactions that gave coins to Bob) are thus marked as used. As the wallet application and blockchain nodes track which TxOut are unused (not referenced by any accepted TxIn), the balance of a user can be counted as the sum of their unspent TxOut.

The wording for “creating” new TxOut in the figure above refers to adding new TxOut to the list of unspent TxOut. In this case it adds one for Bob (50 coins) and one for Alice (550 coins).

The code higher above shows the structure of TxOut, TxIn, and UnspentTxOut. The TxId field ties each TxOut and TxIn to the specific transaction where the associated TxOut was created. So a TxIn also refers to a TxOut, but serves as a way to mark that TxOut as “spent”. The reference consists of the transaction id (each transaction in the blockchain having a unique id) and TxIdx, the index of the TxOut within that transaction. Since a transaction can have multiple TxOut, the index identifies exactly which one is meant.

Each TxOut is assigned to a specific address, and the address holds the recipient’s public key. Each transaction is signed using the sender’s private key, and this signature can be verified using the signer’s public key. Thus each TxIn can be verified to be spendable by the person who signed the transaction containing the TxIn. Following the TxIn reference to the TxOut it refers to, and looking up the address it was assigned to, gives the public key. By checking that this public key matches the private key used to sign the new transaction wanting to spend that TxIn, it can be verified that the person creating the transaction is in possession of the private key associated with that public key.

Proper transaction verification should check that all TxIn match the transaction signature, that there is a sufficient number of TxIn, and that the TxIn total sum matches the total sum of TxOut. Probably much more too, but that is a good start.

Much confusing all this is. Hope you, dear reader, if you exist, find more information to understand my ramblings and to correct me.

After the transaction in the figure above, the end result looks something like this:

tx3

With all this in place, we can write the code to send coins:

func sendCoins(privKey *ecdsa.PrivateKey, to string, count int) Transaction {
	from := encodePublicKey(privKey.PublicKey)
	//TODO: error handling
	txIns, total := findTxInsFor(from, count)
	txOuts := splitTxIns(from, to, count, total)
	tx := createTx(privKey, txIns, txOuts)
	return tx
}

func findTxInsFor(address string, amount int) ([]TxIn, int) {
	balance := 0
	var unspents []TxIn
	for _, val := range unspentTxOuts {
		if val.Address == address {
			balance += val.Amount
			txIn := TxIn{val.TxId, val.TxIdx}
			unspents = append(unspents, txIn)
		}
		if balance >= amount {
			return unspents, balance
		}
	}
	return nil, -1
}

func splitTxIns(from string, to string, toSend int, total int) []TxOut {
	diff := total - toSend
	txOut := TxOut{to, toSend}
	var txOuts []TxOut
	txOuts = append(txOuts, txOut)
	if diff == 0 {
		return txOuts
	}
	txOut2 := TxOut{from, diff}
	txOuts = append(txOuts, txOut2)
	return txOuts
}

func createTx(privKey *ecdsa.PrivateKey, txIns []TxIn, txOuts []TxOut) Transaction {
	pubKey := encodePublicKey(privKey.PublicKey)
	tx := Transaction{"",  pubKey,"", txIns, txOuts}

	signTxIns(tx, privKey)

	tx.Id = calculateTxId(tx)
	tx.Signature = signData(privKey, []byte(tx.Id))
	return tx
}

The full code for my Go port of the naivecoin tutorial (first parts) is on my Github.

There are various other concepts I would still like to look into. Ways to effectively search the whole blockchain and all transactions in it (to validate transactions, build block explorers, etc.) would be interesting. I understand some type of embedded databases may be used. And endlessly argued about. Ah, just when you thought the programming language flame-wars of the past are no longer, you find it all again. Much stability in argument, such peace it brings. In my Rock’n chair I kek at all things past and present. Sorry, just trying to learn to talk like the internet and crypto people do. On a lighter note, other topics of interest to look into include robust peer-to-peer connections, ring signatures, proof of stake, etc.

That about sums it up for my experiment for now. Do let me know what are all the things I got wrong.

Trying to learn ECDSA and GoLang

Recently I have been looking at the Naivecoin tutorial, and trying to implement it in Go to get an idea of how blockchains really work, and to learn some more Go as well. The tutorial code is in Javascript, and translating it to Go has been mostly straightforward. However, porting the third part with transactions was giving me some issues. I had trouble figuring out how to port the signature part. This is the tutorial code in Javascript:

    const key = ec.keyFromPrivate(privateKey, 'hex');
    const signature: string = toHexString(key.sign(dataToSign).toDER());

This is nice and simple, just suitable for a high-level framework and tutorial. However, to implement it myself in Go, …

The JS code above takes the private key as a hex formatted string, and parses that into a Javascript PrivateKey object. This key object is then used to sign the “dataToSign”, the signature is formatted as something called “DER”, and the result is formatted as a hex string. What does all that mean?

The tutorial refers to Elliptic Curve Digital Signature Algorithm (ECDSA). The data to sign in this case is the SHA256 hash of the transaction ID. So how to do this in Go? Go has an ecdsa package with private keys, public keys, and functions to sign and verify data. Sounds good. But the documentation is quite sparse, so how do I know how to properly use it?

To try it, I decided to first write a program in Java using ECDSA signatures, and use it to compare to the results of the Go version I would write. This way I would have another point of reference to compare my results to, and to understand if I did something wrong. I seemed to find more information about the Java implementation, and since I am more familiar with Java in general..

So first to generate the keys to use for signatures in Java:

	public static ECPrivateKey generateKey() throws Exception {
		KeyPairGenerator keyGen = KeyPairGenerator.getInstance("EC");
		SecureRandom random = SecureRandom.getInstance("SHA1PRNG");

		keyGen.initialize(256, random); //256 bit key size

		KeyPair pair = keyGen.generateKeyPair();
		ECPrivateKey priv = (ECPrivateKey) pair.getPrivate();
		PublicKey pub = pair.getPublic();

		//actually also need public key, but lets get to that later...
		return priv;
	}

The above code starts with getting an “EC” key-pair generator, EC referring to Elliptic Curve. Then it gets a secure random number generator instance, in this case one based on the SHA1 hash algorithm. Apparently this is fine, even if SHA1 is not recommended for everything these days. Not quite sure about the key size of 256 given, but maybe I have to look at that later.. First to get this working.

The “priv.getEncoded()” call (shown a bit further below) turns the private key into its standard encoded format as a byte array. Base64 encoding that gives a character representation to copy over to the Go version..

Next, to sign the data (or message, or whatever we want to sign..):

	public static byte[] signMsg(String msg, PrivateKey priv) throws Exception {
		Signature ecdsa = Signature.getInstance("SHA1withECDSA");

		ecdsa.initSign(priv);

		byte[] strByte = msg.getBytes("UTF-8");
		ecdsa.update(strByte);

		byte[] realSig = ecdsa.sign();

		System.out.println("Signature: " + new BigInteger(1, realSig).toString(16));

		return realSig;
	}

The above starts with getting a Java instance of the ECDSA signature algorithm, with type “SHA1withECDSA”. I spent a good moment wondering what all this means, to be able to copy the functionality into the Go version. So long story short, first the data is hashed with SHA1 and then this hash is signed with ECDSA. Finally, the code above prints the signature bytes as a hexadecimal string (byte array->BigInteger->base 16 string). I can then simply copy-paste this hex-string to Go to see if I can get signature verification to work in Go vs Java. Brilliant.

First I tried to see that I can get the signature verification to work in Java:

	private static boolean verifySignature(PublicKey pubKey,String msg, byte[] signature) throws Exception {
		byte[] message = msg.getBytes("UTF-8");
		Signature ecdsa = Signature.getInstance("SHA1withECDSA");
		ecdsa.initVerify(pubKey);
		ecdsa.update(message);
		return ecdsa.verify(signature);
	}

The code above takes the public key associated with the private key that was used to sign the data (called “msg” here). It creates the same type of ECDSA signature instance as the signature creation previously. This is used to verify the signature is valid for the given message (data). So signed with the private key, verified with the public key. And yes, it returns true for the signed message string, and false otherwise, so it works. So now knowing I got this to work, I can try the same in Go, using the signature, public key, and private key that was used in Java. But again, the question. How do I move these over?

Java seems to provide functions such as key.getEncoded(). This gives a byte array. We can then Base64 encode it to get a string (I believe Bitcoin etc. use Base58, but the idea is the same). So something like this:

		//https://stackoverflow.com/questions/5355466/converting-secret-key-into-a-string-and-vice-versa
		byte[] pubEncoded = pub.getEncoded();
		String encodedPublicKey = Base64.getEncoder().encodeToString(pubEncoded);
		String encodedPrivateKey = Base64.getEncoder().encodeToString(priv.getEncoded());
		System.out.println(encodedPrivateKey);
		System.out.println(encodedPublicKey);

Maybe I could then take the output I just printed, and decode that into the key in Go? But what is the encoding? Well, the JDK docs say getEncoded() “Returns the key in its primary encoding format”. And what might that be? Well, some internet searching and debugger runs later, I came up with this (which works to re-create the keys in Java):

	public static PrivateKey base64ToPrivateKey(String encodedKey) throws Exception {
		byte[] decodedKey = Base64.getDecoder().decode(encodedKey);
		PKCS8EncodedKeySpec spec = new PKCS8EncodedKeySpec(decodedKey);
		KeyFactory factory = KeyFactory.getInstance("EC");
		PrivateKey privateKey = factory.generatePrivate(spec);
		return privateKey;
	}

	public static PublicKey base64ToPublicKey(String encodedKey) throws Exception {
		byte[] decodedKey = Base64.getDecoder().decode(encodedKey);
		X509EncodedKeySpec spec = new X509EncodedKeySpec(decodedKey);
		KeyFactory factory = KeyFactory.getInstance("EC");
		PublicKey publicKey = factory.generatePublic(spec);
		return publicKey;
	}

So the JDK encodes the private key in PKCS8 format, and the public key in some kind of X509 format. X509 seems to be related to certificates, and PKCS refers to “Public Key Cryptography Standards”, of which there are several. Both of these seem a bit complicated, as I was just looking to transfer the keys over. Since people can post those online for various crypto tools as short strings, it cannot be that difficult, can it?

I tried to look for ways to take PKCS8 and X509 data into Go and transform those into private and public keys. Did not get me too far with that. Instead, I figured there must be only a small part of the keys that is needed to reproduce them.

So I found that the private key has a single large number that is the important bit, and the public key can be calculated from the private key. And the public key in itself consists of two parameters, the x and y coordinates of a point (I assume on the elliptic curve). I browsed all over the internet trying to figure this all out, but did not keep records of all the sites I visited, so my references are kind of lost. However, here is one description that just so states the integer and point part. Anyway, please let me know of any good references for a non-mathematician like me to understand it if you have any.

To get the private key value into suitable format to pass around in Java:

	//https://stackoverflow.com/questions/40552688/generating-a-ecdsa-private-key-in-bouncy-castle-returns-a-public-key
	private static String getPrivateKeyAsHex(PrivateKey privateKey) {
		ECPrivateKey ecPrivateKey = (ECPrivateKey) privateKey;
		byte[] privateKeyBytes = ecPrivateKey.getS().toByteArray();
		String hex = bytesToHex(privateKeyBytes);
		return hex;
	}

The “hex” string in the above code is the big integer value that forms the basis of the private key. This can now be passed, backed up, or whatever we desire. Of course, it should be kept private so no posting it on the internet.

For the public key:

	private static String getPublicKeyAsHex(PublicKey publicKey) {
		ECPublicKey ecPublicKey = (ECPublicKey) publicKey;
		ECPoint ecPoint = ecPublicKey.getW();

		byte[] affineXBytes = ecPoint.getAffineX().toByteArray();
		byte[] affineYBytes = ecPoint.getAffineY().toByteArray();

		String hexX = bytesToHex(affineXBytes);
		String hexY = bytesToHex(affineYBytes);

		return hexX+":"+hexY;
	}

The above code takes the X and Y coordinates that make up the public key, combines them, and thus forms a single string that can be passed to get the X and Y for public key. A more sensible option would likely just create a single byte array with the length of the first part as first byte or two. Something like [byte count for X][bytes of X][bytes of Y]. But the string concatenation works for my simple example to try to understand it.

And then there is one more thing that needs to be encoded and passed between the implementations, which is the signature. Far above, I wrote the “signMsg()” method to build the signature. I also printed the signature bytes out as a hex-string. But what format is the signature in, and how do you translate it to another platform and verify it? It turns out Java gives the signatures in ASN.1 format. There is a good description of the format here. It’s not too complicated but how would I import that into Go again? I did not find any mention of this in the ECDSA package for Go. By searching with ASN.1 I did finally find an ASN.1 package for Go. But is there a way to do that without these (poorly documented) encodings?

Well, it turns out that ECDSA signatures can also be described by using just two large integers, which I refer to here as R and S. To get these in Java:

	public static byte[] signMsg(String msg, PrivateKey priv) throws Exception {
		Signature ecdsa = Signature.getInstance("SHA1withECDSA");

		ecdsa.initSign(priv);

		byte[] strByte = msg.getBytes("UTF-8");
		ecdsa.update(strByte);

		byte[] realSig = ecdsa.sign();

		System.out.println("R: "+extractR(realSig));
		System.out.println("S: "+extractS(realSig));

		return realSig;
	}

	//https://stackoverflow.com/questions/48783809/ecdsa-sign-with-bouncycastle-and-verify-with-crypto
	public static BigInteger extractR(byte[] signature) throws Exception {
		int startR = (signature[1] & 0x80) != 0 ? 3 : 2;
		int lengthR = signature[startR + 1];
		return new BigInteger(Arrays.copyOfRange(signature, startR + 2, startR + 2 + lengthR));
	}

	public static BigInteger extractS(byte[] signature) throws Exception {
		int startR = (signature[1] & 0x80) != 0 ? 3 : 2;
		int lengthR = signature[startR + 1];
		int startS = startR + 2 + lengthR;
		int lengthS = signature[startS + 1];
		return new BigInteger(Arrays.copyOfRange(signature, startS + 2, startS + 2 + lengthS));
	}

The above code takes the byte array of the signature, and parses the R and S from it, matching the ASN.1 specification I linked above. With that, another alternative is again to just turn the R and S into hex strings or Base58 encoded strings, combine them into a single byte array and hex or Base58 encode that, or whatever. Just those two values need to be passed to capture the signature.

Now, finally to parse all this data in Go and to verify the signature. First to get the private key from the hex-string:

	func hexToPrivateKey(hexStr string)  *ecdsa.PrivateKey {
		bytes, err := hex.DecodeString(hexStr)
		print(err)

		k := new(big.Int)
		k.SetBytes(bytes)

		priv := new(ecdsa.PrivateKey)
		curve := elliptic.P256()
		priv.PublicKey.Curve = curve
		priv.D = k
		priv.PublicKey.X, priv.PublicKey.Y = curve.ScalarBaseMult(k.Bytes())
		//this print can be used to verify if we got the same parameters as in Java version
		fmt.Printf("X: %d, Y: %d", priv.PublicKey.X, priv.PublicKey.Y)
		println()

		return priv
	}

The above code takes the hex-string, parses it into a byte array, creates a Go big integer from that, and sets the result as the value of the private key. The other part that is needed is the elliptic curve definition. In practice, one of a predefined set of curves is usually used, and the same curve is used for a specific purpose. So it can be defined as a constant, whichever is selected for the blockchain. In this case it is always defined as the P256 curve, both in the Java and Go versions. Bitcoin, for example, uses the Secp256k1 curve. So I just set the curve and the big integer to create the private key. The public key (X and Y parameters) is calculated here from the private key, by multiplying the curve base point with the private key’s big integer (ScalarBaseMult).

To build the public key straight from the X and Y values passed in as hex-strings:

	func hexToPublicKey(xHex string, yHex string) *ecdsa.PublicKey {
		xBytes, _ := hex.DecodeString(xHex)
		x := new(big.Int)
		x.SetBytes(xBytes)

		yBytes, _ := hex.DecodeString(yHex)
		y := new(big.Int)
		y.SetBytes(yBytes)

		pub := new(ecdsa.PublicKey)
		pub.X = x
		pub.Y = y

		pub.Curve = elliptic.P256()

		return pub
	}

Again, Base58 or similar would likely be a more efficient representation. The above code allows passing around just the public key and not the private key, which is how it should be done: the X and Y parameters are passed, and the curve is defined as a constant choice.

To create and verify the signature from the passed values:

	type ecdsaSignature struct {
		R, S *big.Int
	}

	func verifyMySig(pub *ecdsa.PublicKey, msg string, sig []byte) bool {
		//https://github.com/gtank/cryptopasta/blob/master/sign.go
		digest := sha1.Sum([]byte(msg))

		var esig ecdsaSignature
		asn1.Unmarshal(sig, &esig)
		//we can use these prints to compare to what we had in Java...
		fmt.Printf("R: %d , S: %d", esig.R, esig.S)
		println()
		return ecdsa.Verify(pub, digest[:], esig.R, esig.S)
	}

The above version reads the actual ASN.1 encoded signature that is produced by the Java default signature encoding. To get the functionality matching the Java “SHA1withECDSA” algorithm, I first have to hash the input data with SHA1 as done here. Since the Java version is a bit of a black box with just that string definition, I spent a good moment wondering about that. I would guess the same approach would apply for other choices such as “SHA256withECDSA” by just replacing the hash function with another. Alternatively, I can also just pass in directly the R and S values of the signature:

	func verifyMySig(pub *ecdsa.PublicKey, msg string, sig []byte) bool {
		//https://github.com/gtank/cryptopasta/blob/master/sign.go
		digest := sha1.Sum([]byte(msg))

		var esig ecdsaSignature
		//R and S must be initialized before SetString can be called on them
		esig.R = new(big.Int)
		esig.S = new(big.Int)
		esig.R.SetString("89498588918986623250776516710529930937349633484023489594523498325650057801271", 10)
		esig.S.SetString("67852785826834317523806560409094108489491289922250506276160316152060290646810", 10)
		fmt.Printf("R: %d , S: %d", esig.R, esig.S)
		println()
		return ecdsa.Verify(pub, digest[:], esig.R, esig.S)
	}

So in the above, the R and S are set directly from the two decimal number strings. Normally these would be encoded more efficiently and given as parameters, but this works to demonstrate. The two long strings are the integers for the R and S that I printed out in the Java version.

Strangely, printing the R and S using the ASN.1 and the direct passing of the numbers gives a different value for R and S. Which is a bit odd. But they both verify the signature fine. I read somewhere that some transformations can be done on the signature numbers while keeping it valid. Maybe this is done as part of the encoding or something? I have no idea. But it works. Much trust such crypto skills I have.

func TestSigning(t *testing.T) {
	xHexStr := "4bc55d002653ffdbb53666a2424d0a223117c626b19acef89eefe9b3a6cfd0eb"
	yHexStr := "d8308953748596536b37e4b10ab0d247f6ee50336a1c5f9dc13e3c1bb0435727"
	ePubKey = hexToPublicKey(xHexStr, yHexStr)

	sig := "3045022071f06054f450f808aa53294d34f76afd288a23749628cc58add828e8b8f2b742022100f82dcb51cc63b29f4f8b0b838c6546be228ba11a7c23dc102c6d9dcba11a8ff2"
	sigHex, _ := hex.DecodeString(sig)
	ok := verifyMySig(ePubKey, "This is string to sign", sigHex)
	println(ok)
}

And finally, it works! Great ūüôā

Playing with Pairwise Testing and PICT

A while back, I was doing some lectures on advanced software testing technologies. One topic was combinatorial testing. Looking at the materials, there are good and free tools out there to generate tests to cover various combinations. Still, I don’t see many people use them, and the materials out there don’t seem too great.

Combinatorial testing here refers to having 2-way, 3-way, up to N-way (sometimes they seem to call it t-way…) combinations of data values in different test cases. 2-way is also called pairwise testing. This simply refers to all pairs of data values appearing in different test cases. For example, if one test uses values “A” and “B”, and another uses a combination of “A” and “C”, you would have covered the pairs A+B and A+C but not B+C. With large numbers of potential values, the set of potential combinations can grow pretty huge, so finding a minimal set to cover all combinations can be very useful.
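
To illustrate the pair coverage idea before getting to PICT, here is a small sketch (not related to the PICT tool itself) that lists which parameter-value pairs two hand-written test cases cover:

from itertools import combinations

#two hypothetical test cases, each assigning one value per parameter
tests = [
    {"Format method": "quick", "File system": "FAT", "Compression": "on"},
    {"Format method": "quick", "File system": "NTFS", "Compression": "off"},
]

covered = set()
for test in tests:
    #every pair of (parameter, value) assignments in a test case is one covered 2-way combination
    for pair in combinations(sorted(test.items()), 2):
        covered.add(pair)

for pair in sorted(covered):
    print(pair)

#e.g. quick+FAT and quick+NTFS are covered, but FAT+off is not covered by these two tests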

The benefits

There is a nice graph over at NIST, including a PDF with a broader description. Basically these show that 2-way and 3-way combinations already show very high gains in finding defects over considering coverage of single variables alone. Of course, things get a bit more complicated when you need to find all relevant variables in the program control flow, how to define what you can combine, all the constraints, etc. Maybe later. Now I just wanted to try the combinatorial test generation.

Do Not. Try. Bad Yoda Joke. Do Try.

So I gave combinatorial test generation a go, using the nice and freely available PICT tool from Microsoft Research. It even compiles on different platforms, not just Windows. Or so they say on their Github.

Unexpectedly, compiling and getting PICT to run on my OSX was quite simple. Just “make” and “make test” as suggested on the main Github page. Probably I had most dependencies already from before, but anyway, it was surprisingly easy.

I made “mymodels” and “myoutputs” directories under the directory where I cloned the git repository and compiled the code, just so I could keep some order to my stuffs. This is why the following example commands work..

I started with the first example on PICT documentation page. The model looks like this:

Type:          Primary, Logical, Single, Span, Stripe, Mirror, RAID-5
Size:          10, 100, 500, 1000, 5000, 10000, 40000
Format method: quick, slow
File system:   FAT, FAT32, NTFS
Cluster size:  512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
Compression:   on, off

Running the tool and getting some output is actually simpler than I expected:

./pict mymodels/example1.pict >myoutputs/example1.txt

PICT prints the list of generated test value combinations to the standard output. Which generally just translates to printing a bunch of lines on the console/screen. To save the generated values, I just pipe the output to myoutputs/example1.txt, as shown above. In this case, the output looks like this:

Type	Size	Format method	File system	Cluster size	Compression
Stripe	100	quick	FAT32	1024	on
Logical	10000	slow	NTFS	512	off
Primary	500	quick	FAT	65536	off
Span	10000	slow	FAT	16384	on
Logical	40000	quick	FAT32	16384	off
Span	1000	quick	NTFS	512	on
Span	10	slow	FAT32	32768	off
Stripe	5000	slow	NTFS	32768	on
RAID-5	500	slow	FAT	32768	on
Mirror	1000	quick	FAT	32768	off
Single	10	quick	NTFS	4096	on
RAID-5	100	slow	FAT32	4096	off
Mirror	100	slow	NTFS	65536	on
RAID-5	40000	quick	NTFS	2048	on
Stripe	5000	quick	FAT	4096	off
Primary	40000	slow	FAT	8192	on
Mirror	10	quick	FAT32	8192	off
Span	500	slow	FAT	1024	off
Single	1000	slow	FAT32	2048	off
Stripe	500	quick	NTFS	16384	on
Logical	10	quick	FAT	2048	on
Stripe	10000	quick	FAT32	512	off
Mirror	500	quick	FAT32	2048	on
Primary	10	slow	FAT32	16384	on
Single	10	quick	FAT	512	off
Single	10000	quick	FAT32	65536	off
Primary	40000	quick	NTFS	32768	on
Single	100	quick	FAT	8192	on
Span	5000	slow	FAT32	2048	on
Single	5000	quick	NTFS	16384	off
Logical	500	quick	NTFS	8192	off
RAID-5	5000	quick	NTFS	1024	on
Primary	1000	slow	FAT	1024	on
RAID-5	10000	slow	NTFS	8192	on
Logical	100	quick	NTFS	32768	off
Primary	10000	slow	FAT	32768	on
Stripe	40000	quick	FAT32	65536	on
Span	40000	quick	FAT	4096	on
Stripe	1000	quick	FAT	8192	off
Logical	1000	slow	FAT	4096	off
Primary	100	quick	FAT	2048	off
Single	40000	quick	FAT	1024	off
RAID-5	1000	quick	FAT	16384	on
Single	500	quick	FAT32	512	off
Stripe	10	quick	NTFS	2048	off
Primary	100	quick	NTFS	512	off
Logical	10000	slow	NTFS	1024	off
Mirror	5000	quick	FAT	512	on
Logical	5000	slow	NTFS	65536	off
Mirror	10000	slow	FAT	2048	off
RAID-5	10	slow	FAT32	65536	off
Span	100	quick	FAT	65536	on
Single	5000	quick	FAT	32768	on
Span	1000	quick	NTFS	65536	off
Primary	500	slow	FAT32	4096	off
Mirror	40000	slow	FAT32	4096	off
Mirror	10	slow	FAT32	1024	off
Logical	10000	quick	FAT	4096	off
Span	5000	slow	FAT	8192	off
RAID-5	40000	quick	FAT32	512	on
Primary	5000	quick	NTFS	1024	off
Mirror	100	slow	FAT32	16384	off

The first line is the header, and values/columns are separated by tabulator characters (tabs).

The output above is 62 generated combinations/test cases as evidenced by:

wc -l myoutputs/example1.txt 
      63 myoutputs/example1.txt

(wc -l counts lines, and the first line is the header, so I subtract 1)
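
Since the output is plain tab-separated text, it is easy to read the generated combinations into test code or scripts. A minimal sketch of doing that in Python (not part of PICT itself):

import csv

#each row in the PICT output is one generated test case
with open("myoutputs/example1.txt") as f:
    test_cases = list(csv.DictReader(f, delimiter="\t"))

print(len(test_cases))  #62, since the header line is consumed by DictReader
print(test_cases[0])    #e.g. {'Type': 'Stripe', 'Size': '100', ...}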

To produce all 3-way combinations with PICT, the syntax is:

./pict mymodels/example1.pict >myoutputs/example1.txt /o:3

which generates 392 combinations/test cases:

wc -l myoutputs/example1.txt 
      393 myoutputs/example1.txt

I find the PICT command-line syntax a bit odd, as the options come after the model file and are identified by these Windows-style switches like “/o:”. But it works, so great.

Constraints

Of course, not all combinations are always valid. So PICT has extensive support to define constraints on the generator model, to limit what kind of combinations PICT generates. The PICT documentation page has lots of good examples. This part actually seems nicely documented. But let’s try a few just to see what happens. The basic example from the page:

Type:           Primary, Logical, Single, Span, Stripe, Mirror, RAID-5
Size:           10, 100, 500, 1000, 5000, 10000, 40000
Format method:  quick, slow
File system:    FAT, FAT32, NTFS
Cluster size:   512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
Compression:    on, off

IF [File system] = "FAT"   THEN [Size] <= 4096;
IF [File system] = "FAT32" THEN [Size] <= 32000;

./pict mymodels/example2.pict >myoutputs/example2.txt

wc -l myoutputs/example2.txt 
      63 myoutputs/example2.txt

So the same number of tests. The contents:

Type	Size	Format method	File system	Cluster size	Compression
Stripe	500	slow	NTFS	1024	on
Primary	500	quick	FAT32	512	off
Single	10	slow	FAT	1024	off
Single	5000	quick	FAT32	32768	on
Span	40000	quick	NTFS	16384	off
Mirror	40000	slow	NTFS	512	on
RAID-5	100	quick	FAT	8192	on
Logical	500	slow	FAT	2048	off
Span	10000	slow	FAT32	1024	on
Logical	1000	slow	FAT32	16384	on
Span	1000	quick	FAT	512	off
Primary	10	quick	NTFS	1024	on
Mirror	1000	quick	NTFS	4096	off
RAID-5	40000	slow	NTFS	1024	off
Single	40000	slow	NTFS	8192	off
Stripe	10	slow	FAT32	4096	on
Stripe	40000	quick	NTFS	2048	on
Primary	100	slow	NTFS	32768	off
Stripe	500	quick	FAT	16384	off
RAID-5	1000	quick	FAT32	2048	off
Mirror	10	quick	FAT	65536	off
Logical	40000	quick	NTFS	4096	on
RAID-5	5000	slow	NTFS	512	off
Stripe	5000	slow	FAT32	65536	on
Span	10	quick	FAT32	2048	off
Logical	10000	quick	NTFS	65536	off
Primary	1000	slow	FAT	65536	off
Mirror	500	quick	FAT	32768	on
Single	100	quick	FAT32	512	on
Mirror	5000	slow	FAT32	2048	on
Mirror	100	quick	NTFS	2048	on
Logical	5000	quick	FAT32	8192	off
Logical	100	slow	FAT32	1024	on
Primary	100	quick	FAT32	16384	off
Primary	10000	quick	FAT32	2048	on
RAID-5	10	slow	FAT	32768	off
Mirror	10	quick	FAT	16384	on
Single	500	slow	FAT	4096	on
Span	500	slow	FAT32	8192	on
Stripe	10000	quick	FAT32	32768	off
Logical	1000	slow	NTFS	32768	on
Single	10000	slow	NTFS	16384	off
Span	100	slow	FAT32	4096	on
Stripe	1000	slow	NTFS	8192	on
Span	5000	quick	NTFS	32768	on
Primary	5000	slow	FAT32	4096	off
RAID-5	100	slow	FAT	65536	off
RAID-5	10000	slow	FAT32	4096	on
Single	1000	quick	FAT	1024	on
Mirror	10	quick	FAT	1024	on
Logical	5000	slow	FAT32	1024	off
Single	500	slow	FAT32	65536	off
Logical	10	quick	NTFS	512	on
Single	1000	slow	FAT	2048	off
Mirror	10000	quick	NTFS	8192	on
Primary	10	quick	FAT32	8192	on
Primary	40000	slow	NTFS	32768	off
Stripe	100	slow	FAT	512	off
Mirror	10000	slow	FAT32	512	on
RAID-5	5000	quick	NTFS	16384	off
Span	40000	quick	NTFS	65536	on
RAID-5	500	quick	FAT	4096	on

Comparing the “Size” column with the “File system” column, the “FAT” file system now always has a size of at most 4096 (i.e., 10, 100, 500, or 1000). So the constraint works as expected. I have to admit, I found the value 4096 confusing here at first, since there is no 4096 option in the input model for “Size”, but there is for “Cluster size”. I was looking at the wrong column initially, wondering why the constraint was not working. But it works, just a slightly confusing example.

Similarly, 3-way combinations produce the same number of tests (as they did without any constraints) even with these constraints:

./pict mymodels/example2.pict >myoutputs/example2.txt /o:3

wc -l myoutputs/example2.txt 
     393 myoutputs/example2.txt

To experiment a bit more, I set a limit on FAT size to be 100 or less:

Type:           Primary, Logical, Single, Span, Stripe, Mirror, RAID-5
Size:           10, 100, 500, 1000, 5000, 10000, 40000
Format method:  quick, slow
File system:    FAT, FAT32, NTFS
Cluster size:   512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
Compression:    on, off

IF [File system] = "FAT"   THEN [Size] <= 100;
IF [File system] = "FAT32" THEN [Size] <= 32000;

./pict mymodels/example3.pict >myoutputs/example3.txt

wc -l myoutputs/example3.txt 
      62 myoutputs/example3.txt

./pict mymodels/example3.pict >myoutputs/example3.txt /o:3
wc -l myoutputs/example3.txt 
     397 myoutputs/example3.txt

What happened here?

Running the 2-way generator produces 61 tests. So the number of combinations generated was finally reduced, by one, with the tightened constraint.

Running the 3-way generator produces 396 tests. So the number of tests/combinations generated was increased by 4 compared to the 3-way generator without this constraint. Which is odd, as I would expect the number of tests to go down when there are fewer valid options. In fact, you would think you could get a set no bigger than before just by taking the 392 tests from the previous, less constrained run, and for every line with “FAT” as the “File system” and a “Size” bigger than 100, replacing the size with 100 or 10. That would be at most 392 tests, as before.

My guess is that this is because building a set of inputs to cover all requested combinations is a very hard problem. I believe in computer science this would be called an NP-hard problem (or so I gather from the academic literature on combinatorial testing, even if they call the test set a “covering array” and use other academic terminology). So no known algorithm produces the optimal result in general. The generator also has to accommodate all the possible constraints, and its heuristics end up producing a slightly bigger set here. It is still likely a very nicely optimized set. Glad it’s not me having to write those algorithms :). I just use them and complain :).

PICT has a bunch of other ways to define conditional constraints with the use of IF, THEN, ELSE, NOT, OR, AND statements. The docs cover that nicely. So let’s not go there.

The Naming Trick

Something I found interesting is a way to build models by naming different items separately, and constraining them separately:

#
# Machine 1
#
OS_1:   Win7, Win8, Win10
SKU_1:  Home, Pro
LANG_1: English, Spanish, Chinese

#
# Machine 2
#
OS_2:   Win7, Win8, Win10
SKU_2:  Home, Pro
LANG_2: English, Spanish, Chinese, Hindi

IF [LANG_1] = [LANG_2]
THEN [OS_1] <> [OS_2] AND [SKU_1] <> [SKU_2];

Here we have two items (“machines”) with the same three properties (“OS”, “SKU”, “LANG”). However, by numbering the properties, the generator sees them as different. From this, the generator can now build combinations of different two-machine configurations, using just the basic syntax and no need to tweak the generator itself. The only difference between the two is that “Machine 2” can have one additional language (“Hindi”).

The constraint at the end also nicely ensures that if the generated configurations have the same language, the OS and SKU should be different.

Scaling these “machine” combinations to a large number of machines would require a different type of approach, since it is doubtful anyone would want to write a model with 100 machines, each labelled separately. I have no idea what modelling approach would be best for that, and right now I don’t have a big need for it, so I am not going there. Maybe the generator could produce a more abstract set of combinations that is then mapped to a large number of “machines” somehow.
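
One scripting workaround I can imagine (my own idea, not a PICT feature) is to generate the model text itself, so the repeated “machine” blocks are written by a small script rather than by hand. A sketch, with the file name and property lists as placeholders:

# Sketch: write a PICT model with N "machines", each with the same properties.
N = 5
props = {
    "OS":   "Win7, Win8, Win10",
    "SKU":  "Home, Pro",
    "LANG": "English, Spanish, Chinese",
}

lines = []
for i in range(1, N + 1):
    lines.append(f"#\n# Machine {i}\n#")
    for name, values in props.items():
        lines.append(f"{name}_{i}: {values}")
    lines.append("")

with open("mymodels/machines.pict", "w") as f:
    f.write("\n".join(lines))

The constraints would still need some thought per case, but at least the repetitive parts are generated.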

Repetition and Value References

There is quite a bit of repetition in the above model with both machines repeating all the same parameter values. PICT has a way to address this by referencing values defined for other parameters:

#
# Machine 1
#
OS_1:   Win7, Win8, Win10
SKU_1:  Home, Pro
LANG_1: English, Spanish, Chinese

#
# Machine 2
#
OS_2:   <OS_1>
SKU_2:  <SKU_1>
LANG_2: <LANG_1>, Hindi

So in this case, “machine 2” reuses the values from “machine 1”, and changing them in “machine 1” also changes them in “machine 2”. Sometimes that is good, other times maybe not, because changing one thing changes many, and you might not remember that every time. On the other hand, you would not want to manually update all items with the same info every time either. A nice feature to have if you need it.

Data Types

With regard to value types, PICT supports numbers and strings. This is the example model given in the docs:

Size:  1, 2, 3, 4, 5
Value: a, b, c, d

IF [Size] > 3 THEN [Value] > "b";

I guess the two types exist so that you can define different kinds of constraints on them. For example, [Size] > 3 makes obvious sense. The [Value] > “b” part makes a bit less sense at first glance. So let’s try that:

./pict mymodels/example4.pict >myoutputs/example4.txt

wc -l myoutputs/example4.txt 
      17 myoutputs/example4.txt

The output looks like this:

Size	Value
3	a
2	c
1	c
2	b
2	a
1	d
1	a
3	b
4	d
2	d
3	d
1	b
5	c
3	c
4	c
5	d

And here, if “Size” equals 4 or 5 (so is > 3), “Value” is always “c” or “d”. The PICT docs state “String comparison is lexicographical and case-insensitive by default”. So > “b” just refers to letters coming after “b”, which in this model means “c” and “d”. It seems a bit odd to define such comparisons against text in a model, but I guess it can help make a model more readable if you can represent values as numbers or strings, and define constraints on them in a similar way.

To verify, I try a slightly modified model:
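Judging by the results below, the modification was presumably just lowering the Size threshold in the constraint from 3 to 2, so something along these lines:

Size:  1, 2, 3, 4, 5
Value: a, b, c, d

IF [Size] > 2 THEN [Value] > "b";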

./pict mymodels/example4.pict >myoutputs/example4.txt

wc -l myoutputs/example4.txt 
      13 myoutputs/example4.txt

So the number of tests is reduced from 16 to 12, with the following output:

Size	Value
5	c
2	c
1	d
4	d
1	b
4	c
3	d
3	c
2	d
1	c
1	a
5	d

Which confirms that lines (tests) with Size > 2 now have only letters “c” or “d” in them. This naturally also limits the number of available combinations, hence the reduced test set.

Extra Features

There are some further features that are nicely explained in the PICT docs:

  • Submodels: Refers to defining levels of combinations per test. For example, 2-way combinations of OS with all others, and 3-way combination of File System Type with all others, at the same time.
  • Aliasing: You can give the same parameter several names, and all are treated the same. Not sure why you would want to do that, but anyway.
  • Weighting: Since the full set of combinations will have more of some values anyway, this can be used to set a preference for specific ones.

Negative Testing / Erroneous Values

A few more interesting ones are “negative testing” and “seeding”. So first, negative testing. Negative testing refers to having a set of mutually exclusive values, which should never appear together in a test. This is because each of them is expected to produce an error, and you want to make sure the error each one produces is visible and not “masked” (hidden) by some other erroneous value.

The example model from the PICT docs, with a small modification to give the two invalid values different names:

#
# Trivial model for SumSquareRoots
#

A: ~-1, 0, 1, 2
B: ~-2, 0, 1, 2

Running it, we get:

./pict mymodels/example5.pict >myoutputs/example5.txt

wc -l myoutputs/example5.txt 
      16 myoutputs/example5.txt
A	B
0	2
0	1
1	2
2	1
1	0
2	0
1	1
2	2
0	0
0	~-2
1	~-2
~-1	0
~-1	1
2	~-2
~-1	2

The negative values are prefixed with “~”, and the results show combinations of the two negative values with all possible values of the other variable. So if A is -1, it is combined with 0, 1, 2 for B. If B is -2, it is combined with 0, 1, 2 for A. But -1 and -2 are never paired, to avoid one “faulty” value masking the other. I find having the “~” added everywhere a bit distracting, but you can parse around it, so it is not a real issue.
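
For example, when reading the output into a script, a small sketch of stripping the “~” while remembering which values were the negative ones (file path from the run above, the rest is my own scaffolding):

import csv

# Read the PICT output and strip the "~" prefix from the invalid (negative) values,
# while remembering which columns in each test were the negative ones.
with open("myoutputs/example5.txt", newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

tests = []
for row in rows:
    negatives = [name for name, value in row.items() if value.startswith("~")]
    clean = {name: value.lstrip("~") for name, value in row.items()}
    tests.append((clean, negatives))

print(tests[-1])  # e.g. ({'A': '-1', 'B': '2'}, ['A'])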

Of course, there is nothing to stop us from defining -1 and -2 as regular values, and getting combinations of several “negative” values. Let’s try:

A: -1, 0, 1, 2
B: -2, 0, 1, 2
./pict mymodels/example6.pict >myoutputs/example6.txt
wc -l myoutputs/example6.txt 
      17 myoutputs/example6.txt
A	B
1	-2
2	0
1	0
-1	0
0	-2
2	1
-1	-2
0	0
1	2
-1	2
0	2
2	-2
1	1
-1	1
0	1
2	2

So there we go. This produced one test more than the previous run, and that extra test is the one where both negatives are present: the line with “-1” and “-2” together.

Overall, the “~” notation seems like just a way to keep a set of values from appearing together. Convenient, and a good way to optimize further when you have large models, big input spaces, slow tests, difficult problem reports, etc.

Seeding / Forcing Tests In

Seeding. When I hear seeding in test generation, I think about the seed value for a random number generator, because those are often used to help generate tests. With PICT it actually means you can predefine a set of combinations that need to be part of the final test set.

So let’s try with the first example model from above:

Type:          Primary, Logical, Single, Span, Stripe, Mirror, RAID-5
Size:          10, 100, 500, 1000, 5000, 10000, 40000
Format method: quick, slow
File system:   FAT, FAT32, NTFS
Cluster size:  512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
Compression:   on, off

The seed file should be in the same format as the output produced by PICT. Let’s say I want to try all types with all file systems, using the smallest size. So I try with this seed file (a small script to generate such a file follows after the listing):

Type	Size	Format method	File system	Cluster size	Compression
Primary	10		FAT32		on
Logical	10		FAT32		on
Single	10		FAT32		on
Span	10		FAT32		on
Stripe	10		FAT32		on
Mirror	10		FAT32		on
RAID-5	10		FAT32		on
Primary	10		FAT		on
Logical	10		FAT		on
Single	10		FAT		on
Span	10		FAT		on
Stripe	10		FAT		on
Mirror	10		FAT		on
RAID-5	10		FAT		on
Primary	10		NTFS		on
Logical	10		NTFS		on
Single	10		NTFS		on
Span	10		NTFS		on
Stripe	10		NTFS		on
Mirror	10		NTFS		on
RAID-5	10		NTFS		on
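
A seed file this repetitive could also be generated with a few lines of script rather than typed by hand. A sketch (my own scaffolding, assuming the tab-separated format above with empty cells for the columns left for PICT to fill):

# Sketch: generate the seed file above (all Type x File system combinations,
# smallest size, compression on), leaving the other columns empty for PICT to fill.
types = ["Primary", "Logical", "Single", "Span", "Stripe", "Mirror", "RAID-5"]
file_systems = ["FAT32", "FAT", "NTFS"]
header = ["Type", "Size", "Format method", "File system", "Cluster size", "Compression"]

with open("mymodels/example7.seed", "w") as f:
    f.write("\t".join(header) + "\n")
    for fs in file_systems:
        for t in types:
            f.write("\t".join([t, "10", "", fs, "", "on"]) + "\n")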

To run it:

./pict mymodels/example7.pict /e:mymodels/example7.seed >myoutputs/example7.txt
 wc -l myoutputs/example7.txt 
      73 myoutputs/example7.txt

So earlier, the initial model generated 62 combinations. With this seed file, the forced seeds add some repetition and the size goes up to 72. Still not that much bigger, which I guess shows something about how nice it is to have a combinatorial test tool optimize this type of test set for you.

The actual output:

Type	Size	Format method	File system	Cluster size	Compression
Primary	10	quick	FAT32	2048	on
Logical	10	slow	FAT32	16384	on
Single	10	slow	FAT32	65536	on
Span	10	quick	FAT32	1024	on
Stripe	10	quick	FAT32	8192	on
Mirror	10	quick	FAT32	512	on
RAID-5	10	slow	FAT32	32768	on
Primary	10	slow	FAT	4096	on
Logical	10	quick	FAT	1024	on
Single	10	quick	FAT	32768	on
Span	10	slow	FAT	512	on
Stripe	10	slow	FAT	16384	on
Mirror	10	slow	FAT	8192	on
RAID-5	10	slow	FAT	2048	on
Primary	10	quick	NTFS	65536	on
Logical	10	quick	NTFS	4096	on
Single	10	slow	NTFS	16384	on
Span	10	quick	NTFS	32768	on
Stripe	10	slow	NTFS	1024	on
Mirror	10	slow	NTFS	2048	on
RAID-5	10	quick	NTFS	512	on
Span	40000	slow	FAT	65536	off
Single	5000	quick	NTFS	8192	off
Mirror	1000	quick	FAT32	4096	off
Stripe	100	slow	FAT	32768	off
Primary	500	slow	FAT	512	off
Primary	40000	quick	NTFS	8192	on
Logical	10000	quick	NTFS	32768	off
RAID-5	40000	slow	FAT32	1024	off
Span	100	quick	NTFS	8192	on
Mirror	10000	slow	FAT32	16384	off
Logical	5000	slow	FAT	512	on
Primary	1000	slow	FAT	1024	on
Mirror	5000	quick	FAT32	1024	on
Logical	1000	quick	NTFS	32768	on
Single	40000	slow	FAT32	512	on
Stripe	40000	quick	FAT	16384	on
Logical	100	quick	FAT32	2048	off
Single	100	quick	FAT32	1024	off
Primary	5000	quick	NTFS	32768	off
Single	40000	slow	NTFS	2048	on
Logical	500	quick	FAT32	8192	on
Single	500	slow	NTFS	4096	on
Span	500	quick	FAT32	16384	on
Primary	100	quick	FAT32	512	off
Stripe	1000	slow	FAT32	2048	on
RAID-5	10000	quick	FAT	8192	on
Stripe	10000	slow	NTFS	512	off
Stripe	5000	quick	FAT	65536	on
Mirror	40000	slow	NTFS	32768	on
Primary	10000	quick	NTFS	1024	on
RAID-5	100	quick	FAT	16384	off
Mirror	500	quick	NTFS	1024	on
Single	1000	slow	FAT32	512	on
Span	100	slow	FAT32	4096	off
Span	5000	slow	NTFS	2048	on
RAID-5	40000	slow	FAT	4096	off
Span	1000	slow	FAT32	16384	on
Mirror	100	quick	FAT	65536	on
Single	10000	slow	FAT	4096	off
RAID-5	1000	slow	NTFS	65536	off
Span	10000	slow	NTFS	65536	on
Span	1000	slow	FAT32	8192	off
RAID-5	500	quick	NTFS	32768	off
Stripe	500	slow	FAT	2048	off
RAID-5	5000	slow	NTFS	16384	on
Stripe	5000	slow	FAT32	4096	off
Logical	10	slow	FAT	65536	off
RAID-5	10000	quick	NTFS	2048	on
Primary	1000	slow	FAT	16384	off
Logical	40000	quick	FAT32	8192	on
Primary	500	quick	FAT	65536	on

This output starts with the given seeds, and PICT has done its best to fill in the blanks with values that still minimize the number of tests while meeting the combinatorial coverage requirements.

Personal Thoughts

Searching for PICT and pairwise testing or combinatorial testing brings up a bunch of results and reasonably good articles on the topic. Maybe even more practice-oriented ones than for model-based testing. Maybe because it is simpler to apply, and thus easier to pick up and use in practice?

For example, this article has a few good points. One is to use an iterative process to build the input models, so as with everything else, not expecting to get it all perfectly right on the first try. Another is to consider invariants as test oracles: things that should always hold, such as two nodes in a distributed system never being in a conflicting state after an operation involving both. Of course, this would also apply to any other type of testing. The article seems to consider this from a hierarchical viewpoint as well, checking the strictest or most critical invariants first.

Another good point in that article is to use readable names for the values. I guess sometimes people use the PICT output as such, to define test configurations and the like for manual testing. I would maybe consider using it more as input for automated test execution, to define parameter values to cover. In such cases, it would be enough to give each value a short name such as “A”, “A1”, or “1”. But looking at the model and the output, it would then be difficult to tell which symbol maps to which value. Readable names are just as parseable for the computer, and much more so for the human expert.

Combining with Sequential Models

So this is all nice and shiny, but the examples are actually quite simple test scenarios. There are no complex dependencies between the parameters, no complex state that defines which parameters and values are available, and so on. It mostly revolves around which combinations of software or system configurations should be used in testing.

I have worked plenty with model-based testing myself (see OSMO), and have actually talked to some people who have combined combinatorial input generation with model-based testing. I can see how this could be interesting: identify a set of injection points for parameters and values in an MBT model, and use a combinatorial test data generator to build data sets for those injection points. Doing more of this in practice would likely reveal good insights on what works and what could be done to make the match even better. Maybe someday.

In any case, I am sure combinatorial test datasets would also work great with other types of sequences. I think this could make a very interesting and practical research topic. Again, maybe someday..

Bye Now

In general, this area seems to have great tools for the basic test generation, but it is missing in-depth experience reports and guides on how to apply them to more complex software, together with sequential test cases and test generators.

A simpler, yet still interesting, topic would be to integrate a PICT-type generator directly with the test environment: run the combinatorial generator during the test runs, and have it randomize the combinations in slightly different ways on different runs, while still maintaining the overall combinatorial coverage.
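
A rough sketch of what that could look like, wrapping the PICT call in a test setup step (the /r:<seed> option for randomized generation is how I read the PICT docs, so treat that part as an assumption):

import subprocess

# Sketch: run PICT during test setup and parse its output in memory.
# Model path and binary location are the ones used above; /r:<seed> is the
# randomization option as I understand it from the PICT docs.
def generate_tests(model="mymodels/example1.pict", order=2, seed=None):
    args = ["./pict", model, f"/o:{order}"]
    if seed is not None:
        args.append(f"/r:{seed}")
    output = subprocess.run(args, capture_output=True, text=True, check=True).stdout
    lines = output.strip().split("\n")
    header = lines[0].split("\t")
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]

# Different seed per run -> different concrete combinations, same coverage level.
tests = generate_tests(order=2, seed=42)
print(len(tests), tests[0])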

Finnish Topic Modelling

Previously I wrote about a few experiments I ran with topic-modelling. I briefly glossed over having some results for a set of Finnish text as an example of a smaller dataset. This is a bit deeper look into that..

I use two datasets: the Finnish Wikipedia dump, and the city of Oulu board minutes. The same ones I used before. Previously I covered topic modelling more generally, so I won’t go into too much detail here. To summarize, topic modelling algorithms (of which LDA, or Latent Dirichlet Allocation, is used here) find sets of words with different distributions over sets of documents. These sets of words are then called the “topics” discussed in those documents.

This post looks at how to use topic models for a different language (besides English) and what could one maybe do with the results.

Lemmatize (turn words into their base forms before use) or not? I choose to lemmatize for topic modelling. This seems to be the general consensus when looking up info on topic modelling, and in my experience it just gives better results, as each word then appears in only one form. I covered POS tagging previously, and I believe it would be useful to apply here as well, but I don’t. Mostly because it is not needed to test these concepts, and I find the results are good enough without adding POS tagging to the mix (which has its issues, as I discussed before). Simplicity is nice.

I used the Python Gensim package for building the topic models. As input, I used the Finnish Wikipedia text and the city of Oulu board minutes texts. I used my existing text extractor and lemmatizer for these (to get the raw text out of the HTML pages and PDF docs, and to baseform the words, as discussed in my previous posts). I dumped the lemmatized raw text into files using slight modifications of my previous Java code, and then read the documents from those files as input to Gensim in a Python script.
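
The Gensim side is quite short. A rough sketch of what I mean (the file paths are placeholders for my own dumps, one lemmatized document per file):

from gensim import corpora, models

# Each input file holds one lemmatized document, words separated by whitespace.
docs = []
for path in ["dumps/doc1.txt", "dumps/doc2.txt"]:   # ...one file per document
    with open(path, encoding="utf-8") as f:
        docs.append(f.read().split())

dictionary = corpora.Dictionary(docs)               # word <-> id mappings
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words per document

# 50 topics, a single pass over the corpus (as in the Wikipedia run below).
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=50, passes=1)

for topic in lda.show_topics(num_topics=10, num_words=10):
    print(topic)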

I started with the Finnish Wikipedia dump, using Gensim to provide 50 topics, with 1 pass over the corpus. First 10 topics that I got:

  • topic0=focus[19565] var[8893] liivi[7391] luku[6072] html[5451] murre[3868] verkkoversio[3657] alku[3313] joten[2734] http[2685]
  • topic1=viro[63337] substantiivi[20786] gen[19396] part[14778] taivutus[13692] tyyppi[6592] täysi[5804] taivutustyyppi[5356] liite[4270] rakenne[3227]
  • topic2=isku[27195] pieni[10315] tms[7445] aine[5807] väri[5716] raha[4629] suuri[4383] helppo[4324] saattaa[4044] heprea[3129]
  • topic3=suomi[89106] suku[84950] substantiivi[70654] pudottaa[59703] kasvi[46085] käännös[37875] luokka[35566] sana[33868] kieli[32850] taivutusmuoto[32067]
  • topic4=ohjaus[129425] white[9304] off[8670] black[6825] red[5066] sotilas[4893] fraasi[4835] yellow[3943] perinteinen[3744] flycatcher[3735]
  • topic5=lati[48738] eesti[25987] www[17839] http[17073] keele[15733] eki[12421] lähde[11306] dict[11104] sõnaraamat[10648] tallinn[8504]
  • topic6=suomi[534914] käännös[292690] substantiivi[273243] aihe[256126] muualla[254788] sana[194213] liittyvä[193298] etymologi[164158] viite[104417] kieli[102489]
  • topic7=italia[66367] substantiivi[52038] japani[27988] inarinsaame[9464] kohta[7433] yhteys[7071] vaatekappale[5553] rinnakkaismuoto[5469] taas[4986] voimakas[3912]
  • topic8=sana[548232] liittyvä[493888] substantiivi[298421] ruotsi[164717] synonyymi[118244] alas[75430] etymologi[64170] liikuttaa[38058] johdos[25603] yhdyssana[24943]
  • topic9=juuri[3794] des[3209] jumala[1799] tadžikki[1686] tuntea[1639] tekijä[1526] tulo[1523] mitta[1337] jatkuva[1329] levy[1197]
  • topic10=törmätä[22942] user[2374] sur[1664] self[1643] hallita[1447] voittaa[1243] piste[1178] data[1118] harjoittaa[939] jstak[886]

The format of the topic list I used here is “topicX=word1[count] word2[count]”, where X is the number of the topic, word1 is the first word in the topic, word2 the second, and so on. The [count] is how many times the word was associated with the topic in different documents. Consider it the strength, weight, or whatever of the word in the topic.

So just a few notes on the above topic list:

  • topic0 = mostly website-related terms, interleaved with a few odd ones. Examples of the odd ones: “liivi” = vest, “luku” = number/chapter (POS tagging would help differentiate), “murre” = dialect.
  • topic1 = mostly Finnish-language related terms. “viro” = Estonia, which is slightly odd to have here. It is the closest related language to Finnish, but still..
  • topic3 = another Finnish-language related topic. The odd one here is “kasvi” = plant. Generally this topic seems to be more about words and their forms, whereas topic1 is maybe more about structure and relations.
  • topic5 = Estonia related

Overall, I think this would improve given more passes over the corpus to train the model. This would give the algorithm more time and data to refine the model. I only ran it with one pass here since the training for more topics and with more passes started taking days and I did not have the resources to go there.

My guess is also that with more data and broader concepts (Wikipedia covering pretty much every topic there is..) you would need more topics than the 50 I used here. However, I had to limit the size due to time and resource constraints. Gensim probably also has more advanced tuning options (e.g., parallel runs) that would help with speed; a sketch of that follows below. So I tried a few more sizes and passes with the smaller Oulu city board dataset, as it was faster to run.
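
For example, Gensim has a multicore variant of the LDA trainer that can use several worker processes. A sketch (tiny stand-in data here; in practice this would be the same corpus and dictionary built from the lemmatized documents):

from gensim import corpora
from gensim.models import LdaMulticore

# Tiny stand-in corpus just to show the call.
docs = [["oulu", "kaupunki", "palvelu"], ["koulu", "oppilas", "opettaja"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# workers = number of extra worker processes used for training.
lda = LdaMulticore(corpus, id2word=dictionary, num_topics=2, passes=20, workers=3)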

Some topics for the city of Oulu board minutes, run for 20 topics and 20 passes over the training data:

  • topic0=oulu[2096] kaupunki[1383] kaupunginhallitus[1261] 2013[854] päivämäärä[575] vuosi[446] päätösesitys[423] jäsen[405] hallitus[391] tieto[387]
  • topic1=kunta[52] palvelu[46] asiakaspalvelu[41] yhteinen[38] viranomainen[25] laki[24] valtio[22] myös[20] asiakaspalvelupiste[19] kaupallinen[17]
  • topic2=oulu[126] palvelu[113] kaupunki[113] koulu[89] tukea[87] edistää[71] vuosi[71] osa[64] nuori[63] toiminta[61]
  • topic3=tontti[490] kaupunki[460] oulu[339] asemakaava[249] rakennus[241] kaupunginhallitus[234] päivämäärä[212] yhdyskuntalautakunta[206] muutos[191] alue[179]
  • topic5=kaupunginhallitus[1210] päätös[1074] jäsen[861] oulu[811] kaupunki[681] pöytäkirja[653] klo[429] päivämäärä[409] oikaisuvaatimus[404] matti[316]
  • topic6=000[71] 2012[28] oulu[22] muu[20] tilikausi[16] vuosi[16] yhde[15] kunta[14] 2011[13] 00000[13]
  • topic8=alue[228] asemakaava[96] rakentaa[73] tulla[58] oleva[56] rakennus[55] merkittävä[53] kortteli[53] oulunsalo[50] nykyinen[48]
  • topic10=asiakirjat.ouka.fi[15107] ktwebbin[15105] 2016[7773] eet[7570] pk_asil_tweb.htm?[7551] ktwebscr[7550] dbisa.dll[7550] url=http[7540] doctype[7540] =3&docid[7540]
  • topic11=yhtiö[31] osake[18] osakas[11] energia[10] hallitus[10] 18.11.2013[8] liite[7] lomautus[6] sähkö[6] osakassopimus[5]
  • topic12=13.05.2013[13] perlacon[8] kuntatalousfoorumi[8] =1418[6] meeting_date=21.3.2013[6] =2070[6] meeting_date=28.5.2013[6] =11358[5] meeting_date=3.10.2016[5] -31.8.2015[4]
  • topic13=001[19] oulu[11] 002[5] kaupunki[4] sivu[3] ���[3] palvelu[3] the[3] asua[2] and[2]

Some notes on the topics above:

  • The word “oulu” repeats in most of the topics. This is quite natural as all the documents are from the board of the city of Oulu. Depending on the use case for the topics, it might be useful to add this word to the list of words to be removed in the pre-cleaning phase for the documents before running the topic modelling algorithm. Or it might be useful information, along with the weight of the word inside the topic. Depends.
  • topic0 = generally about the board structure. For example, “kaupunki” = city, “kaupunginhallitus” = city board, “päivämäärä” = date, “päätösesitys” = proposal for decision.
  • topic1 = mostly city service related words. For example, “kunta” = municipality, “palvelu” = service, “asiakaspalvelu” = customer service, “myös” = also, so something to add to the cleaners again.
  • topic2 = School related. For example, “koulu” = school, “tukea” = support, … Sharing again common words such as “kaupunki” = city, which may also be considered for removal or not depending on the case.
  • topic3 = City area planning related. For example, “tontti” = plot of land, “asemakaava” = zoning plan, …
  • In general, quite good and focused topics here, so I think a good result overall. Some exceptions to consider:
  • topic10 = mostly garbage related to HTML formatting and website link structures. Still a real topic of course, so nicely identified.. Something to consider adding to the cleaning list for pre-processing.
  • topic12 = seems related to some city finance related consultation (Perlacon seems to be such a company) and an associated event (the forum), with a bunch of meeting dates.
  • topic13 = unclear garbage
  • So in general, I guess reasonably good results but in real applications, several iterations of fine-tuning the words, the topic modelling algorithm parameters, etc. based on the results would be very useful.

So that was the city minutes topics for a smaller set of topics and more passes. What does it look like for 100 topics, and how does the number of passes over the corpus affect the larger topic count? More passes should give the algorithm more time to refine the topics, but smaller datasets might not have that many good topics..

For 100 topics, 1 pass, first 10 topics:

  • topic0=oulu[55] kaupunki[22] 000[20] sivu[14] palvelu[14] alue[13] vuosi[13] muu[11] uusi[11] tavoite[9]
  • topic1=kaupunki[18] oulu[17] jäsen[15] 000[10] kaupunginhallitus[7] kaupunginjohtaja[6] klo[6] muu[5] vuosi[5] takaus[4]
  • topic2=hallitus[158] oulu[151] 25.03.2013[135] kaupunginhallitus[112] jäsen[105] varsinainen[82] tilintarkastaja[79] kaupunki[75] valita[70] yhtiökokousedustaja[50]
  • topic3=kuntalisä[19] oulu[16] palkkatuki[15] kaupunki[14] tervahovi[13] henkilö[12] tukea[12] yritys[10] kaupunginhallitus[10] työtön[9]
  • topic4=koulu[37] oulu[7] sahantie[5] 000[5] äänestyspaikka[4] maikkulan[4] kaupunki[4] kirjasto[4] monitoimitalo[3] kello[3]
  • topic5=oulu[338] kaupunki[204] euro[154] kaupunginhallitus[143] 2013[105] vuosi[96] milj[82] palvelu[77] kunta[71] uusi[64]
  • topic6=000[8] oulu[7] kaupunki[4] vuosi[3] 2012[3] muu[3] kunta[2] muutos[2] 2013[2] sivu[1]
  • topic7=000[5] 26.03.2013[4] oulu[3] 2012[3] kunta[2] vuosi[2] kirjastojärjestelmä[2] muu[1] kaupunki[1] muutos[1]
  • topic8=oulu[471] kaupunki[268] kaupunginhallitus[227] 2013[137] päivämäärä[97] päätös[93] vuosi[71] tieto[67] 000[66] päätösesitys[64]
  • topic9=oulu[5] lomautus[3] 000[3] kaupunki[2] säästötoimenpidevapaa[1] vuosi[1] kunta[1] kaupunginhallitus[1] sivu[1] henkilöstö[1]
  • topic10=oulu[123] kaupunki[82] alue[63] sivu[43] rakennus[42] asemakaava[39] vuosi[38] tontti[38] 2013[35] osa[35]

Without going too much into translating every word, I would say these results are too spread out, so for this dataset it seems a smaller set of topics would do better. This also seems to be visible in the word counts/strengths in the [square brackets]: the topics with small weights seem like pretty poor topics, while the ones with bigger weights look better (just my opinion of course :)). Maybe something to consider when trying to explore the number of topics; a sketch of a more systematic way to do that follows below.
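
One way I have seen suggested for exploring the number of topics a bit more systematically is Gensim’s topic coherence score. A sketch (stand-in data again; in practice the real lemmatized documents):

from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

# Stand-in data; in practice "texts" would be the lemmatized documents.
texts = [["oulu", "kaupunki", "palvelu"], ["koulu", "oppilas", "opettaja"],
         ["tontti", "asemakaava", "rakennus"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Train models with different topic counts and compare their coherence scores;
# higher coherence generally suggests more interpretable topics.
for num_topics in (10, 20, 50, 100):
    lda = LdaModel(corpus, id2word=dictionary, num_topics=num_topics, passes=20)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
    print(num_topics, cm.get_coherence())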

And the same run, this time with 20 passes over the corpus (100 topics and 10 first ones shown):

  • topic0=oulu[138] kaupunki[128] palvelu[123] toiminta[92] kehittää[73] myös[72] tavoite[62] osa[55] vuosi[50] toteuttaa[44]
  • topic1=-seurantatieto[0] 2008-2010[0] =30065[0] =170189[0] =257121[0] =38760[0] =13408[0] oulu[0] 000[0] kaupunki[0]
  • topic2=harmaa[2] tilaajavastuulaki[1] tilaajavastuu.fi[1] torjunta[1] -palvelu[1] talous[0] harmaantalous[0] -30.4.2014[0] hankintayksikkö[0] kilpailu[0]
  • topic3=juhlavuosi[14] 15.45[11] perussopimus[9] reilu[7] kauppa[6] juhlatoimikunta[6] työpaja[6] 24.2.2014[6] 18.48[5] tapahtumatuki[4]
  • topic4=kokous[762] kaupunginhallitus[591] päätös[537] pöytäkirja[536] työjärjestys[362] hyväksyä[362] tarkastaja[360] esityslista[239] valin[188] päätösvaltaisuus[185]
  • topic5=koulu[130] sivistys-[35] suuralue[28] perusopetus[25] tilakeskus[24] kulttuurilautakunta[22] järjestää[22] korvensuora[18] päiväkota[17] päiväkoti[17]
  • topic6=piste[24] hanke[16] toimittaja[12] hankesuunnitelma[12] tila[12] toteuttaa[11] hiukkavaara[10] hyvinvointikeskus[10] tilakeskus[10] monitoimitalo[9]
  • topic7=tiedekeskus[3] museo-[2] prosenttipohjainen[2] taidehankinta[1] uudisrakennushanke[1] hankintamääräraha[1] prosenttitaide[1] hankintaprosessi[0] toteutusajankohta[0] ulosvuokrattava[0]
  • topic8=euro[323] milj[191] vuosi[150] oulu[107] talousarvio[100] tilinpäätös[94] kaupunginhallitus[83] kaupunki[79] 2012[73] 2013[68]
  • topic9=päätös[653] oikaisuvaatimus[335] oulu[295] kaupunki[218] päivä[215] voi[211] kaupunginhallitus[208] posti[187] pöytäkirja[161] viimeinen[154]

Even the smaller topics here seem much better now with the increase in passes over the corpus. So perhaps the real difference just comes from having enough passes over the data, giving the algorithm more time and data to refine the models. At least I would not skip multiple passes, based on comparing the 1-pass vs 20-pass results here.

For example, topic2 here has small numbers but still all items seem related to grey market economy. Similarly, topic7 has small numbers but the words are mostly related to arts and culture.

So to summarize: lemmatizing your words, exploring your parameters, and making sure you have a decent amount of data and a decent number of passes for the algorithm are all good points. As is properly cleaning your data, and iterating over the process many times to get these right (well, as “right” as you can).

To answer my “research questions” from the beginning: topic modelling for different languages and use cases for topic modelling.

First, lemmatize all your data (I prefer it over stemming, but it can be more resource intensive). Clean all your data of the typical stopwords for your language, but also of those specific to your dataset and domain. Run the models and analysis several times, and keep refining your list of removed words based on your use case, your dataset, and your domain. You also likely need to consider domain-specific lemmatization rules, as I already discussed with POS tagging. A small sketch of the cleaning part follows below.
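
A sketch of the cleaning side with Gensim, dropping both a domain-specific stoplist and very rare or very common words (the word lists here are just examples):

from gensim import corpora

# Stand-in documents; in practice these are the lemmatized texts.
texts = [["oulu", "kaupunki", "palvelu", "myös"],
         ["oulu", "kaupunki", "koulu", "myös"],
         ["oulu", "palvelu", "koulu"]]

# Domain/dataset specific words to drop, refined over several iterations.
domain_stopwords = {"oulu", "myös"}
texts = [[w for w in doc if w not in domain_stopwords] for doc in texts]

dictionary = corpora.Dictionary(texts)
# Also drop words appearing in fewer than no_below documents, or in more than
# no_above (as a fraction) of all documents.
dictionary.filter_extremes(no_below=2, no_above=0.9)
print(dictionary.token2id)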

Secondly, what use cases did I find when looking at topic modelling online? Actually, it seems really hard to find concrete reports of actual uses for topic models. Quora has usually been promising, but not so much this time. So I looked at published research papers instead, trying to see if any companies were involved as well.

Some potential use cases from research papers:

Bug localization, as in finding the locations of bugs in source code, is investigated here. Source code (comments, source code identifiers, etc.) is modelled as topics, which are mapped to a query created from a bug report.

Matching duplicate documents is investigated here. Topic distributions over bug reports are used to suggest duplicate bug reports: not exact duplicates, but reports describing the same bug. If the topic distributions are close, flag them as potentially discussing the same “topic” (bug).
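
A sketch of what the comparison part could look like with Gensim: take the topic distributions of two reports and compare them with a distance measure such as Hellinger distance (the data and the idea of using a threshold are made up for illustration):

from gensim import corpora, matutils, models

# Stand-in "bug reports"; in practice these would be the lemmatized report texts.
texts = [["crash", "login", "button"], ["crash", "login", "screen"],
         ["slow", "database", "query"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# Topic distributions of two reports, compared with Hellinger distance
# (0 = identical); a small distance would flag a potential duplicate.
dist_a = lda.get_document_topics(corpus[0], minimum_probability=0)
dist_b = lda.get_document_topics(corpus[1], minimum_probability=0)
print(matutils.hellinger(dist_a, dist_b))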

Ericsson has used topic models to map incoming bug reports to specific components, to make resolving bugs easier and faster by automatically assigning them to the (correct) teams for resolution. Large historical datasets of bug reports and their assignments to components are used to learn the topic models. Topic distributions of incoming bug reports are used to give probability rankings for the bug report describing a specific component, in comparison to topic distributions of previous bug reports for that component. The topic distributions are also used as explanatory data to present to the expert looking at the classification results. Later, different approaches have been reported at Ericsson as well. So just a reminder that topic models are not the answer to everything, even if a useful component and worth a try in places.

In cyber security, this work uses topic models to describe users’ activity as distributions over the different topics: learn topic models from user activity logs, and describe each user’s typical activity as a topic distribution. If a log entry (e.g., a session?) diverges too much from this topic distribution for the user, flag it as an anomaly to investigate. I would expect simpler things could work for this as well, but as input for anomaly detection, an interesting thought.

Tweet analysis is popular in NLP. This is an example of high-level tweet topic classification: politics, sports, science, … Useful input for recommendations etc., I am sure. A more targeted, domain-specific example is using topics in typhoon-related tweet analysis and classification: worried, damage, food, rescue operations, flood, … useful input for situation awareness, I would expect. As far as I understood, topic models were generated, labelled, and then users (or tweets) assigned to the (high-level) topics by their topic distributions. Tweets are very small documents, so that is something to consider, as discussed in those papers.

Topic models have also been used in biomedicine for text analysis: to find patterns (topic distributions) in papers discussing specific genes, for example. This could work more broadly as one tool to explore research in an area, to find clusters of concepts in broad sets of research papers on a specific “topic” (here, research on a specific gene). Of course, there likely exist a number of other techniques to investigate for that as well, but topic models could have potential.

More generally, labelling and categorizing large numbers of historical/archival documents to assist users in search: build topic models, have experts review them and give the topics labels, then label your documents based on their topic distributions.

A bit further outside the box: split songs into segments based on their acoustic properties, and use topic modelling to identify different categories/types of music in large song databases. Then explore the popularity of such categories/types over time based on the topic distributions over time. So the segments are your words, and the songs are your documents.

Finding duplicate images in large data sets: use image features as words, and images as documents. Build topic models from all the images, and find similar types of images by their topic distributions. Features could be edges, or even abstract ones such as those learned by something like a convolutional neural net. Assists in image search, I guess..

Most of these uses seem to be various types of search assistance, with a few odd ones thinking outside the box. With a decent understanding, and some exploration, I think topic models can be useful in many places. The academics would say “dude, XYZ would work just as well”. Sure, but if it does the job for me, and is simple and easy to apply..