Investigating ROC/AUC

Executive Summary

ROC and AUC are terms that often come up in machine learning, in relation to evaluating models. In this post, I try examine what ROC curves actually are, how they are calculated, what is a threshold in ROC curve, and how it impacts the classification if you change it. The results show that using low threshold values leads to classifying more objects into positive category, and high threshold leads to more negative classifications. The Titanic dataset is used here as an example to study the threshold effect. A classifier gives probabilities for people to survive.

By default, the classifier seems to use 50% probability as a threshold to classify one as a survivor (positive class) or non-survivor (negative class). But this threshold can be varied. For example, using a threshold of 1% leads to giving a survivor label to anyone who gets a 1% or higher probability to survive by the classifier. Similarly, a threshold of 95% requires people to get at least 95% probability from the classifier to be given a survivor label. This is the threshold that is varied, and the ROC curve visualizes how the change in this probability threshold impacts classification in terms of getting true positives (actual survivors predicted as survivors) vs getting false positives (non-survivors predicted as survivors).

For the loooong story, read on. I dare you. 🙂

Introduction

Area under the curve (AUC) and receiver operator characteristic (ROC) are two terms that seem to come up a lot when learning about machine learning. This is my attempt to get it. Please do let me know where it goes wrong..

AUC is typically drawn as a curve using some figure like this (from Wikipedia):

Roccurves
FIGURE 1. Example ROC/AUC curve.

It uses true positive rate (TPR) and false positive rate (FPR) as the two measures to compare. A true positive (TP) being a correct positive prediction and a false positive (FP) being a wrong positive prediction.

There is an excellent introduction to the topic in the often cited An introduction to ROC analysis article by Tom Fawcett. Which is actually a very understandable article (for most part), unlike most academic articles I read. Brilliant. But still, I wondered, so lets see..

To see ROC/AUC in practice, instead of just reading about it all over the Internet, I decided to implement a ROC calculation and see if I can match it with the existing implementations. If so, that should at least give me some confidence on my understanding on the topic being correct.

Training a classifier to test ROC/AUC

To do this, I needed a dataset and a fitting of some basic classifier on that data to run my experiments. I have recently been going through some PySpark lectures on Udemy, so I could maybe learn some more big data stuffs and get an interesting big data job someday. Woohoo. Anyway. The course was using the Titanic dataset, so I picked that up, and wrote a simple classifier for it in Pandas/Scikit. Being a popular dataset, there is also a related Kaggle for it. Which is always nice, allowing me to create everything a bit faster by using the references, and focus on the ROC investigation faster.

The classifier I used is Logistic Regression, and the notebook is available on my GitHub (GitHub sometimes seems to have issues rendering but its there, will look into putting also on Kaggle later). Assuming some knowledge of Python, Pandas, sklearn, and the usual stuff, the training of the model itself is really the same as usual:

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=2)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

The above is of course just a cut off into the middle of the notebook, where I already pre-processed the data to build some set of features into the features variable, and took the prediction target (survived) into the target variable. But that is quite basic ML stuff you find in every article on the topic, and on many of the “kernels” on the Kaggle page I linked. For more details, if all else fails, check my notebook linked above.

The actual training of the classifier (after preprocessings..) is as simple as the few lines above. That’s it. I then have a Logistic Regression classifier I can use to try and play with ROC/AUC. Since ROC seems to be about playing with thresholds over predicted class probabilities, I pull out the target predictions and their given probabilities for this classifier on the Titanic dataset:

predictions = logreg.predict(X_test)
predicted_probabilities = logreg.predict_proba(X_test)

Now, predicted_probabilities holds the classifier predicted probability values for each row in the dataset. 0 for never going to survive, 1 for 100% going to survive. Just predictions, of course, not actual facts. That’s the point of learning a classifier, to be able to predict the result for instance where you don’t know in advance, as you know.. 🙂

Just to see a few things about the data:

predicted_probabilities.shape

>(262, 2)

X_test.shape

>(262, 9)

This shows the test set has 262 rows (data items), there are 9 feature variables I am using, and the prediction probabilities are given in 2 columns. The classifier is a binary classifier, giving predictions for a given data instance as belonging to one of the two classes. In this case it is survived and not survived. True prediction equals survival, false prediction equals non-survival. The predicted_probability variable contains probabilities for false in column 0 and true in column 1. Since probability of no survival (false) is the opposite of survival (1-survival_probability, (true)), we really just need to keep one of those two columns. Because if it is not true, it has to be false. Right?

Cutting the true/false predictions to only include the true prediction, or the probability of survival (column 1, all rows):

pred_probs = predicted_probabilities[:, 1]

Drawing some ROC curves

With this, getting the ROC curve from sklearn is as simple as:

[fpr, tpr, thr] = roc_curve(y_test, pred_probs)

The Kaggle notebook I used as a reference, visualizes this and also drew fancy looking two dashed blue lines to illustrate a point on the ROC curve, where 95% of the surviving passengers were identified. This point is identified by:

# index of the first threshold for which the TPR > 0.95
idx = np.min(np.where(tpr > 0.95))

So it is looking for the minimum index in the TPR list, where the TPR is above 95%. This results in a ROC curve drawing as:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr, tpr, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot([0,fpr[idx]], [tpr[idx],tpr[idx]], 'k--', color='blue')
plt.plot([fpr[idx],fpr[idx]], [0,tpr[idx]], 'k--', color='blue')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()

The resulting figure is:

roc0
FIGURE 2. The ROC curve I nicked from Kaggle.

How to read this? In bottom left we have zero FTP, and zero TPR. This means everything is predicted as false (no survivors). This would have threshold of 100% (1.0), meaning you just classify everything as false because no-one gets over 100% probability to survive. In top right corner both TPR and FPR are 1.0, meaning you classify everyone as a survivor, and no-one as a non-survivor. So false positives are maximized as are true positives, since everyone is predicted to survive. This is a result of a threshold of 0%, and everyone gets a probability higher than 0% to survive. The blue lines indicate a point where over 95% of true positives are identified (TPR value), simultaneously leading to getting about 66% FPR. Of course, the probability threshold is not directly visible in this but has to be otherwise looked up, as we shall see in all the text I wrote below.

To see the accuracy and AUC calculated for the base (logistic regression) classifier, and the AUC calculated for the ROC in the figure above:

print(logreg.__class__.__name__+" accuracy is %2.3f" % accuracy_score(y_test, predictions))
print(logreg.__class__.__name__+" log_loss is %2.3f" % log_loss(y_test, pred_probs))
print(logreg.__class__.__name__+" auc is %2.3f" % auc(fpr, tpr))

>LogisticRegression accuracy is 0.832
>LogisticRegression log_loss is 0.419
>LogisticRegression auc is 0.877

And some more statistics from sklearn to compare against when trying to understand it all:

confusion_matrix(y_test, predictions)

>array([[156,  19],
>       [ 25,  62]])

precision_score(y_test, predictions)

>0.7654320987654321

accuracy_score(y_test, predictions)

>.8320610687022901

In the confusion matrix, there are 156 correct non-survivor predictions (true negatives), 19 wrong ones (false negatives). 62 correct survivor predictions (true positives), 25 wrong ones (false positives). The precision of the classifier is given as 0.7654320987654321 and accuracy as 0.8320610687022901.

To evaluate my ROC/AUC calculations, I first need to try to understand how the algorithms calculate the metrics, and what are the parameters being passed around. I assume the classification algorithms would use 0.5 (or 50%) as a default threshold to classify a prediction with a probability of 0.5 or more as 1 (predicted true/survives) and anything less than 0.5 as 0 (predicted non-survivor/false). So let’s try to verify that. First, to get the predictions with probability values >= 0.5 (true/survive):

#ss is my survivor-set
ss1 = np.where(pred_probs >= 0.5)
print(ss1)

>(array([  0,   1,   3,   4,   8,  13,  16,  26,  27,  30,  32,  42,  43,
>         46,  49,  50,  57,  59,  66,  67,  76,  77,  79,  83,  84,  85,
>         86,  92,  93,  98, 101, 109, 111, 113, 117, 121, 122, 125, 129,
>        130, 133, 137, 140, 143, 145, 147, 148, 149, 150, 152, 158, 170,
>        175, 176, 177, 180, 182, 183, 184, 186, 201, 202, 204, 205, 206,
>        211, 213, 215, 218, 227, 230, 234, 240, 241, 242, 246, 248, 249,
>        250, 252, 256]),)

len(ss1[0])

>81

The code above filters the predicted probabilities to get a list of values where the probability is higher than or equal to 0.5. At first look, I don’t know what the numbers in the array are. However, my guess is that they are indices into the X_train array that was passed as the features to predict from. So at X_train indices 0,1,3,4,8,… are the data points predicted as true (survivors). And here we have 81 such predictions. That is the survivor predictions, how about non-survivor predictions?

Using an opposite filter:

ss2 = np.where(pred_probs 181

Overall, 81 predicted survivors, 181 predicted casualities. Since y_test here has known labels (to test against), we can check how many real survivors there are, and what is the overall population:

y_test.value_counts()

>0    175
>1     87

So the amount of actual survivors in the test set is 87 vs non survivors 175. To see how many real survivors (true positives) there are in the predicted 81 survivors:

sum(y_test.values[i] for i in ss1[0])

>62

The above is just some fancy Python list comprehension or whatever that is for summing up all the numbers in the predicted list indices. Basically it says that out of the 81 predicted survivors, 62 were actual survivors. Matches the confusion matrix from above, so seems I got that correct. Same for non-survivors:

sum(1 for i in ss2[0] if y_test.values[i] == 0)

>156

So, 156 out of the 181 predicted non-survivors actually did not make it. Again, matches the confusion matrix.

Now, my assumption was that the classifier uses the threshold of 0.5 by default. How can I use these results to check if this is true? To do this, I try to match the sklearn accuracy score calculations using the above metrics. Total correct classifications from the above (true positives+true negatives) is 156+62. Total number of items to predict is equal to the test set size, so 262. The accuracy from this is:

(156+62)/262

>0.8320610687022901

This is a perfect match on the accuracy_score calculated above. So I conclude with this that I understood the default behaviour of the classifier correct. Now to use that understanding to see if I got the ROC curve correct. In FIGURE 1 far above, I showed the sklearn generated ROC curve for this dataset and this classifier. Now I need to build my own from scratch to see if I  get it right.

Thresholding 1-99% in 1% increments

To build one a ROC curve myself using my own threshold calculations:

tpr2 = []
fpr2 = []
#do the calculations for range of 1-100 percent confidence, increments of 1
for threshold in range(1, 100, 1):
    ss = np.where(pred_probs >= (threshold/100))
    count_tp = 0
    count_fp = 0
    for i in ss[0]:
        if y_test.values[i] == 1:
            count_tp += 1
        else:
            count_fp += 1
    tpr2.append(count_tp/87.0)
    fpr2.append(count_fp/175.0)
    print("threshold="+str(threshold)+", count_tp="+str(count_tp)+", count_fp="+str(count_fp))

The above code takes as input the probabilities (pred_probs) given by the Logistic Regression classifier for survivors. It then tries the threshold values from 1 to 99% in 1% increments. This should produce the ROC curve points, as the ROC curve should describe the prediction TPR and FPR with different probability thresholds. The code gives a result of:

threshold=1, count_tp=87, count_fp=175
threshold=2, count_tp=87, count_fp=174
threshold=3, count_tp=87, count_fp=174
threshold=4, count_tp=87, count_fp=173
threshold=5, count_tp=87, count_fp=171
threshold=6, count_tp=87, count_fp=169
...
threshold=94, count_tp=3, count_fp=1
threshold=95, count_tp=1, count_fp=0
threshold=96, count_tp=0, count_fp=0
threshold=97, count_tp=0, count_fp=0
threshold=98, count_tp=0, count_fp=0
threshold=99, count_tp=0, count_fp=0

At 1% threshold, everyone that the classifier gave a probability of 1% or higher to survive is classified as a likely survivor. In this dataset, everyone is given 1% or more survival probability. This leads to everyone being classified as a likely survivor. Since everyone is classified as survivor, this gives true positives for all 87 real survivors, but also false positives for all the 175 non-survivors. At 6% threshold, the FPR has gone down a bit, with only 169 non-survivors getting false-positives as survivors.

At the threshold high-end, the situation reverses. At threshold 94%, only 3 true survivors get classified as likely survivors. Meaning only 3 actual survivors scored a probability of 94% or more to survive from the classifier. There is one false positive at 94%, so one who was predicted to have a really high probability to survive (94%) did not. At 95% there is only one predicted survivor, which is a true positive. After that no-one scores 96% or more. Such high outliers would probably make interesting points to look into in more detail, but I won’t go there in this post.

Using the true positive rates and false positive rates from the code above (variables tpr2 and fpr2), we can make a ROC curve for these calculations as:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr2, tpr2, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr2, tpr2))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC test')
plt.legend(loc="lower right")
plt.show()

roc2
FIGURE 3. ROC curve for thresholds of 1-99% in 1% steps.

Comparing this with the ROC figure form sklearn (FIGURE 2), it is a near-perfect match. We can further visualize this by plotting the two on top of each other:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr2, tpr2, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr2, tpr2))
plt.plot(fpr, tpr, color='blue', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC test')
plt.legend(loc="lower right")
plt.show()

roc3
FIGURE 4. My 1-99% ROC vs sklearn ROC.

There are some very small differences in the sklearn version having more “stepped” lines, whereas the one I generated from threshold of 1-99% draws the lines a bit more straight (“smoother”). Both end up with the exact same AUC value (0.877). So I would call this ROC curve calculation solved, and claim that the actual calculations match what I made here. Solved as in I finally seem to have understood what it exactly means.

What is the difference

Still, loose ends are not nice. So what is causing the small difference in the ROC lines from the sklearn implementation vs my code?

The FPR and TPR arrays/lists form the x and y coordinates of the graphs. So looking at those might help understand a bit:

fpr.shape

>(95,)

len(fpr2)

>99

The above shows that the sklearn FPR array has 95 elements, while the one I created has 99 elements. Initially I thought, maybe increasing detail in the FPR/TPR arrays I generate would help match the two. But it seems I already generated more points than is in the sklearn implementation. Maybe looking into the actual values helps:

fpr

array([0.        , 0.        , 0.00571429, 0.00571429, 0.01142857,
       0.01142857, 0.01714286, 0.01714286, 0.02285714, 0.02285714,
       0.02857143, 0.02857143, 0.03428571, 0.03428571, 0.04      ,
       0.04      , 0.04571429, 0.04571429, 0.05142857, 0.05142857,
       0.05714286, 0.05714286, 0.06285714, 0.06285714, 0.07428571,
       0.07428571, 0.07428571, 0.08      , 0.08      , 0.09142857,
       0.09142857, 0.09714286, 0.09714286, 0.10857143, 0.10857143,
       0.12      , 0.12      , 0.14857143, 0.14857143, 0.16      ,
       0.16      , 0.17142857, 0.17142857, 0.17142857, 0.18857143,
       0.18857143, 0.21142857, 0.21142857, 0.22285714, 0.22857143,
       0.23428571, 0.23428571, 0.24      , 0.24      , 0.26285714,
       0.28571429, 0.32571429, 0.33714286, 0.34857143, 0.34857143,
       0.37714286, 0.37714286, 0.38285714, 0.38285714, 0.38857143,
       0.38857143, 0.4       , 0.4       , 0.43428571, 0.45714286,
       0.46857143, 0.62857143, 0.62857143, 0.63428571, 0.63428571,
       0.64      , 0.64      , 0.64571429, 0.65714286, 0.66285714,
       0.66285714, 0.68      , 0.69142857, 0.72      , 0.73142857,
       0.73714286, 0.74857143, 0.79428571, 0.82285714, 0.82285714,
       0.85142857, 0.86285714, 0.88      , 0.89142857, 1.        ])

np.array(fpr2)

array([1.        , 0.99428571, 0.99428571, 0.98857143, 0.97714286,
       0.96571429, 0.96571429, 0.95428571, 0.92      , 0.89714286,
       0.84571429, 0.78857143, 0.66285714, 0.63428571, 0.58857143,
       0.52      , 0.49142857, 0.48      , 0.40571429, 0.39428571,
       0.38285714, 0.37714286, 0.35428571, 0.33714286, 0.32      ,
       0.31428571, 0.29714286, 0.26285714, 0.25142857, 0.24571429,
       0.23428571, 0.22857143, 0.21142857, 0.21142857, 0.20571429,
       0.20571429, 0.2       , 0.19428571, 0.18857143, 0.18857143,
       0.17714286, 0.17714286, 0.17142857, 0.16      , 0.14857143,
       0.13714286, 0.13142857, 0.12      , 0.10857143, 0.10857143,
       0.10857143, 0.09714286, 0.09714286, 0.09714286, 0.08571429,
       0.08571429, 0.08      , 0.07428571, 0.07428571, 0.06857143,
       0.06285714, 0.05714286, 0.05714286, 0.05142857, 0.05142857,
       0.04571429, 0.04      , 0.04      , 0.04      , 0.04      ,
       0.03428571, 0.02857143, 0.02857143, 0.02857143, 0.02857143,
       0.02285714, 0.02285714, 0.02285714, 0.01714286, 0.01714286,
       0.01142857, 0.01142857, 0.01142857, 0.01142857, 0.01142857,
       0.01142857, 0.00571429, 0.00571429, 0.00571429, 0.00571429,
       0.00571429, 0.00571429, 0.00571429, 0.00571429, 0.        ,
       0.        , 0.        , 0.        , 0.        ])

In the above, I print two arrays/lists. The first one (fpr) is the sklearn array. The second one (fpr2) is the one I generated myself. The fpr2 contains many duplicate numbers one after the other, whereas fpr has much more unique numbers. My guess is, the combination of fpr and tpr, as in the sklearn values might have only unique points, whereas fpr2 and tpr2 from my code has several points repeating over multiple times.

What causes this? Looking at sklearn roc_curve method, it actually returns 3 values, and I so far only used 2 of those. The return values are for variables fpr, tpr, thr. The thr one is not yet used and is actually named thresholds in the sklearn docs. What are these thresholds?

thr

array([0.95088415, 0.94523744, 0.94433151, 0.88735906, 0.86820046,
       0.81286851, 0.80847662, 0.79241033, 0.78157711, 0.76377304,
       0.75724659, 0.72956179, 0.71977863, 0.70401182, 0.70160521,
       0.66863069, 0.66341944, 0.65174523, 0.65024897, 0.6342464 ,
       0.63001635, 0.61868955, 0.61479088, 0.60172207, 0.59225844,
       0.58177374, 0.58027421, 0.57885387, 0.56673186, 0.54986331,
       0.54937079, 0.54882292, 0.52469167, 0.51584338, 0.50426613,
       0.48014018, 0.47789982, 0.45232349, 0.44519259, 0.44203986,
       0.43892674, 0.43378224, 0.42655293, 0.42502826, 0.40283515,
       0.39572852, 0.34370988, 0.34049813, 0.32382773, 0.32099169,
       0.31275499, 0.31017684, 0.30953277, 0.30348576, 0.28008661,
       0.27736946, 0.24317994, 0.24314769, 0.23412169, 0.23370882,
       0.22219555, 0.22039886, 0.21542313, 0.21186501, 0.20940184,
       0.20734276, 0.19973878, 0.19716815, 0.18398295, 0.18369904,
       0.18369725, 0.14078944, 0.14070302, 0.14070218, 0.13874243,
       0.13799297, 0.13715783, 0.13399095, 0.13372175, 0.13345383,
       0.13057344, 0.1279474 , 0.1275107 , 0.1270535 , 0.12700496,
       0.12689617, 0.12680186, 0.11906236, 0.11834502, 0.11440003,
       0.10933941, 0.10860771, 0.10552423, 0.10468277, 0.0165529 ])

This array shows how they are quite different and there is no set value that is used to vary the threshold. Unlike my attempt at doing it in 1% unit changes, this list has much bigger and smaller changes in it. Let’s try my generator code with this same set of threshold values:

tpr3 = []
fpr3 = []
#do the calculations for sklearn thresholds
for threshold in thr:
    ss = np.where(pred_probs >= threshold)
    count_tp = 0
    count_fp = 0
    for i in ss[0]:
        if y_test.values[i] == 1:
            count_tp += 1
        else:
            count_fp += 1
    tpr3.append(count_tp/87.0)
    fpr3.append(count_fp/175.0)
    print("threshold="+str(threshold)+", count_tp="+str(count_tp)+", count_fp="+str(count_fp))

And with the TPR and FPR lists calculated for these thresholds, we can visualize them as well and compare against the sklearn coordinates:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr3, tpr3, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr3, tpr3))
plt.plot(fpr, tpr, color='blue', label='ROC curve (area = %0.3f)' % auc(fpr3, tpr3))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC test')
plt.legend(loc="lower right")
plt.show()

Giving the figure:

roc5
FIGURE 5. Overlapping ROC curves with shared thresholds.

This time only one line is visible since the two are fully overlapping. So I would conclude that at least now I got also my ROC curve understanding validated. My guess is that sklearn does some nice calculations in the back for the ROC curve coordinates, to identify threshold points where there are visible changes and only providing those. While I would then use the ROC function that sklearn or whatever library I use (e.g., Spark) provides, this understanding I managed to build should help me make better use of their results.

Varying threshold vs accuracy

OK, just because this is an overly long post already, and I still cannot stop, one final piece of exploration. What happens to the accuracy as the threshold is varied? As 0.5 seems to be the default threshold, is it always the best? In cases where we want to minimize false positives or maximize true positives, maybe the threshold optimization is most obvious. But in case of just looking at the accuracy, what is the change? To see, I collected accuracy for every threshold value suggested by sklearn roc_curve(), and also for the exact 0.5 threshold (which was not in the roc_curve list but I assume is the classifier default):

tpr3 = []
fpr3 = []
accuracies = []
accs_ths = []

#do the calculations for sklearn thresholds
for threshold in thr:
    ss = np.where(pred_probs >= threshold)
    count_tp = sum(y_test.values[i] for i in ss[0])
    count_fp = len(ss[0]) - count_tp
    my_tpr = count_tp/87.0
    my_fpr = count_fp/175.0

    ssf = np.where(pred_probs < threshold)
    count_tf = sum(1 for i in ssf[0] if y_test.values[i] == 0)

    tpr3.append(my_tpr)
    fpr3.append(my_fpr)
    acc = (count_tp+count_tf)/y_test.count()
    accuracies.append(acc)
    accs_ths.append((acc, threshold))
    print("threshold="+str(threshold)+", tp="+str(count_tp)+", fp="+str(count_fp), ", tf="+str(count_tf), ", acc="+str(acc))

When the accuracy is graphed against the ROC curve, this looks like:

roc-acc
FIGURE 6. ROC thresholds vs Accuracy.

Eyeballing this, at the very left of FIGURE 6 the accuracy is equal to predicting all non-survivors correct (175/262=66.8%) but all survivors wrong. At the very right the accuracy is equal to predicting all survivors correct but all non-survivors wrong (87/262=33.2%). The sweet point is somewhere around 83% accuracy with about 65% true positives and maybe 8% false positives. I am just eyeballing this from the graph, so don’t take it to literally. To get a sorted list of accuracies by threshold:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr3, tpr3, color='blue', label='ROC curve (area = %0.3f)' % auc(fpr3, tpr3))
plt.plot(fpr3, accuracies, color='coral', label='Accuracy' % auc(fpr3, tpr3))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC test')
plt.legend(loc="lower right")
plt.show()

And the sorted list itself:

accs_ths.sort(reverse=True)
accs_ths

results in:

  accuracy,           threshold,          tp, tn,  total correct
 (0.8396946564885496, 0.5802742149381693, 58, 162, 220),
 (0.8396946564885496, 0.5667318576914661, 59, 161, 220),
 (0.8358778625954199, 0.5788538683111758, 58, 161, 219),
 (0.8358778625954199, 0.5493707899389861, 60, 159, 219),
 (0.8358778625954199, 0.5246916651561455, 61, 158, 219),
 (0.8320610687022901, 0.6342464019774986, 52, 166, 218),
 (0.8320610687022901, 0.6186895526474085, 53, 165, 218),
 (0.8320610687022901, 0.6017220679887019, 54, 164, 218),
 (0.8320610687022901, 0.5817737394188721, 56, 162, 218),
 (0.8320610687022901, 0.549863313475955, 59, 159, 218),
 (0.8320610687022901, 0.548822920002867, 60, 158, 218),
 (0.8320610687022901, 0.5042661255726298, 62, 156, 218),
 (0.8320610687022901, 0.5000000000000000, 62, 156, 218),
 (0.8282442748091603, 0.6300163468361882, 52, 165, 217),
 (0.8282442748091603, 0.6147908791071057, 53, 164, 217),
 (0.8282442748091603, 0.5158433833082536, 61, 156, 217),
 (0.8282442748091603, 0.4778998233729275, 63, 154, 217),
 (0.8244274809160306, 0.5922584447936917, 54, 162, 216),
 (0.8244274809160306, 0.4801401821434857, 62, 154, 216),
 ...

Here a number of thresholds actually give a higher score than the 0.5 one I used as the reference for default classifier. The highest scoring thresholds being 0.5802 (58.02%) and 0.5667 (56.67%). The 0.5 threshold gets 218 predictions correct, with an accuracy of 0.8320 (matching the one from accuracy_score() at the beginning of this post). Looking at the thresholds and accuracies more generally, the ones slightly above the 0.5 threshold generally seem to do a bit better. Threshold over 0.5 but less than 0.6 seems to score best here. But there is always an exception, of course (0.515 is lower). Overall, the difference between the values is small but interesting. Is it overfitting the test data when I change the threshold and evaluate against the test data? If so, does the same consideration apply for optimizing for precision/recall using the ROC curve? Is there some other reason why the classifier would use 0.5 threshold which is not optimal? Well, my guess is, it might be a minor artefact of over/underfitting. No idea, really.

Area Under the Curve (AUC)

Before I forget. The title of the post was ROC/AUC. So what is area under the curve (AUC)? The curve is the ROC curve, the AUC is the area under the ROC curve. See FIGURE 7, where I used the epic Aseprite to paint the area under the FIGURE 5 ROC curve. Brilliant. AUC refers to the area covered by this part I colored, under the ROC curve. The value of AUC is calculated as the fraction of the overall area. So consider the whole box of FIGURE 7 as summing up to 1 (TPR x FPR = 1 x 1 = 1), and in this case AUC is the part marked in the figure as area = 0.877. AUC is calculated simply by calculating the size of the area under the curve vs the full box (at full size 1).

roc5_aucFIGURE 7. Area Under the Curve.

I believe the rationale is, the more the colored area covers, or the bigger the AUC value, the better overall performance one could expect from the classifier. As the area grows bigger, the more the classifier is able to separate true positives from false positives at different threshold values.

Uses for AUC/ROC

To me, AUC seems most useful in evaluating and comparing different machine learning algorithms. Which is also what I recall seeing it being used for. In such cases, the higher the AUC, the better overall performance you would get from the algorithm. You can then boast in your paper about having a higher overall performance metric than some other random choice.

ROC I see as mostly useful for providing extra information and an overview to help evaluate options for TPR vs FPR in different threshold configurations for a classifier. The choice and interpretation depends on the use case, of course. The usual use case in this domain is doing a test for cancer. You want to maximize your TPR so you miss out on fewer people with cancer. You can then look for your optimal location of the ROC curve to climb onto, with regards to the cost vs possibly missed cases. So you would want a high TPR there, as far as you can afford I guess. You might have a higher FPR but such is the tradeoff. In this case, the threshold would likely be lower rather than higher.

It seems harder to find examples of optimizing for low FPR with the tradeof being lower TPR as well. Perhaps one could look onto the Kaggle competitions, and pick, for example, the topic of targeted advertising. For lower FPR, you could set a higher threshold rather than lower. But your usecase could be pretty much anything, I guess.

Like I said, recently I have been looking into Spark and some algorithms seem to only give out the estimation as an AUC metric. Which is a bit odd but I guess there is some clever reason for that. I have not looked too deep into that aspect of Spark yet, probably I am just missing the right magical invocations to get all the other scores.

Some special cases are also discussed, for example, in the Fawcett paper. At some point in the curve, one classifier might have a higher point in ROC space even if having overall lower AUC value. So that some threshold would have higher value on one classifier, while other thresholds lower for the same (classifier) pair. Similarly, AUC can be higher overall but a specific classifier still better for a specific use case. Sounds a bit theoretical, but interesting.

Why is it called Receiver Operator Characteristic (ROC)?

You can easily find information on the ROC curve history referencing their origins in world war 2 and radar signal detection theory. Wikipedia ROC article has some start of this history, but is quite short. As usual, Stackexchange gives a good reference. The 1953 article seems paywalled, and I do not have access. This short description describes it as being used to measure the ability of a radio receiver to produce quality readings and enabling the operator to distinguish between false positives and true positives.

Elsewhere I read it originated from Pearl Harbour attack during WW2, where they tried to analyze why the radar operators failed to see incoming attack aircraft. What do I know. The internet is full of one-liner descriptions on this topic, all circling around the same definitions but never going into sufficient details.

Conclusions

Well the conclusion is that this was way too long post on a simple term. And trying to read it, it is not even that clearly written. But that is what I got. It helped me think about the topic and clarify and verify what it really is. Good luck.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s