Attention for Bug Predictions

Previously I wrote about building an issue category predictor using LSTM networks on Keras. That was a two-layer bi-directional LSTM network. A neural network architecture that has been gaining some "attention" recently in NLP is Attention. It is simply an approach to have the network pay more "attention" to specific parts of the input. That's my understanding anyway. Attention layers are used by adding them on top of other existing layers.

In this post, I look at adding Attention to the network architecture of my previous post, and how this impacts the resulting accuracy and training of the network. Since Keras still does not have an official Attention layer at this time (or I cannot find one anyway), I am using one from CyberZHG’s Github. Thanks for the free code!

Network Configurations

I tried a few different (neural) network architectures with Attention, including the ones from my previous post, with and without Glove word embeddings. In addition to these, I tried adding a dense layer before the final output layer, after the last attention layer. Just because I heard bigger is better :). The maximum model configuration of this network looks like this:

attention model

With a model summary as:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
Input (InputLayer)           (None, 1000)              0         
_________________________________________________________________
embedding (Embedding)        (None, 1000, 300)         6000000   
_________________________________________________________________
lstm-bi1 (Bidirectional)     (None, 1000, 256)         440320    
_________________________________________________________________
drop1 (Dropout)              (None, 1000, 256)         0         
_________________________________________________________________
seq_self_attention_3 (SeqSel (None, 1000, 256)         16449     
_________________________________________________________________
lstm-bi2 (Bidirectional)     (None, 1000, 128)         164864    
_________________________________________________________________
drop2 (Dropout)              (None, 1000, 128)         0         
_________________________________________________________________
seq_weighted_attention_3 (Se (None, 128)               129       
_________________________________________________________________
mid_dense (Dense)            (None, 128)               16512     
_________________________________________________________________
drop3 (Dropout)              (None, 128)               0         
_________________________________________________________________
output (Dense)               (None, 165)               21285     
=================================================================
Total params: 6,659,559
Trainable params: 659,559
Non-trainable params: 6,000,000
_________________________________________________________________

I reduced from this maximum to a few different configurations, to see what the impact would be on the loss and accuracy of the predictions.

I used 6 variants, each with the 2 bi-directional LSTM layers:

  • 2-att-d: each bi-lstm followed by attention. dropout of 0.5 and 0.3 after each bi-lstm
  • 2-att: each bi-lstm followed by attention, no dropout
  • 2-att-d2: each bi-lstm followed by attention, dropout of 0.2 and 0.1 after each bi-lstm
  • 2-att-dd: 2-att-d2 with dense in the end, dropouts 0.2, 0.1, 0.3
  • 1-att-d: 2 bi-directional layers, followed by single attention. dropout 0.2 and 0.1 after each bi-lstm.
  • 1-att: 2 bi-directional layers, followed by single attention. no dropout.

The code for the model definition with all the layers enabled:

    # Imports assumed for this snippet: the standard Keras layers plus the attention
    # layers from CyberZHG's keras-self-attention package (pip install keras-self-attention).
    from keras.layers import Input, Embedding, Bidirectional, CuDNNLSTM, Dropout, Dense
    from keras.models import Model
    from keras_self_attention import SeqSelfAttention, SeqWeightedAttention

    input = Input(shape=(sequence_length,), name="Input")
    embedding = Embedding(input_dim=vocab_size, 
                          weights=[embedding_matrix],
                          output_dim=embedding_dim, 
                          input_length=sequence_length,
                          trainable=embeddings_trainable,
                          name="embedding")(input)
    lstm1_bi1 = Bidirectional(CuDNNLSTM(128, return_sequences=True, name='lstm1'), name="lstm-bi1")(embedding)
    drop1 = Dropout(0.2, name="drop1")(lstm1_bi1)
    attention1 = SeqSelfAttention(attention_width=attention_width)(drop1)
    lstm2_bi2 = Bidirectional(CuDNNLSTM(64, return_sequences=True, name='lstm2'), name="lstm-bi2")(attention1)
    drop2 = Dropout(0.1, name="drop2")(lstm2_bi2)
    attention2 = SeqWeightedAttention()(drop2)
    mid_dense = Dense(128, activation='sigmoid', name='mid_dense')(attention2)
    drop3 = Dropout(0.2, name="drop3")(mid_dense)
    output = Dense(cat_count, activation='sigmoid', name='output')(drop3)
    model = Model(inputs=input, outputs=output)

Results

The 3 tables below summarize the results for the different model configurations, using the different embeddings versions of:

  • Glove initialized embeddings, non-trainable
  • Glove initialized embeddings, trainable
  • Uninitialized embeddings, trainable

Non-trainable glove:

model     epoch  max-accuracy  min-loss
2-att-d   9      0.434         2.3655
2-att     8      0.432         2.4323
2-att     7      0.430         2.3974
2-att-d2  8      0.435         2.3466
2-att-dd  9      0.423         2.4046
1-att     8      0.405         2.3132
1-att-d   8      0.410         2.4817

Trainable glove:

model     epoch  max-accuracy  min-loss
2-att-d   6      0.537         1.9735
2-att     5      0.540         1.9940
2-att     4      0.533         1.9643
2-att-d2  6      0.529         2.0279
2-att-d2  5      0.528         1.9852
2-att-dd  7      0.505         2.2565
1-att     7      0.531         2.0869
1-att     5      0.530         1.9927
1-att-d   6      0.526         2.0180

Trainable uninitialized:

model     epoch  max-accuracy  min-loss
2-att-d   7      0.510         2.1198
2-att-d   6      0.502         2.106
2-att     7      0.522         2.0915
2-att-d2  8      0.504         2.2104
2-att-d2  6      0.490         2.1684
2-att-dd  9      0.463         2.4869
1-att     8      0.510         2.1413
1-att     7      0.505         2.1138
1-att-d   8      0.517         2.1552

The training curves in general look similar to this (picked from one of the best results):

attention results

So not too different from my previous results.

At least with this type of results, it is nice to see a realistic-looking training + validation accuracy and loss curve, with training accuracy going up and crossing validation at some point close to where overfitting starts. I have recently done a lot of Kaggle with the intent to learn how to use all these things in practice, and I think Kaggle is a really great place for this. However, the competitions seem to be geared to trick you somehow: the given training vs test sets are usually really weirdly skewed, and the competition is all about chasing those tiny fractions of accuracy. So compared to that, this is a nice, fresh view of the real world being more sane for ML applications. But I digress..

Summary insights from the above:

  • The tables above show how adding Attention increases the accuracy from a bit below 50% to about 54% in the best case, roughly a 10% relative improvement.
  • Besides the impact of adding Attention, the rest of the configuration changes seem like mere fiddling, without much impact on accuracy or loss.
  • 2 Attention layers are slightly better than one
  • Dropout quite consistently has a small negative impact on performance. As I did not try that many configurations, maybe it could be improved by tuning.
  • The final dense layer here mainly has a negative impact
  • The Kaggle kernels I ran this with have a nasty habit of sometimes cutting the output for some parts of the code. In this case it consistently cut the un-trainable Glove version output at around the 9th epoch, which is why all those in the above tables are listed as best around the 8th or 9th epoch. It might have shown small gains for one or two more epochs still. However, it was plateauing so strongly already that I do not think it is a big issue for these results.
  • Due to training with a smaller batch size taking longer, I had to limit the epochs to 10, down from the previous post's 15. On the other hand, Attention seems to converge faster, so it is not that bad a tradeoff.

Attention: My Final Notes

I used the Attention layer from the Github I linked. Very nice work in that Github in many related respects, BTW. Definitely worth checking out. This layer seems very useful; however, it seems to be tailored to the Github owner's specific needs and is not documented in much detail. There seem to be quite a few different variants of Attention layers around the internet, some only working on previous versions of Keras, others requiring 3D input, others only 2D.

For example, the above Attention layers work on 3D inputs: SeqSelfAttention takes 3D sequences as input and outputs 3D sequences, while SeqWeightedAttention takes 3D input and outputs 2D. There is at least one implementation being copy-pasted around in Kaggle kernels that uses 2D inputs and outputs. Some other custom Keras layers seem to have gone stale. Another I found on Github seems promising but has not been updated; one of its issues links to a patched version, though. In any case, my goal was not to compare different custom implementations, so I will just wait for the official one and play with this for now.

As noted, I ran these experiments on Kaggle kernels. At the time they were running on NVidia P100 GPUs, which are intended to be datacenter-scale products. These have 16GB of GPU memory, which at this time is a lot. Using the two attention layers I described above, I managed to exhaust this memory quite easily. This is maybe because I used a rather large sequence length of 1000 timesteps (words). The model summary I printed above shows the Attention layers having only 16449 and 129 parameters to train, so the implementation must otherwise require plenty of space. Not that I understand the details at such depth, but something to consider.

Some of the errors I got while setting up these Attention layers also seemed to indicate it was building a 4D representation by adding another dimension (of size 1000) on top of the layer it was paying attention to (the bi-LSTM in this case). This sort of makes sense, considering it takes a 3D input (the LSTM sequence output) and pays attention across it. The attention window is just one parameter that could be tuned in the Attention implementation I used, so a better understanding of this implementation and its tuning parameters/options/impacts would likely be useful and might help with many of my issues.

Overall, as far as I understand, using a smaller number of timesteps is quite common. Using fewer would likely still give very good results, while allowing more freedom to experiment with other parts of the model architecture without running out of memory. The memory issue forced me to run with a much smaller batch size of 48, down from 128 and higher before. This again slows training, as a smaller batch size takes longer to process the whole dataset.

"Official" support for Attention has been a long time coming (well, in terms of DL frameworks anyway..), and seems to be quite awaited feature (so the search-engines tell me). The comments I link above (in the Keras issue tracker on Github) also seem to contain various proposals for implementation. Perhaps the biggest issue still being the need to figure out how the Keras team wants to represent Attention to users, and how to make it as easy to use (and I suppose effective) as possible. Still, over these years of people waiting, maybe it would be nice to have something and build on that? Of course, as a typical OSS customer, I expect to have all this for free, so that is my saltmine.. 🙂

Some best practices / easy to understand documentation I would like to see:

  • Tradeoffs in using different types of Attention: 3D, 2D, attention windows, etc.
  • Attention in multi-layer architectures, where does it make the most sense and why (intuitively too)
  • Parameter explanations and tuning experiences / insights (e.g., attention window size)
  • Other general use types in different types of networks

Bug Report Classifier with LSTM on Keras

I previously did a review on applications of machine learning in software testing and network analysis. I was looking at updating that, maybe with some extra focus. As usual, I got distracted. This time by building an actual system to do some of the tasks discussed in those reviews. This post discusses how I built a bug report classifier based on bug report descriptions. Or more generally, they are issues listed in a public Jira, but never mind..

The classifier I built here is based on bi-directional LSTM (long short-term memory) networks using Keras (with Tensorflow). So deep learning, recurrent neural networks, word embeddings. Plenty of trendy things to see here.

Getting some data

The natural place to go looking for this type of data is open source projects and their bug databases. I used the Qt project bug tracker (see, even the address has the word "bug" in it, not "issue"). It seems to be based on the commonly used Jira platform. You can go to the web site, select some project, fill in filters, click on export, and directly get a CSV formatted output file that can be simply imported into Pandas and thus Python ML and data analytics libraries. This is what I did.

Since the export interface only allows for downloading data for 1000 reports at once, I scripted it. Using Selenium Webdriver I automated filling in download filters one month at a time. This script is stored in my ML-experiments Github. Along with a script that combines all the separate downloads into one CSV file. Hopefully I don’t move these around the repo too much and keep breaking the links.

Some points in building such a downloader (a rough browser-setup sketch follows the list):

  • Disable save dialog in browser via Selenium settings
  • Autosave on
  • Wait for download to complete by scanning partial files or otherwise
  • Rename latest created file according to filtered time
  • Check filesizes, bug creation dates and index continuity to see if something was missing
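
As a rough illustration of the first two points, the browser setup for silent CSV downloads looks something like the following. This is not my exact script, and the Firefox preference names here are from memory, so treat them as assumptions:

# Sketch: configure Firefox via Selenium so CSV exports are saved without a dialog.
from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)             # use a custom download dir
profile.set_preference("browser.download.dir", "/tmp/jira_exports")  # hypothetical path
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "text/csv")

driver = webdriver.Firefox(firefox_profile=profile)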

Exploring the data

Before running in full speed to build some classifier, it is generally good to explore the data a bit and see what it looks like, what can be learned, etc. Commonly this is called exploratory data analysis (EDA).

First, read the data and set dates to date types to enable date filters:

import pandas as pd

df_bugs = pd.read_csv("bugs/reduced.csv", 
             parse_dates=["Created", "Due Date", "Resolved"])
print(df_bugs.columns)

This gives 493 columns:

Index(['Unnamed: 0', 'Affects Version/s', 'Affects Version/s.1',
       'Affects Version/s.10', 'Affects Version/s.11', 
       'Affects Version/s.12',
       'Affects Version/s.13', 'Affects Version/s.14', 
       'Affects Version/s.15', 'Affects Version/s.16',
       ...
       'Status', 'Summary', 'Time Spent', 'Updated', 'Votes',
       'Work Ratio', 'Σ Original Estimate', 'Σ Remaining Estimate',
       'Σ Time Spent'], dtype='object', length=493)

This is a lot of fields for a bug report. The large count is because Jira seems to dump multiple-valued items into multiple columns. The above snippet shows an example of "Affects version" being split into multiple columns. If the largest number of affected versions on any single bug is 20, then all exported rows will have 20 columns for "Affects version": one item per column when there are many. A simple way I used to combine them was to count the non-null values:

#this dataset has many very sparse columns, 
#where each comment is a different field, etc.
#this just sums all these comment1, comment2, comment3, ... 
#as a count of such items
#(big_df here is the combined dataframe, e.g. df_bugs above,
# and all_cols is its full list of column names)
def sum_columns(old_col_name, new_col_id):
    old_cols = [col for col in all_cols if old_col_name in col]
    olds_df = big_df[old_cols]
    olds_non_null = olds_df.notnull().sum(axis=1)
    big_df.drop(old_cols, axis=1, inplace=True)
    big_df[new_col_id] = olds_non_null

#just showing two here as an example
sum_columns("Affects Version", "affects_count")
sum_columns("Comment", "comment_count")
...
print(df_bugs.columns)
Index(['Unnamed: 0', 'Assignee', 'Created', 'Creator', 
       'Description',
       'Due Date', 'Environment', 'Issue Type', 'Issue id', 
       'Issue key',
       'Last Viewed', 'Original Estimate', 'Parent id', 'Priority',
       'Project description', 'Project key', 'Project lead', 
       'Project name', 'Project type', 'Project url', 
       'Remaining Estimate', 'Reporter',
       'Resolution', 'Resolved', 'Security Level', 'Sprint', 
       'Sprint.1',
       'Status', 'Summary', 'Time Spent', 'Updated', 'Votes', 
       'Work Ratio', 'Σ Original Estimate', 
       'Σ Remaining Estimate', 'Σ Time Spent',
       'outward_count', 'custom_count', 'comment_count', 
       'component_count',
       'labels_count', 'affects_count', 'attachment_count',
       'fix_version_count', 'log_work_count'],
      dtype='object')

So, that would be 45 columns after combining several of the counts. Down from 493, and makes it easier to find bugs with most votes, comments, etc. This enables views such as:

(df_bugs.sort_values(by="Votes", ascending=False)
        [["Issue key", "Summary", "Issue Type", 
          "Status", "Votes"]].head(10))

most voted

In a similar way, bug counts by priority:

order = ['P0: Blocker', 'P1: Critical', 'P2: Important',
         'P3: Somewhat important','P4: Low','P5: Not important',
         'Not Evaluated']

df_2019["Priority"].value_counts().loc[order]
       .plot(kind='bar', figsize=(10,5))

issue priorities

In addition, I ran various other summaries and visualizations on it to get a bit more familiar with the data.

The final point was to build a classifier and see how well it does. A classifier needs a classification target. I went with the assigned component. So, my classifier tries to predict the component to assign a bug report to, using only the bug report's natural language description.

To start with, a look at the components. A bug report in this dataset can be assigned to multiple components. Similar to the "Affects version" above.

The distribution looks like this:

df_2019["component_count"].value_counts().sort_index()
1     64499
2      5616
3       596
4        64
5        10
6         5
7         1
8         3
9         1
10        3
11        1

This shows that having an issue assigned to more than 2 components is rare, and more than 3 very rare. For this experiment, I only collected the first two components the bugs were assigned to (if any), as sketched below. Of those, I simply used the first assigned component as the training target label. Further data could be had by adding training set items with labels also for the second and third components, or for all of them. But the first component served well enough for this experiment.
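
For reference, a minimal sketch of how the comp1/comp2 columns used below could be picked from the exported data. The "Component/s" column names are an assumption based on the "Affects Version/s" pattern shown earlier, and this is not necessarily exactly how my own preprocessing did it:

# Sketch: take the first two component columns from the Jira export as comp1/comp2.
# Column names are assumptions; if a bug has only one component, comp2 falls back to it.
df_2019["comp1"] = df_2019["Component/s"]
df_2019["comp2"] = df_2019["Component/s.1"].fillna(df_2019["Component/s"])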

How many unique ones are in those first two?

values = set(df_2019["comp1"].unique())
values.update(df_2019["comp2"].unique())
len(values)
172

So there would be 172 components to predict. And what does their issue count distribution look like?

counts = df_2019["comp1"].value_counts()
counts
1.  QML: Declarative and Javascript Engine     5260
2.  Widgets: Widgets and Dialogs               4547
3.  Documentation                              3352
4.  Quick: Core Declarative QML                2037
5.  Qt3D                                       1928
6.  QPA: Other                                 1898
7.  Build tools: qmake                         1862
8.  WebEngine                                  1842
9.  Packaging & Installer                      1803
10. Build System                               1801
11. Widgets: Itemviews                         1534
12. GUI: Painting                              1480
13. Multimedia                                 1478
14. GUI: OpenGL                                1462
15. Quick: Controls                            1414
16. GUI: Text handling                         1378
17. Core: I/O                                  1265
18. Device Creation                            1255
19. Quick: Controls 2                          1173
20. GUI: Font handling                         1141
...
155. GamePad                                     14
156. KNX                                         12
157. QPA: Direct2D                               11
158. ODF Writer                                   9
157. Network: SPDY                                8
158. GUI: Vulkan                                  8
159. Tools: Qt Configuration Tool                 7
160. QPA: KMS                                     6
161. Extras: X11                                  6
162. PIM: Versit                                  5
163. Cloud Messaging                              5
164. Testing: QtUITest                            5
165. Learning/Course Material                     4
166. PIM: Organizer                               4
167. SerialBus: Other                             3
168. Feedback                                     3
169. Systems: Publish & Subscribe                 2
170. Lottie                                       2
171. CoAP                                         1
172. Device Creation: Device Utilities            1

The above shows how the issue count distribution is very unbalanced.

To summarize the above, there are 172 components, with a very uneven distribution. Imagine trying to predict the correct component from 172 options, given that for some of them there is very limited data available. It would seem very difficult to learn to distinguish the ones with very little training data. I guess this skewed distribution might partly be due to new components having little data on them yet. Which, in a more realistic scenario, would merit some additional consideration. Maybe collecting them all into a new category like "Other, manually check", and updating the training data constantly, re-training the model as new issues/bugs are added. Well, that is probably a good idea anyway.

Besides these components with very few linked issues, there are several in the dataset marked as "Inactive". These would likely also be beneficial to remove from the training set, since we would not expect to see any new issues for them. I did not do it, as my experiment works fine even without that. In any case, this is what it looks like:

df_2019[df_2019["comp1"].str.contains("Inactive")]["comp1"].unique()
array(['(Inactive) Porting from Qt 3 to Qt 4',
       '(Inactive) GUI: QWS Integration (Qt4)', '(Inactive) Phonon',
       '(Inactive) mmfphonon', '(Inactive) Maemo 5', 
       '(Inactive) OpenVG',
       '(Inactive) EGL/Symbian', '(Inactive) QtQuick (version 1)',
       '(Inactive) Smart Installer ', '(Inactive) JsonDB',
       '(Inactive) QtPorts: BB10', '(Inactive) Enginio'], dtype=object)

I will use the "description" column for the features (the words in the description), and the filtered "comp1" column shown above for the target.

Creating an LSTM Model for Classification

This classifier code is available on Github. As shown above, some of the components have very few values (issues to train on). Longish story shorter, I cut out the targets with 10 or fewer values:

min_count = 10
df_2019 = df_2019[df_2019['comp1']
              .isin(counts[counts >= min_count].index)]

This enabled me to do a 3-way train-validation-test set split and still have some data for each 3 splits for each target component. A 3-way stratified split that is. Code:

from sklearn.model_selection import train_test_split

def train_val_test_split(X, y):
    X_train, X_test_val, y_train, y_test_val = \
       train_test_split(X, y, test_size=0.2, 
                        random_state=42, stratify=y)
    X_val, X_test, y_val, y_test = \
       train_test_split(X_test_val, y_test_val, test_size=0.25,
                        random_state=42, stratify=y_test_val)
    return X_train, y_train, X_val, y_val, X_test, y_test

Before using that, I need to get the data to train, that is the X (features) and y (target).

To get the features, tokenize the text. For an RNN the input data needs to be a fixed-length vector (of tokens), so cut the document at seq_length if longer, or pad it to that length if shorter. This uses the Keras tokenizer, which I guess can be trusted to produce suitable output for Keras..

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

def tokenize_text(vocab_size, texts, seq_length):
    tokenizer = Tokenizer(num_words=vocab_size)
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)

    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))

    X = pad_sequences(sequences, maxlen=seq_length)
    print('Shape of data tensor:', X.shape)

    return texts, X, tokenizer

That produces the X. To produce the y:

from sklearn.preprocessing import LabelEncoder

df_2019.dropna(subset=['comp1', "Description"], inplace=True)
# encode class values as integers 
# so they work as targets for the prediction algorithm
encoder = LabelEncoder()
df_2019["comp1_label"] = encoder.fit_transform(df_2019["comp1"])

The "comp1_label" in above now has the values for the target y variable.

To put these together:

data = df_2019["Description"]
vocab_size = 20000
seq_length = 1000
data, X, tokenizer = tokenize_text(vocab_size, data, seq_length)

y = df_2019["comp1_label"]
X_train, y_train, X_val, y_val, X_test, y_test = \
                            train_val_test_split(X, y)

The 3 sets of y_xxxx variables still need to be converted to Keras format, which is a one-hot encoded 2D-matrix. To do this after the split:

from keras.utils import to_categorical

y_train = to_categorical(y_train)
y_val = to_categorical(y_val)
y_test = to_categorical(y_test)

Word Embeddings

I am using Glove word vectors; in this case the relatively small set based on 6 billion tokens (words), with 300 dimensions. The vectors are stored in a text file, one word per line along with its vector values: the first item on a line is the word, followed by the 300-dimensional vector values for it. The following loads this into the embeddings_index dictionary, keys being words and values the vectors.

import os
import numpy as np

def load_word_vectors(glove_dir):
    print('Indexing word vectors.')

    embeddings_index = {}
    f = open(os.path.join(glove_dir, 'glove.6B.300d.txt'),
             encoding='utf8')
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()

    print('Found %s word vectors.' % len(embeddings_index))
    return embeddings_index

With these loaded, convert the embedding index into matrix form that the Keras Embedding layer uses. This simply puts the embedding vector for each word at a specific index in the matrix. So if word "bob" is in index 10 in word_index, the embedding vector for "bob" will be in embedding_matrix[10].

def embedding_index_to_matrix(embeddings_index, vocab_size,
                              embedding_dim, word_index):
    print('Preparing embedding matrix.')

    # prepare embedding matrix
    num_words = min(vocab_size, len(word_index))
    embedding_matrix = np.zeros((num_words, embedding_dim))
    for word, i in word_index.items():
        if i >= vocab_size:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

Build the bi-LSTM model

I use two model versions here. The first one uses the basic LSTM layer from Keras. The second one uses the CUDA-optimized CuDNNLSTM layer. I used the CuDNNLSTM version to train the model on GPU, saved the weights after training, and then loaded the weights into the plain LSTM version. I then used the plain LSTM version to do the predictions on my laptop when developing and demoing this.

Plain LSTM version:

def build_model_lstm(vocab_size, embedding_dim, 
                     embedding_matrix, sequence_length, cat_count):
    input = Input(shape=(sequence_length,), name="Input")
    embedding = Embedding(input_dim=vocab_size, 
                          weights=[embedding_matrix],
                          output_dim=embedding_dim,
                          input_length=sequence_length,
                          trainable=False,
                          name="embedding")(input)
    lstm1_bi1 = Bidirectional(LSTM(128, return_sequences=True,
                      name='lstm1'), name="lstm-bi1")(embedding)
    drop1 = Dropout(0.2, name="drop1")(lstm1_bi1)
    lstm2_bi2 = Bidirectional(LSTM(64, return_sequences=False,
                      name='lstm2'), name="lstm-bi2")(drop1)
    drop2 = Dropout(0.2, name="drop2")(lstm2_bi2)
    output = Dense(cat_count, 
                   activation='sigmoid', name='sigmoid')(drop2)
    model = Model(inputs=input, outputs=output)
    model.compile(optimizer='adam', 
          loss='categorical_crossentropy', metrics=['accuracy'])
    return model

CuDNNLSTM version:

def build_model_lstm_cuda(vocab_size, embedding_dim, 
                 embedding_matrix, sequence_length, cat_count):
    input = Input(shape=(sequence_length,), name="Input")
    embedding = Embedding(input_dim=vocab_size,
                          output_dim=embedding_dim, 
                          weights=[embedding_matrix],
                          input_length=sequence_length,
                          trainable=False,
                          name="embedding")(input)
    lstm1_bi1 = Bidirectional(CuDNNLSTM(128, 
                              return_sequences=True, name='lstm1'),
                              name="lstm-bi1")(embedding)
    drop1 = Dropout(0.2, name="drop1")(lstm1_bi1)
    lstm2_bi2 = Bidirectional(CuDNNLSTM(64, 
                              return_sequences=False, name='lstm2'),
                              name="lstm-bi2")(drop1)
    drop2 = Dropout(0.2, name="drop2")(lstm2_bi2)
    output = Dense(cat_count, 
                   activation='sigmoid', name='sigmoid')(drop2)
    model = Model(inputs=input, outputs=output)
    model.compile(optimizer='adam', 
          loss='categorical_crossentropy', metrics=['accuracy'])
    return model
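
Since the two variants have identical layer structures, the weight transfer between them is straightforward. A minimal sketch of the idea (not verbatim from my code):

# Sketch: train the CuDNNLSTM version on GPU, save its weights, then load them
# into the plain LSTM version for CPU-based prediction (the workflow described above).
model_gpu = build_model_lstm_cuda(vocab_size=vocab_size, embedding_dim=embedding_dim,
                                  embedding_matrix=embedding_matrix,
                                  sequence_length=seq_length,
                                  cat_count=len(encoder.classes_))
model_gpu.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=15, batch_size=128)
model_gpu.save_weights("lstm_weights.h5")

model_cpu = build_model_lstm(vocab_size=vocab_size, embedding_dim=embedding_dim,
                             embedding_matrix=embedding_matrix,
                             sequence_length=seq_length,
                             cat_count=len(encoder.classes_))
model_cpu.load_weights("lstm_weights.h5")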

The structure of the above models visualizes to this:

model structure

The first layer is an embedding layer, and uses the embedding matrix from the pre-trained Glove vectors. This is followed by the two bi-LSTM layers, each with a Dropout layer behind it. The bi-LSTM layer looks at each word along with its context (as I discussed previously). The dropout layers help avoid overfitting. Finally, a dense layer is used to make the prediction from "cat_count" categories. Here cat_count is the number of categories to predict. It is actually categories and not cats, sorry about that.

The "weights=[embedding_matrix]" parameter given to the Embedding layer is what can be used to initialize the pre-trained word-vectors. In this case, those would be the Glove word-vectors. The current Keras Embedding docs say nothing about this parameter, which is a bit weird. Searching for this on the internet also seems to indicate it would be deprecated etc. but it also seems difficult to find a simple replacement. But it works, so I go with that..

In a bit more detail, the model summarizes to this:

model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
Input (InputLayer)           (None, 1000)              0         
_________________________________________________________________
embedding (Embedding)        (None, 1000, 300)         6000000   
_________________________________________________________________
lstm-bi1 (Bidirectional)     (None, 1000, 256)         440320    
_________________________________________________________________
drop1 (Dropout)              (None, 1000, 256)         0         
_________________________________________________________________
lstm-bi2 (Bidirectional)     (None, 128)               164864    
_________________________________________________________________
drop2 (Dropout)              (None, 128)               0         
_________________________________________________________________
sigmoid (Dense)              (None, 165)               21285     
=================================================================
Total params: 6,626,469
Trainable params: 626,469
Non-trainable params: 6,000,000
_________________________________________________________________

This shows how the embedding layer turns the input into a suitable shape for LSTM input as I discussed in my previous post. That is, 1000 timesteps, each with 300 features. Those being the 1000 tokens for each document (issue report description) and 300 dimensional word-vectors for each token.

Another interesting point is the text at the end of the summary: "Non-trainable params: 6,000,000". This matches the number of parameters in the summary for the embedding layer. When the embedding layer is given the parameter "trainable=False", all of its parameters are fixed. If this is set to True, then all these parameters will be trainable as well.

Training it

Training the model is simple now that everything is set up:

checkpoint_callback = ModelCheckpoint(filepath=
   "./model-weights-issue-pred.{epoch:02d}-{val_loss:.6f}.hdf5",
   monitor='val_loss', verbose=0, save_best_only=True)

model = build_model_lstm_cuda(vocab_size=vocab_size,
                    embedding_dim=embedding_dim,
                    sequence_length=seq_length,
                    embedding_matrix=embedding_matrix,
                    cat_count=len(encoder.classes_))

history = model.fit(X_train, y_train,
          batch_size=128,
          epochs=15,
          validation_data=(X_val, y_val),
          callbacks=[checkpoint_callback])

model.save("issue_model_word_embedding.h5")

score, acc = model.evaluate(x=X_test,
                            y=y_test,
                            batch_size=128)
print('Test loss:', score)
print('Test accuracy:', acc)

Notice that I use the build_model_lstm_cuda() version here. That is to train on the GPU environment, to have some sensible training time.

The callback given will just save the model weights when the validation score improves. In this case it monitors the validation loss getting smaller (no mode="max" as in my previous version).

Predicting

Prediction with a trained model:

def predict(bug_description, seq_length):
    #texts_to_sequences vs text_to_word_sequence?
    sequences = tokenizer.texts_to_sequences([bug_description])

    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))

    X = pad_sequences(sequences, maxlen=seq_length)

    probs = model.predict(X)
    result = []
    #le_id_mapping maps an integer label index back to the component name
    #(built from the LabelEncoder, see the sketch below)
    for idx in range(probs.shape[1]):
        name = le_id_mapping[idx]
        prob = (probs[0, idx]*100)
        prob_str = "%.2f%%" % prob
        #print(name, ":", prob_str)
        result.append((name, prob))
    return result
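
The le_id_mapping used above just maps the integer label index back to the component name. A minimal sketch of building it from the LabelEncoder used earlier (this exact helper is not shown in the snippets above):

# Sketch: LabelEncoder assigns labels 0..n-1 in the order of encoder.classes_,
# so the reverse mapping is just an enumeration of the classes.
le_id_mapping = {idx: name for idx, name in enumerate(encoder.classes_)}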

Running this on the bug QTBUG-74496 gives the following top predictions:

Quick: Other: 0.3271
(Inactive) QtQuick (version 1): 0.4968
GUI: Drag and Drop: 0.7292
QML: Declarative and Javascript Engine: 2.6450
Quick: Core Declarative QML : 5.8533

A bigger number signifies a higher likelihood given by the classifier. This highlights many of the topics I mentioned above. There is one inactive component in there, which again suggests it might be better to remove all inactive ones from the training set. The top prediction (Quick: Core Declarative QML) is not the one assigned to the report at this time, but the second highest (QML: Declarative and Javascript Engine) is. They both seem to be associated with the same top-level component (QML), and I do not have the expertise to say why one might be better than the other in this case.

In most of the issue reports I tried this on, it seemed to get the "correct" one as marked on the issue tracker. In the ones that did not match, the suggestion always seemed to make sense (sometimes more than what had been set by whoever sets the value), and in case of a mismatch, the "correct" one was commonly still in the top suggestions. But overall, component granularity might be useful to consider as well when building these types of classifiers and their applications.

Usefulness of pre-trained word-embeddings

When doing the training, I started with using the Glove embeddings, and sometimes just accidentally left them out and trained without them. This reminded me to do an experiment to see how much effect using the pre-trained embeddings actually has, and how the accuracy etc. is affected with or without them. So I trained the model with different options for the Embedding layer:

  • fixed Glove vectors (trainable=False)
  • trainable Glove vectors (trainable=True)
  • trainable, non-initialized vectors (trainable=True, no Glove)

The training accuracy/loss curves look like this:

embedding accuracies

The figure is a bit small but you can hopefully expand it. At least I uploaded it bigger :).

These results can be summarized as:

  • Non-trainable Glove alone improves all the way for the 15 iterations (epochs) and reaches validation accuracy of about 0.4 and loss around 2.55. The improvements got really small there so I did not try further epochs.
  • Trainable, uninitialized (no Glove) got best validation accuracy/loss at epoch 11 for 0.488 accuracy and 2.246 loss. After this it overfits.
  • Trainable with Glove initialization reaches best validation accuracy/loss at epoch 8 for 0.497 accuracy and 2.154 loss. After this it overfits.

Some interesting points I gathered from this:

  • Overall, non-trainable Glove gives quite poor results (though I guess still usable) compared to the trainable embeddings in the two other options.
  • Glove initialized but further trained embeddings converge much faster and get better scores.
  • I guess further trained Glove embeddings would be a form of "transfer" learning. Cool, can I put that in my CV now?

My guess is that the bug descriptions have many terms not commonly used in general domains, so the general Glove embeddings need further training to be effective on them. When I did some exploratory analysis of the data (e.g., TF-IDF across components), these types of terms were actually quite visible. However, they are intermixed with general terms, which do benefit from Glove, and mixing the two gives the best results. Just my "educated" guess. Another of my guesses is that this would be quite a similar result in other domains with their own domain-specific terminology.

General Notes (or "Discussion" in academic terms..)

The results from the training and validation showed about 50% accuracy. In a binary classification problem this would not be so great; more like equal to random guessing. But with 160+ targets to choose from, this seems to me to be very good. The loss is maybe a better metric, but I am not that good at interpreting the loss against 160+ categories. Simply put, smaller is better, and it should overall signify how far off the predictions for the categories are (but how to measure and interpret the true distance over all categories, when only one is given as the correct target label, you tell me?).

Also, as noted earlier, an issue can be linked to several components. And from my tries of running this with new data and comparing results, the mapping is not always very clear, and there can be multiple "correct" answers. This is also shown in my prediction example above. The results given by Keras predict() actually list the probabilities for each of the 160+ potential targets. So if accuracy just measures the likelihood of getting it exactly right, it misses the ones that are predicted at position 2, 3, and so on. One way I would see using this would be to provide assistance on selections to an expert analyzing the incoming bug reports. In such a case, having the "correct" answer even in the few top predictions would seem useful. Again, as shown in my prediction example above.
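
To quantify that idea, something like top-k accuracy could be computed from the same predict() output. A minimal sketch (not something I measured for the results above):

# Sketch: top-5 accuracy, i.e. how often the true component is among the
# 5 highest-probability predictions. Assumes model, X_test, y_test from above.
import numpy as np

probs = model.predict(X_test)               # shape (n_samples, n_categories)
true_labels = np.argmax(y_test, axis=1)     # y_test is one-hot encoded
top5 = np.argsort(probs, axis=1)[:, -5:]    # indices of the 5 largest per row
top5_acc = np.mean([t in row for t, row in zip(true_labels, top5)])
print("top-5 accuracy:", top5_acc)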

With that, the 50% chance of getting it 100% correct for the top prediction actually seems very good. Consider that there are over 160 possible targets to predict. So not nearly as simple as a binary classifier, where 50% would match random guessing. Here this is much better than that.

Besides the LSTM, I also tried more traditional classifiers on this same task. I tried several versions, including multinomial naive Bayes, random forest, and LightGBM. The features I provided were heavily pre-processed word tokens in their TF-IDF format. Those classifiers were pretty poor, perhaps even uselessly poor. Largely perhaps due to my lack of skills in feature engineering for the domain, and the lack of hyperparameter optimization, but that was the case nevertheless. With that background, I was surprised to see such good performance from the LSTM version.

Overall, the predictions given by this classifier are not perfect, but much closer than I expected to get. Getting the correct one out of 160+ categories most of the time, and often close to the correct one, based only on the natural language description, was a very nice result for me.

In cases where the match is not perfect, I believe it still provides valuable input. Either the match is given in the other top suggestions, or maybe we can learn something by considering why some of the others are suggested and whether there is some real meaning behind it. All the mismatches I found made sense when considering the reported issue from different angles. I would guess the same might hold for other domains and similar classifiers as well.

Many other topics to investigate would include:

  • different types of model architectures (layers, neuron counts, GRU, 1D CNN, …)
  • Attention layers (Keras still does not include support but they are very popular now in NLP)
  • Different dimensions of embeddings
  • Different embeddings initializers (Word2Vec)
  • effects of more preprocessing
  • N-way cross-validation
  • training the final classifier on the whole training data at once, after finishing the model tuning etc.

Learning to LSTM

What is this

This is about time-series prediction/classification with neural networks using Keras. I will not go into theory or a description of recurrent neural nets or LSTM itself; there are plenty of tutorials out there, and search engines will give plenty more. Try some if you are not already familiar. I just try to focus on what I found confusing after reading those, and how that went for me.

EDIT: Links Kaggle kernel, Github

Overly long intro

Recently I have been trying to learn to use LSTM (Long Short Term Memory) networks. I picked up the Kaggle VSB Power Line Fault Detection competition. The point of this competition was to classify power line signals as faulty or not faulty. The given data looks like this:

signal plot

Just based on the raw signal data, the challenge in this competition was to classify a signal as faulty/not faulty.

As shown in the figure above, there are 800k values per signal in the data. There are 8712 signals in the training set. These signals are 3-phase power measurements: always 3 signals together form a single 3-phase "measurement" or observation. The total number of such 3-phase groups is 8712/3 = 2904. The data looks something like this:

train_meta = pd.read_csv("../input/metadata_train.csv")
#There are 3 rows/signals per measurement, so 6 prints first 2
train_meta.head(6)

The shape of the (training) signal data is 800k rows and 8712 columns: one column for each signal, one row for each measured value, with the 800k values of a signal together covering a 20 millisecond measurement window.

df_sig.shape

(800000, 8712)

A few rows for the first two 3-signal sets (columns 1-3 and 4-6):

The goal is to use this raw data to identify fault patterns. LSTM networks were very popular in this competition, as the data is a set of 8712 time-series instances. So a good place to learn how to use LSTM.

I like Kaggle in general for this, as there are good kernels to get started with, and discussion on what works. Of course, I always do poorly in the competitions, but I find it good practice anyway.

Input shape

One thing I found confusing (and based on my internet searches, many others have too) is how to shape and form the input for the LSTM network, both to train it and to run the "predictions".

The input shape should be a 3-dimensional array of (number of observations, number of timesteps, number of features). Using the data from above, what would that be?

Each measured signal would be one observation. In the way the above data is structured, this would be 8712 observations, 800000 timesteps, and 1 feature (the raw value). So the input shape would be (8712, 800000, 1), as sketched after the list below.

This is because we have:

  • 8712 separate signal observations,
  • 800000 measurements for each signal (covering a 20 ms window, so a time-series),
  • 1 feature, which is just the raw measurement value for the signal.
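
In code, getting from the (800000, 8712) dataframe shown above to that shape would be something like this (just a sketch; as discussed next, I never actually trained on this full-length version):

# Sketch: transpose so each row is one signal, then add the single-feature axis.
X_full = df_sig.values.T.reshape(8712, 800000, 1)
print(X_full.shape)   # (8712, 800000, 1)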

Number of timesteps

I initially tried to train the network with this setup. It didn't go very well. I found some links saying the number of timesteps should be much smaller: below 200, max around 250-500, 10 steps or fewer, or up to 1000 timesteps. Not really sure which, but certainly much less than 800k.

One of the most popular kernels in the competition was using a 160 timestep version. How do you go from 800k timesteps to 160 timesteps? By "binning" the 800k into 160 buckets. The size of such a bucket is 800000 / 160 = 5000. In Python this looks something like this:

bkt_count = 160
data_size = 800000
bkt_size = int(data_size/bkt_count)
for i in range(0, data_size, bkt_size):
    # cut data to bkt_size (bucket size)
    bkt_data_raw = data_sig[i:i + bkt_size]
    #now here bkt_data_raw needs to be summarized

The above would process the data into a set of 160 buckets, so producing the 160 timesteps.

Features

We can then calculate various summary statistics for these "buckets" as features, for example:

bkt_data_raw = data_sig[i:i + bkt_size]
bkt_avg_raw = bkt_data_raw.mean() #1
bkt_sum_raw = bkt_data_raw.sum() #1
bkt_std_raw = bkt_data_raw.std() #1
bkt_std_top = bkt_avg_raw + bkt_std_raw #1
bkt_std_bot = bkt_avg_raw - bkt_std_raw #1
bkt_percentiles = np.percentile(bkt_data_raw, [0, 1, 25, 50, 75, 99, 100]) #7

The above gives 1+1+1+1+1+7 = 12 features. I had a few more features, and experimented with various others, but this works to illustrate the concept.

So, assume we have the 160 timesteps with 12 summary features each. This would mean the LSTM input shape is now: (8712, 160, 12).

Transforming the input to 3-dimensions

Assume we loop through all the data (800k values) for each signal, and turn each signal into 160 rows with 12 features. How do we turn that into (8712, 160, 12) for the LSTM? The key (for me) is the numpy.reshape() function. We can do something like this:

X = df_train.values.reshape(measures, timesteps, total_features)

This works nicely because underneath most things in Python ML libraries rely on Numpy arrays (or arrays of arrays etc.). Assuming that df_train in the code above has a number of values that is divisible by the given reshape sizes, Numpy will reshape (as the name says..) it into a set of multi-dimensional arrays as requested.

In the above, the number of elements for one signal is 160*12 = 1920, because a signal now has 160 timesteps (buckets), each with 12 features. Since we know the number of measured signals we expect to have, the total number of elements can be calculated as 8712*1920 = 16727040.

There are many ways to lay out the initial set of features. I started by putting all the features for one signal on a single row, so I had rows with 1920 columns, and 8712 such rows. Reshaping in the end worked fine and produced the correct shape of (8712, 160, 12).

However, with such a shape, it was a bit difficult to apply Pandas operations on the data, such as scaling all the features using sklearn scalers. Pandas operations expect a dataframe to have 2-dimensional data, where each column holds data for a single feature. For example, the scalers scale one column at a time. With 1920 columns, the data for a given feature was in every 12th column.

To address this, I ended up with data in the format of 160 * 8712 = 1393920 rows, each row with N columns, where N equals the number of features I had, in this case 12. Each timestep of a signal is on its own row, with the 160 rows for one signal following each other, immediately followed by another 160 rows for the next signal, and so on. The final size in this case is a dataframe with a shape of (1393920, 12): the 160 timesteps for all 8712 signals.

This would look something like this for the first signal, then the second signal (dataframe screenshots in the original post), and so on for all 8712 signals in chunks of 160 rows: each feature in its own column, each timestep on its own row.
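
To make the layout concrete, here is a rough sketch of how that long format can be built. bucket_features() is a made-up helper standing in for the bucket/feature code shown earlier:

# Sketch: build the long-format feature dataframe, 160 rows (timesteps) per signal,
# one feature per column. bucket_features() is hypothetical: it should return the
# 12 summary features for one bucket, as computed in the snippets above.
rows = []
for sig_id in range(8712):                       # all signals
    data_sig = df_sig.iloc[:, sig_id]            # one 800k-value signal
    for i in range(0, data_size, bkt_size):      # 160 buckets per signal
        bkt_data_raw = data_sig[i:i + bkt_size]
        rows.append(bucket_features(bkt_data_raw))

df_train = pd.DataFrame(rows)                    # shape (1393920, 12)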

Since each feature is in the same column for all the signals and all timesteps, general transformation tools such as sklearn scalers can be applied very simply:

from sklearn.preprocessing import MinMaxScaler

minmax = MinMaxScaler(feature_range=(-1,1))
df_train_scaled = pd.DataFrame(minmax.fit_transform(df_train))

This scales all the features for all timesteps and all observations (signals here) at once correctly. Well, I think it does anyway, so correct me if wrong :).

The scaled version for the 1st signal:

And to reshape it into the 3-dimensional array shaped (8712, 160, 12):

observations = 8712
timesteps = 160
total_features = 12
X = df_train_scaled.values.reshape(observations, timesteps, total_features)

The dataframe .values attribute is a Numpy array, so the Numpy reshape function is available on it.

X would now be suitable to use as input for an LSTM network. The one I have not shown how to build yet. Only a million words and stuff so far, but no LSTM, and this was supposed to be all about LSTM. Nice. Good job.

Parallelizing pre-processing

On Kaggle, I eventually built my own kernel for this pre-processing to create features and scale them, and to save the results for later model experiments. It uses Python multiprocessing to create the buckets and features. I found this to have multiple benefits:

  • The transformation of the raw input data can be done once and used many times to experiment with different LSTM and other ML models and NN architectures. Since pre-processing large datasets takes time, this speeds up the process immensely.

  • Pandas is built to be single-core (as is Python itself..), so you have to do tricks like this, or else your multiple cores are just sitting idle and wasted.

  • Kaggle specific: By running preprocessing in a separate kernel, I can run it in parallel in one kernel while experimenting with models in other kernels.

  • Kaggle specific: Kaggle CPU kernels have 4 CPU cores, allowing 2x faster preprocessing than in GPU kernels, which have only 2 CPU cores. But you need the GPU kernels to train LSTM models.

In some parallel architectures like PySpark this would be less of a problem, but I do not have access to such systems, so I work with what I have, huh. Maybe someday if I am rich enough or whatever. Plz send food, beer, monies, and epic computer parts, thx.
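
A rough sketch of the parallelization idea (the actual kernel differs, and process_signal() here just computes bucket means to keep the sketch short):

# Sketch: split the signals across worker processes and build bucket features in parallel.
from multiprocessing import Pool

def process_signal(sig_id):
    # One worker handles one signal: bucket it into 160 timesteps and
    # compute per-bucket features (only the mean here, for brevity).
    data_sig = df_sig.iloc[:, sig_id].values
    bkts = data_sig.reshape(160, -1)        # 160 buckets of 5000 values each
    return bkts.mean(axis=1)

# Works as-is on Linux/Kaggle; on Windows this needs the usual __main__ guard.
with Pool(processes=4) as pool:             # Kaggle CPU kernels had 4 cores
    results = pool.map(process_signal, range(8712))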

Building the LSTM model

I made several attempts at building a model. Here is one quite directly based on the popular Kaggle kernel I linked at the beginning:

# Assumes the usual Keras imports (Input, Dense, Bidirectional, CuDNNLSTM, Model).
# matthews_correlation is a custom metric from the referenced kernel, not built into Keras.
def create_model4(input_data):
    input_shape = input_data.shape
    inp = Input(shape=(input_shape[1], input_shape[2],), name="input_signal")
    x = Bidirectional(CuDNNLSTM(128, return_sequences=True, name="lstm1"), name="bi1")(inp)
    x = Bidirectional(CuDNNLSTM(64, return_sequences=False, name="lstm2"), name="bi2")(x)
    x = Dense(128, activation="relu", name="dense1")(x)
    x = Dense(64, activation="relu", name="dense2")(x)
    x = Dense(1, activation='sigmoid', name="output")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[matthews_correlation])
    return model

Few points here:

  • CuDNNLSTM is a specialized LSTM layer optimized for NVidia GPUs. I always thought Keras would automatically choose the optimal GPU implementation based on the backend. Seems not.

  • CuDNN is NVIDIA’s Deep Neural Network library.

  • Bidirectional LSTM refers to a network using information about "past and future". This is only useful if you actually know the future, as in translating text from one language to another and you know what words will follow the current word. In this case, I know the whole signal, so the LSTM can also look at what values follow the current "bin" value in the following bins (timesteps).

  • The "return_sequences" flag tells the layer to give out a similar 3D array as the original input. So each of the 128 cells in the first layer will not just output on value for the sequence but also output an intermediate value for each timestep. So instead of shape (128), the output would be (160, 128).

  • The second LSTM layer outputs a shape of (64), which is just the final output of the LSTM after processing the timesteps. This is a suitable shape for the following dense layer, since it is just a flat vector. This happens for each row of the input as it comes through lstm1.

But let's see what the model actually looks like.

Inspecting the model

First, a one-way model to keep it a bit simpler still:

def create_model3(input_data):
    input_shape = input_data.shape
    inp = Input(shape=(input_shape[1], input_shape[2],), name="input_signal")
    x = CuDNNLSTM(128, return_sequences=True, name="lstm1")(inp)
    x = CuDNNLSTM(64, return_sequences=False, name="lstm2")(x)	    
    x = Dense(128, activation="relu", name="dense1")(x)
    x = Dense(64, activation="relu", name="dense2")(x)
    x = Dense(1, activation='sigmoid', name="output")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[matthews_correlation])
    return model

Summarizing the model:

model.summary()

Gives:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_signal (InputLayer)    (None, 160, 12)           0         
_________________________________________________________________
lstm1 (CuDNNLSTM)            (None, 160, 128)          72704     
_________________________________________________________________
lstm2 (CuDNNLSTM)            (None, 64)                49664     
_________________________________________________________________
dense1 (Dense)               (None, 128)               8320      
_________________________________________________________________
dense2 (Dense)               (None, 64)                8256      
_________________________________________________________________
output (Dense)               (None, 1)                 65        
=================================================================
Total params: 139,009
Trainable params: 139,009
Non-trainable params: 0
_________________________________________________________________

And to visualize this (in Jupyter):

from keras.utils import plot_model
plot_model(model, show_shapes=True, to_file="lstm.png")
from IPython.display import Image
Image("lstm.png")

Giving us:

This shows how the LSTM parameters affect the basic structure of the network.

  • input_signal: this layer shows the shape of the input data: 160 timesteps, 12 features. None stands for any number of rows (observations).

  • lstm1: 128 LSTM units, with return_sequences=True. Now the output is (None, 160, 128), where 128 matches the number of LSTM units, and replaces the number of features in the input. So 128 features, each one produced by a single LSTM "unit".

  • lstm2: 64 LSTM units, with return_sequences=False. Output shape is (None, 64). Outputting values at every timestep is disabled (return_sequences=False), so this is just 2 dimensional output. 64 features, one for each LSTM "unit". This for each row in the input as it comes through lstm1.

  • dense1, dense2, output: regular dense network layers.

And the same for a bi-directional LSTM:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_signal (InputLayer)    (None, 160, 60)           0         
_________________________________________________________________
bi1 (Bidirectional)          (None, 160, 256)          194560    
_________________________________________________________________
bi2 (Bidirectional)          (None, 128)               164864    
_________________________________________________________________
dense1 (Dense)               (None, 128)               16512     
_________________________________________________________________
dense2 (Dense)               (None, 64)                8256      
_________________________________________________________________
output (Dense)               (None, 1)                 65        
=================================================================
Total params: 384,257
Trainable params: 384,257
Non-trainable params: 0
_________________________________________________________________

So bi-directional doubles the number of features the LSTM layers produce, as it goes both forward and backward across the timesteps.
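To see these two effects in isolation (return_sequences keeping or dropping the time dimension, and Bidirectional doubling the feature count), a minimal standalone sketch like the following can be used; the sizes mirror the example values in the first summary above:

from keras.models import Model
from keras.layers import Input, LSTM, Bidirectional

# example input: 160 timesteps, 12 features per timestep
inp = Input(shape=(160, 12))
# return_sequences=True keeps the time dimension: output shape (None, 160, 2*128)
x = Bidirectional(LSTM(128, return_sequences=True))(inp)
# return_sequences=False keeps only the last timestep: output shape (None, 2*64)
x = Bidirectional(LSTM(64, return_sequences=False))(x)
Model(inputs=inp, outputs=x).summary()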

Training and predicting

Training and predicting with the LSTM model is no different from others, but a quick look anyway.

Training:

splits = list(StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=2019).split(X, y))
preds_val = []
y_val = []
for idx, (train_idx, val_idx) in enumerate(splits):
    K.clear_session()
    print("Beginning fold {}".format(idx+1))
    train_X, train_y, val_X, val_y = X[train_idx], y[train_idx], X[val_idx], y[val_idx]
    model = model_lstm(train_X.shape)
    ckpt = ModelCheckpoint('weights_{}.h5'.format(idx), save_best_only=True, save_weights_only=True, verbose=1, monitor='val_matthews_correlation', mode='max')
    model.fit(train_X, train_y, batch_size=128, epochs=50, validation_data=[val_X, val_y], callbacks=[ckpt])
    model.load_weights('weights_{}.h5'.format(idx))

Prediction:

preds = []
for i in range(N_SPLITS):
    model.load_weights('weights_{}.h5'.format(i))
    pred = model.predict(X_test_input, batch_size=300, verbose=1)
    pred_3 = []
    for pred_scalar in pred:
        for _ in range(3):  # repeat each prediction three times
            pred_3.append(pred_scalar)
    preds.append(pred_3)
threshold = 0.5
preds_test = (np.squeeze(np.mean(preds, axis=0)) > threshold).astype(np.int)
submission['target'] = preds_test
submission.to_csv('submission_{}.csv'.format(seed), index=False)
submission.head()

With this, I managed to train the LSTM and have it produce actually meaningful results. Not that I got anywhere in the competition, but it was a good experience, since in my first tries I produced completely broken input data, formatted all wrong. Training with that broken data gave zero results, which is to be expected since the features were all mixed up. I feel I mostly learned to use an LSTM, even if I missed out in the Kaggle competition by the smallest fraction of the target metric.

From all this, I wandered off a bit to look at other things, such as what the actual structure of an LSTM network is when built with Keras, how to combine other types of data and layers alongside LSTM in the same network, and what adversarial validation is. Maybe I will manage to write a bit about them next.

My Pandas Cheat Sheet

https://stackoverflow.com/questions/36794433/python-using-multiprocessing-on-a-pandas-dataframe
http://www.racketracer.com/2016/07/06/pandas-in-parallel/

  import pandas as pd

  pd.__version__

Creating Dataframes

  df = pd.DataFrame()

select multiple columns as a dataframe from a bigger dataframe:

  df2 = df[['Id', 'team', 'winPlacePerc']]

select a single column as a dataframe:

  df2 = df[['name']]
  # double square brackets make the result a dataframe,
  # single brackets make it a series

pandas axis:

  axis 1 = columns, axis 0 = rows

get a series from a dataframe column filtered by another column:

  zero_names = df[df["weights"] < 10000]["names"]

turn it into a list:

  zn_list = zero_names.tolist()

N-way split:

  X_train, X_test, y_train, y_test, r_train, r_test = \
   model_selection.train_test_split(X, y, r,
      test_size=0.25, random_state=99)

load only part of data: nrows
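A minimal example (the file name here is just a placeholder):

  df_part = pd.read_csv("train.csv", nrows=100000)
  # reads only the first 100k rows of the file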

Accessing Values

find row at 1111th index in dataframe:

  df.iloc[1111]

find row with value 1111 for index in dataframe:

  df.loc[1111]

set value in cell (stackoverflow):

  children_df.at[555, "visited"] = True 
    #555 here refers to row index

modify row while iteration over it:

  for index, row in df.iterrows():
    df.at[index,'open_change_p1'] = day_change_op1

drop last 5 columns

  df_train.drop(df_train.columns[-5:], axis=1)

print numpy data types in multidimensional array (in jupyter return value is printed):

[type(row) for row in values[0]]

Aggregates

calculate mean for ‘team’ and ‘winPlacePerc’ columns, after grouping them by match id and group id:

  agg_cols = ['groupId', 'matchId', 'team', 'winPlacePerc']
  df_mean = df_test[agg_cols].groupby(["groupId", "matchId"],
    as_index=False).mean().add_suffix("_mean")

run multiple such aggregations at once:

  agg = df_train.groupby(['matchId']) \
    .agg({'players_in_team': ['min', 'max', 'mean', 'median']})

specify the name suffixes of the generated aggregation column names:

  agg_train = df_train[agg_cols].groupby(["groupId", "matchId"],
                        as_index=False) \
    .agg([('_min', 'min'), ('_max', 'max'), ('_mean', 'mean')])

multiple aggregations will create a 2-level column header (see df.head()). to change it into one level (for merge etc):

  mi = agg_train.columns #mi for multi-index
  ind = pd.Index([e[0] + e[1] for e in mi.tolist()])
  agg_train.columns = ind

custom aggregation function:

  def q90(x):
      return x.quantile(0.9)

  agg = df_train.groupby(['matchId']) \
    .agg({'players_in_team':
       ['min', 'max', 'mean', 'median', q90]})

create new column as number of rows in a group:

  agg = df_train.groupby(['matchId']) \
          .size().to_frame('players_in_match')

group and sum column for groups:

  revenues = df_train.groupby("geoNetwork_subContinent") \
     ['totals_transactionRevenue'].sum() \
     .sort_values(ascending=False)

Merge

merge two dataframes (here df_train and agg) by a single column:

  df_train = df_train.merge(agg, how='left', on=['groupId'])

merge on multiple columns:

  dfg_train = dfg_train.merge(agg_train, how='left',
     on=["groupId", "matchId"])

set merge suffixes = ["", "_rank"] <- left and right side on overlapping columns

  dfg_test = dfg_test.merge(agg_test_rank, suffixes=["", "_rank"],
        how='left', on=["groupId", "matchId"])

above sets columns from dfg_test to have no suffix ("") and columns from agg_test_rank to have suffix "_rank"

merge columns by value:

  df_train['rankPoints'] = np.where(df_train['rankPoints'] < 0,
       df_train['winPoints'], df_train['rankPoints'])

->above sets value as winpoints if rankpoints is below zero, else rankpoints

merge by defining the column names to match on left and right:

  pd.merge(left, right, left_on='key1', right_on='key2')

merge types (the how=merge_type in pd.merge)(link):

  • inner: keep rows that match in both left and right
  • outer: keep all rows in both left and right
  • left: keep all rows from left and matching ones from right
  • right: keep all rows from right and matching ones from left

Rename, Delete, Compare Dataframes

rename columns to remove a suffix (here remove _mean):

  df_mean = df_mean.rename(columns={'groupId_mean': 'groupId',
               'matchId_mean': 'matchId'}) 

delete dataframes to save memory:

  del df_agg

find all columns in one dataframe but not in another

  diff_set = set(train_df.columns).difference(set(test_df.columns))
  print(diff_set)

find all columns that appear both in the dataframe and in a given list (here FEATS_EXCLUDED), and drop them

  common_set = set(train_df.columns).intersection(set(FEATS_EXCLUDED))
  train_df = train_df.drop(common_set, axis=1)

drop rows with index in list (stackoverflow):

  df.drop(df.index[[1,3]])

replace nulls/nans with 0:

  X.fillna(0)
  X_test.fillna(0)

only for specific columns:

  df[['a', 'b']] = df[['a','b']].fillna(value=0)

drop rows with null values for specific column:

  df_train.dropna(subset=['winPlacePerc'], inplace=True)

drop columns B and C:

  df.drop(['B', 'C'], axis=1)
  df.drop(['B', 'C'], axis=1, inplace = True)

Dataframe Statistics

this df has 800k rows (values) and 999 columns (features):

  df_train_subset.shape
  (800000, 999)

data types:

  df_train.dtypes

number of rows, columns, memory size (light, fast):

  df.info()

statistics per feature/column (cpu/mem heavy):

  df.describe()

number of unique values:

  df.nunique()

bounds:

  df.min()
  df.max()

replace infinity with 0; especially scalers can have issues with inf:

    X[col] = X[col].replace(np.inf, 0)

replace positive and negative inf with nan:

  df_pct.replace([np.inf, -np.inf], np.nan)

number of non-nulls per row in dataframe:

  df_non_null = df.notnull().sum(axis=1)

find the actual rows with null values:

  df_train[df_train.isnull().any(axis=1)]

number of times different values appear in column:

  df_train["device_operatingSystem"].value_counts()

sorted by index (e.g., 0,1,2,…)

  phase_counts = df_tgt0["phase"].value_counts().sort_index()

number of nans per column in dataframe:

  df.isnull().sum()

find columns with only one unique value

  const_cols = [c for c in train_df.columns 
                if train_df[c].nunique(dropna=False)==1 ]

drop them

  train_df = train_df.drop(const_cols, axis=1)

replace outlier values over 3std with 3std:

  upper = (df_train['totals_transactionRevenue'].mean()
            + 3*df_train['totals_transactionRevenue'].std())
  mask = (np.abs(df_train['totals_transactionRevenue']
           - df_train['totals_transactionRevenue'].mean()) >
           (3*df_train['totals_transactionRevenue'].std()))
  df_train.loc[mask, 'totals_transactionRevenue'] = upper

or use zscore:

  df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)

from scipy (stackoverflow)

  from scipy import stats
  import numpy as np
  z = np.abs(stats.zscore(boston_df))
  print(z)

show groupby object data statistics for each column by grouped element:

  grouped.describe()

create dataframe from classifier column names and importances (where supported), sort by weight:

  df_feats = pd.DataFrame()
  df_feats["names"] = X.columns
  df_feats["weights"] = clf.feature_importances_
  df_feats.sort_values(by="weights")

lag columns to show absolute value they changed over rows:

  for col in input_col_names:
    df_diff[col] = df_sig[col].diff().abs()

unique datatypes in dataframe:

  train_df.dtypes.unique()

show columns of selected type:

  train_df.select_dtypes(include=['float64'])

Numpy

Get numpy matrix from dataframe:

  data_diffs = df_diff.values

get column from numpy matrix as row

  for sig in range(0, 3):
      data_sig = data_measure[:,sig]

slice numpy array:

  bin_data_raw = data_sig[i:i + bin_size]

calculate statistics over numpy array:

  bin_avg_raw = bin_data_raw.mean()
  bin_sum_raw = bin_data_raw.sum()
  bin_std_raw = bin_data_raw.std()
  bin_percentiles = np.percentile(bin_data_raw, [0, 1, 25, 50, 75, 99, 100])
  bin_range = bin_percentiles[-1] - bin_percentiles[0]
  bin_rel_perc = bin_percentiles - bin_avg_raw

count outliers at different scales:

  outliers = []
  for r in range(1, 7):
      t = bin_std_diff * r
      o_count = np.sum(np.abs(bin_data_diff-bin_avg_diff) >= t)
      outliers.append(o_count)

concatenate multiple numpy arrays into one:

  bin_row = np.concatenate([raw_features, diff_features, bin_percentiles, bin_rel_perc, outliers])

limit values between min/max:

  df_sig.clip(upper=127, lower=-127)

value range in column (ptp = peak to peak):

  df.groupby('GROUP')['VALUE'].agg(np.ptp)

replace nan:

  my_arr[np.isnan(my_arr)] = 0

Time-Series

create values for percentage changes over time (row to row):

  df_pct = pd.DataFrame()
  for col in df_train_subset.columns[:30]:
      df_pct['pct_chg_'+col] = \
          df_train_subset[col].pct_change().abs()

pct_change() gives the change in percentage over time, abs() makes it absolute if just looking for overall change as in negative or positive.

also possible to set number of rows to count pct_change over:

  df_pct['pct_chg_'+col] = \
          df_train_subset[col].pct_change(periods=10)

pct_change on a set of items with specific values:

  df['pct_chg_open1'] = df.groupby('assetCode')['open'] \
           .apply(lambda x: x.pct_change())

TODO: try different ways to calculate ewm to see how this all works (EWMA, stackoverflow: https://stackoverflow.com/questions/37924377/does-pandas-calculate-ewm-wrong):

  df["close_ewma_10"] = df.groupby('assetName')['pct_chg_close1']
            .apply(lambda x: x.ewm(span=10).mean())

shift by 10 backwards

  y = mt_df["mket_close_10"].shift(-10)

Date Types

convert seconds into datetime

  start_col = pd.to_datetime(df_train.visitStartTime, unit='s')

parse specific columns as specific date types:

  df_train = pd.read_csv("../input/train-flat.csv.gz",
                         parse_dates=["date"], 
                         dtype={'fullVisitorId': 'str',
                              'date': 'str'
                             },
                        )

or if need to parse specific format:

  pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')

access elements of datetime objects:

  data_df["weekday"] = data_df['date'].dt.weekday
  data_df["hour"] = data_df['date'].dt.hour

Data Type Manipulations

one-hot encode / convert categorical:

  dfg_train = pd.get_dummies( dfg_train, columns = ['matchType'] )
  dfg_test = pd.get_dummies( dfg_test, columns = ['matchType'] )

set category value by range:

  # here 1 = solo game, 2 = duo game, 3 = squad, 4 = custom
  df_train['team'] = \
     [1 if i == 1 else 2 if i == 2 else 4 if i > 4 else 3
              for i in df_train["players_in_team_q90"]]

calculate value over two columns and make it a new column:

  dfg_train['killsNorm'] = (
      dfg_train['kills_mean'] *
      ((100-dfg_train['players_in_match'])/100 + 1))

  data_df['hit_view_ratio'] = (
      data_df['totals_pageviews']/data_df['totals_hits'])

mark a set of columns as category type:

  for col in X_cat_cols:
      df_train[col] = df_train[col].astype('category')

set category value based on set of values in column:

  X_test['matchType'] = X_test['matchType'].astype('str')
  X_test.loc[X_test['matchType'] == "squad-fpp", 'matchType'] = "squad"
  X_test['matchType'] = X_test['matchType'].astype('category')

how to get the indices from a dataframe as a list (e.g., collect a subset):

  list(outlier_collection.index.values)

to see it all on one line in Jupyter (easier copying, e.g. to drop all in list):

print(list(outlier_collection.index.values))

Jupyter

show dataframe as visual table in notebook (stackoverflow):

  from IPython.display import display, HTML

  display(df1)
  display(HTML(df2.to_html()))

pycharm notebook (jetbrains help):

correlations, unstacking the correlation matrix:
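A small sketch of what I mean by the unstacking (column names as placeholders; this lists the most correlated column pairs first):

  corr = df.corr()
  # unstack the matrix into a series indexed by (column1, column2)
  corr_pairs = corr.unstack().sort_values(ascending=False)
  # drop the trivial self-correlations of exactly 1
  corr_pairs = corr_pairs[corr_pairs < 1.0]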

Memory Reducer (from a Kaggle kernel):

import numpy as np

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object and col_type.name != 'category':
            #print(col_type)
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
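Typical usage would be simply to run it over a loaded dataframe and keep the returned (downcast) version:

df_train = reduce_mem_usage(df_train)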

Investigating ROC/AUC

Executive Summary

ROC and AUC are terms that often come up in machine learning, in relation to evaluating models. In this post, I try to examine what ROC curves actually are, how they are calculated, what a threshold in the ROC curve is, and how it impacts the classification if you change it. The results show that using low threshold values leads to classifying more objects into the positive category, and a high threshold leads to more negative classifications. The Titanic dataset is used here as an example to study the threshold effect. A classifier gives probabilities for people to survive.

By default, the classifier seems to use 50% probability as a threshold to classify one as a survivor (positive class) or non-survivor (negative class). But this threshold can be varied. For example, using a threshold of 1% leads to giving a survivor label to anyone who gets a 1% or higher probability to survive by the classifier. Similarly, a threshold of 95% requires people to get at least 95% probability from the classifier to be given a survivor label. This is the threshold that is varied, and the ROC curve visualizes how the change in this probability threshold impacts classification in terms of getting true positives (actual survivors predicted as survivors) vs getting false positives (non-survivors predicted as survivors).

For the loooong story, read on. I dare you. 🙂

Introduction

Area under the curve (AUC) and receiver operating characteristic (ROC) are two terms that seem to come up a lot when learning about machine learning. This is my attempt to get it. Please do let me know where it goes wrong..

The ROC is typically drawn as a curve using some figure like this, with AUC being the area under that curve (from Wikipedia):

Roccurves
FIGURE 1. Example ROC/AUC curve.

It uses true positive rate (TPR) and false positive rate (FPR) as the two measures to compare. A true positive (TP) being a correct positive prediction and a false positive (FP) being a wrong positive prediction.

There is an excellent introduction to the topic in the often cited An introduction to ROC analysis article by Tom Fawcett, which is actually a very understandable article (for the most part), unlike most academic articles I read. Brilliant. But still, I wondered, so let's see..

To see ROC/AUC in practice, instead of just reading about it all over the Internet, I decided to implement a ROC calculation myself and see if I could match it with the existing implementations. If so, that should at least give me some confidence that my understanding of the topic is correct.

Training a classifier to test ROC/AUC

To do this, I needed a dataset and a fitting of some basic classifier on that data to run my experiments. I have recently been going through some PySpark lectures on Udemy, so I could maybe learn some more big data stuffs and get an interesting big data job someday. Woohoo. Anyway. The course was using the Titanic dataset, so I picked that up, and wrote a simple classifier for it in Pandas/Scikit. Being a popular dataset, there is also a related Kaggle for it. Which is always nice, allowing me to create everything a bit faster by using the references, and focus on the ROC investigation faster.

The classifier I used is Logistic Regression, and the notebook is available on my GitHub (GitHub sometimes seems to have issues rendering it, but it's there; I will look into putting it also on Kaggle later). Assuming some knowledge of Python, Pandas, sklearn, and the usual stuff, the training of the model itself is really the same as usual:

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=2)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

The above is of course just a cut off into the middle of the notebook, where I already pre-processed the data to build some set of features into the features variable, and took the prediction target (survived) into the target variable. But that is quite basic ML stuff you find in every article on the topic, and on many of the “kernels” on the Kaggle page I linked. For more details, if all else fails, check my notebook linked above.

The actual training of the classifier (after preprocessings..) is as simple as the few lines above. That’s it. I then have a Logistic Regression classifier I can use to try and play with ROC/AUC. Since ROC seems to be about playing with thresholds over predicted class probabilities, I pull out the target predictions and their given probabilities for this classifier on the Titanic dataset:

predictions = logreg.predict(X_test)
predicted_probabilities = logreg.predict_proba(X_test)

Now, predicted_probabilities holds the classifier's predicted probability values for each row in the dataset: 0 for never going to survive, 1 for 100% going to survive. Just predictions, of course, not actual facts. That's the point of learning a classifier, to be able to predict the result for instances where you don't know the answer in advance, as you know.. 🙂

Just to see a few things about the data:

predicted_probabilities.shape

>(262, 2)

X_test.shape

>(262, 9)

This shows the test set has 262 rows (data items), there are 9 feature variables I am using, and the prediction probabilities are given in 2 columns. The classifier is a binary classifier, giving predictions for a given data instance as belonging to one of the two classes. In this case they are survived and not survived. A true prediction equals survival, a false prediction equals non-survival. The predicted_probabilities variable contains probabilities for false in column 0 and true in column 1. Since the probability of no survival (false) is the opposite of survival (1 - survival probability), we really just need to keep one of those two columns. Because if it is not true, it has to be false. Right?
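A quick sanity check on that (the two columns of a binary classifier's predict_proba output should always sum to one, give or take float rounding; numpy is imported as np as elsewhere in the notebook):

np.allclose(predicted_probabilities.sum(axis=1), 1.0)

>True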

Cutting the true/false predictions to only include the true prediction, or the probability of survival (column 1, all rows):

pred_probs = predicted_probabilities[:, 1]

Drawing some ROC curves

With this, getting the ROC curve from sklearn is as simple as:

[fpr, tpr, thr] = roc_curve(y_test, pred_probs)

The Kaggle notebook I used as a reference visualizes this, and also draws two fancy-looking dashed blue lines to illustrate a point on the ROC curve where 95% of the surviving passengers are identified. This point is identified by:

# index of the first threshold for which the TPR > 0.95
idx = np.min(np.where(tpr > 0.95))

So it is looking for the minimum index in the TPR list, where the TPR is above 95%. This results in a ROC curve drawing as:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr, tpr, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot([0,fpr[idx]], [tpr[idx],tpr[idx]], 'k--', color='blue')
plt.plot([fpr[idx],fpr[idx]], [0,tpr[idx]], 'k--', color='blue')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()

The resulting figure is:

roc0
FIGURE 2. The ROC curve I nicked from Kaggle.

How to read this? In the bottom left we have zero FPR and zero TPR. This means everything is predicted as false (no survivors). This would correspond to a threshold of 100% (1.0), meaning you classify everything as false because no-one gets over 100% probability to survive. In the top right corner both TPR and FPR are 1.0, meaning you classify everyone as a survivor and no-one as a non-survivor. So false positives are maximized, as are true positives, since everyone is predicted to survive. This is a result of a threshold of 0%, as everyone gets a probability higher than 0% to survive. The blue lines indicate a point where over 95% of true positives are identified (TPR value), which simultaneously leads to about 66% FPR. Of course, the probability threshold is not directly visible in this, but has to be looked up separately, as we shall see in all the text I wrote below.

To see the accuracy and AUC calculated for the base (logistic regression) classifier, and the AUC calculated for the ROC in the figure above:

print(logreg.__class__.__name__+" accuracy is %2.3f" % accuracy_score(y_test, predictions))
print(logreg.__class__.__name__+" log_loss is %2.3f" % log_loss(y_test, pred_probs))
print(logreg.__class__.__name__+" auc is %2.3f" % auc(fpr, tpr))

>LogisticRegression accuracy is 0.832
>LogisticRegression log_loss is 0.419
>LogisticRegression auc is 0.877

And some more statistics from sklearn to compare against when trying to understand it all:

confusion_matrix(y_test, predictions)

>array([[156,  19],
>       [ 25,  62]])

precision_score(y_test, predictions)

>0.7654320987654321

accuracy_score(y_test, predictions)

>.8320610687022901

In the confusion matrix, there are 156 correct non-survivor predictions (true negatives) and 19 non-survivors wrongly predicted as survivors (false positives), plus 62 correct survivor predictions (true positives) and 25 survivors wrongly predicted as non-survivors (false negatives). The precision of the classifier is given as 0.7654320987654321 and accuracy as 0.8320610687022901.

To evaluate my ROC/AUC calculations, I first need to try to understand how the algorithms calculate the metrics, and what are the parameters being passed around. I assume the classification algorithms would use 0.5 (or 50%) as a default threshold to classify a prediction with a probability of 0.5 or more as 1 (predicted true/survives) and anything less than 0.5 as 0 (predicted non-survivor/false). So let’s try to verify that. First, to get the predictions with probability values >= 0.5 (true/survive):

#ss is my survivor-set
ss1 = np.where(pred_probs >= 0.5)
print(ss1)

>(array([  0,   1,   3,   4,   8,  13,  16,  26,  27,  30,  32,  42,  43,
>         46,  49,  50,  57,  59,  66,  67,  76,  77,  79,  83,  84,  85,
>         86,  92,  93,  98, 101, 109, 111, 113, 117, 121, 122, 125, 129,
>        130, 133, 137, 140, 143, 145, 147, 148, 149, 150, 152, 158, 170,
>        175, 176, 177, 180, 182, 183, 184, 186, 201, 202, 204, 205, 206,
>        211, 213, 215, 218, 227, 230, 234, 240, 241, 242, 246, 248, 249,
>        250, 252, 256]),)

len(ss1[0])

>81

The code above filters the predicted probabilities to get a list of values where the probability is higher than or equal to 0.5. At first look, I don't know what the numbers in the array are. However, my guess is that they are indices into the X_test array that was passed as the features to predict from. So at X_test indices 0, 1, 3, 4, 8, … are the data points predicted as true (survivors). And here we have 81 such predictions. That is the survivor predictions, how about non-survivor predictions?

Using an opposite filter:

ss2 = np.where(pred_probs < 0.5)
len(ss2[0])

>181

Overall, 81 predicted survivors, 181 predicted casualties. Since y_test here has the known labels (to test against), we can check how many real survivors there are, and what the overall population is:

y_test.value_counts()

>0    175
>1     87

So the number of actual survivors in the test set is 87 vs 175 non-survivors. To see how many real survivors (true positives) there are in the predicted 81 survivors:

sum(y_test.values[i] for i in ss1[0])

>62

The above is just a Python generator expression summing up the known labels at the predicted survivor indices. Basically it says that out of the 81 predicted survivors, 62 were actual survivors. This matches the confusion matrix above, so it seems I got that correct. Same for non-survivors:

sum(1 for i in ss2[0] if y_test.values[i] == 0)

>156

So, 156 out of the 181 predicted non-survivors actually did not make it. Again, matches the confusion matrix.

Now, my assumption was that the classifier uses the threshold of 0.5 by default. How can I use these results to check if this is true? To do this, I try to match the sklearn accuracy score calculations using the above metrics. Total correct classifications from the above (true positives+true negatives) is 156+62. Total number of items to predict is equal to the test set size, so 262. The accuracy from this is:

(156+62)/262

>0.8320610687022901

This is a perfect match with the accuracy_score calculated above. So I conclude that I understood the default behaviour of the classifier correctly. Now to use that understanding to see if I got the ROC curve right. In FIGURE 2 above, I showed the sklearn-generated ROC curve for this dataset and this classifier. Now I need to build my own from scratch to see if I get it right.
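Another way to check the same 0.5 assumption would be to threshold the probabilities directly and compare against the classifier's own predict() output (ignoring the corner case of a probability of exactly 0.5):

my_predictions = (pred_probs >= 0.5).astype(int)
# should be True if 0.5 really is the default cut-off
np.array_equal(predictions, my_predictions)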

Thresholding 1-99% in 1% increments

To build a ROC curve myself using my own threshold calculations:

tpr2 = []
fpr2 = []
#do the calculations for range of 1-100 percent confidence, increments of 1
for threshold in range(1, 100, 1):
    ss = np.where(pred_probs >= (threshold/100))
    count_tp = 0
    count_fp = 0
    for i in ss[0]:
        if y_test.values[i] == 1:
            count_tp += 1
        else:
            count_fp += 1
    tpr2.append(count_tp/87.0)   # 87 actual survivors in the test set
    fpr2.append(count_fp/175.0)  # 175 actual non-survivors
    print("threshold="+str(threshold)+", count_tp="+str(count_tp)+", count_fp="+str(count_fp))

The above code takes as input the probabilities (pred_probs) given by the Logistic Regression classifier for survivors. It then tries the threshold values from 1 to 99% in 1% increments. This should produce the ROC curve points, as the ROC curve should describe the prediction TPR and FPR with different probability thresholds. The code gives a result of:

threshold=1, count_tp=87, count_fp=175
threshold=2, count_tp=87, count_fp=174
threshold=3, count_tp=87, count_fp=174
threshold=4, count_tp=87, count_fp=173
threshold=5, count_tp=87, count_fp=171
threshold=6, count_tp=87, count_fp=169
...
threshold=94, count_tp=3, count_fp=1
threshold=95, count_tp=1, count_fp=0
threshold=96, count_tp=0, count_fp=0
threshold=97, count_tp=0, count_fp=0
threshold=98, count_tp=0, count_fp=0
threshold=99, count_tp=0, count_fp=0

At 1% threshold, everyone that the classifier gave a probability of 1% or higher to survive is classified as a likely survivor. In this dataset, everyone is given 1% or more survival probability. This leads to everyone being classified as a likely survivor. Since everyone is classified as survivor, this gives true positives for all 87 real survivors, but also false positives for all the 175 non-survivors. At 6% threshold, the FPR has gone down a bit, with only 169 non-survivors getting false-positives as survivors.

At the threshold high-end, the situation reverses. At threshold 94%, only 3 true survivors get classified as likely survivors. Meaning only 3 actual survivors scored a probability of 94% or more to survive from the classifier. There is one false positive at 94%, so one who was predicted to have a really high probability to survive (94%) did not. At 95% there is only one predicted survivor, which is a true positive. After that no-one scores 96% or more. Such high outliers would probably make interesting points to look into in more detail, but I won’t go there in this post.

Using the true positive rates and false positive rates from the code above (variables tpr2 and fpr2), we can make a ROC curve for these calculations as:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr2, tpr2, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr2, tpr2))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC test')
plt.legend(loc="lower right")
plt.show()

roc2
FIGURE 3. ROC curve for thresholds of 1-99% in 1% steps.

Comparing this with the ROC figure from sklearn (FIGURE 2), it is a near-perfect match. We can further visualize this by plotting the two on top of each other:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr2, tpr2, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr2, tpr2))
plt.plot(fpr, tpr, color='blue', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC test')
plt.legend(loc="lower right")
plt.show()

roc3
FIGURE 4. My 1-99% ROC vs sklearn ROC.

There are some very small differences in the sklearn version having more “stepped” lines, whereas the one I generated from threshold of 1-99% draws the lines a bit more straight (“smoother”). Both end up with the exact same AUC value (0.877). So I would call this ROC curve calculation solved, and claim that the actual calculations match what I made here. Solved as in I finally seem to have understood what it exactly means.

What is the difference

Still, loose ends are not nice. So what is causing the small difference in the ROC lines from the sklearn implementation vs my code?

The FPR and TPR arrays/lists form the x and y coordinates of the graphs. So looking at those might help understand a bit:

fpr.shape

>(95,)

len(fpr2)

>99

The above shows that the sklearn FPR array has 95 elements, while the one I created has 99 elements. Initially I thought, maybe increasing detail in the FPR/TPR arrays I generate would help match the two. But it seems I already generated more points than is in the sklearn implementation. Maybe looking into the actual values helps:

fpr

array([0.        , 0.        , 0.00571429, 0.00571429, 0.01142857,
       0.01142857, 0.01714286, 0.01714286, 0.02285714, 0.02285714,
       0.02857143, 0.02857143, 0.03428571, 0.03428571, 0.04      ,
       0.04      , 0.04571429, 0.04571429, 0.05142857, 0.05142857,
       0.05714286, 0.05714286, 0.06285714, 0.06285714, 0.07428571,
       0.07428571, 0.07428571, 0.08      , 0.08      , 0.09142857,
       0.09142857, 0.09714286, 0.09714286, 0.10857143, 0.10857143,
       0.12      , 0.12      , 0.14857143, 0.14857143, 0.16      ,
       0.16      , 0.17142857, 0.17142857, 0.17142857, 0.18857143,
       0.18857143, 0.21142857, 0.21142857, 0.22285714, 0.22857143,
       0.23428571, 0.23428571, 0.24      , 0.24      , 0.26285714,
       0.28571429, 0.32571429, 0.33714286, 0.34857143, 0.34857143,
       0.37714286, 0.37714286, 0.38285714, 0.38285714, 0.38857143,
       0.38857143, 0.4       , 0.4       , 0.43428571, 0.45714286,
       0.46857143, 0.62857143, 0.62857143, 0.63428571, 0.63428571,
       0.64      , 0.64      , 0.64571429, 0.65714286, 0.66285714,
       0.66285714, 0.68      , 0.69142857, 0.72      , 0.73142857,
       0.73714286, 0.74857143, 0.79428571, 0.82285714, 0.82285714,
       0.85142857, 0.86285714, 0.88      , 0.89142857, 1.        ])

np.array(fpr2)

array([1.        , 0.99428571, 0.99428571, 0.98857143, 0.97714286,
       0.96571429, 0.96571429, 0.95428571, 0.92      , 0.89714286,
       0.84571429, 0.78857143, 0.66285714, 0.63428571, 0.58857143,
       0.52      , 0.49142857, 0.48      , 0.40571429, 0.39428571,
       0.38285714, 0.37714286, 0.35428571, 0.33714286, 0.32      ,
       0.31428571, 0.29714286, 0.26285714, 0.25142857, 0.24571429,
       0.23428571, 0.22857143, 0.21142857, 0.21142857, 0.20571429,
       0.20571429, 0.2       , 0.19428571, 0.18857143, 0.18857143,
       0.17714286, 0.17714286, 0.17142857, 0.16      , 0.14857143,
       0.13714286, 0.13142857, 0.12      , 0.10857143, 0.10857143,
       0.10857143, 0.09714286, 0.09714286, 0.09714286, 0.08571429,
       0.08571429, 0.08      , 0.07428571, 0.07428571, 0.06857143,
       0.06285714, 0.05714286, 0.05714286, 0.05142857, 0.05142857,
       0.04571429, 0.04      , 0.04      , 0.04      , 0.04      ,
       0.03428571, 0.02857143, 0.02857143, 0.02857143, 0.02857143,
       0.02285714, 0.02285714, 0.02285714, 0.01714286, 0.01714286,
       0.01142857, 0.01142857, 0.01142857, 0.01142857, 0.01142857,
       0.01142857, 0.00571429, 0.00571429, 0.00571429, 0.00571429,
       0.00571429, 0.00571429, 0.00571429, 0.00571429, 0.        ,
       0.        , 0.        , 0.        , 0.        ])

In the above, I print two arrays/lists. The first one (fpr) is the sklearn array. The second one (fpr2) is the one I generated myself. Note that fpr2 runs in descending order, since my thresholds go from low to high, while sklearn's fpr is ascending. More importantly, fpr2 contains many duplicate numbers one after the other, whereas fpr has many more unique values. My guess is that the combination of fpr and tpr in the sklearn values might contain only unique points, whereas fpr2 and tpr2 from my code repeat several points multiple times.

What causes this? Looking at sklearn roc_curve method, it actually returns 3 values, and I so far only used 2 of those. The return values are for variables fpr, tpr, thr. The thr one is not yet used and is actually named thresholds in the sklearn docs. What are these thresholds?

thr

array([0.95088415, 0.94523744, 0.94433151, 0.88735906, 0.86820046,
       0.81286851, 0.80847662, 0.79241033, 0.78157711, 0.76377304,
       0.75724659, 0.72956179, 0.71977863, 0.70401182, 0.70160521,
       0.66863069, 0.66341944, 0.65174523, 0.65024897, 0.6342464 ,
       0.63001635, 0.61868955, 0.61479088, 0.60172207, 0.59225844,
       0.58177374, 0.58027421, 0.57885387, 0.56673186, 0.54986331,
       0.54937079, 0.54882292, 0.52469167, 0.51584338, 0.50426613,
       0.48014018, 0.47789982, 0.45232349, 0.44519259, 0.44203986,
       0.43892674, 0.43378224, 0.42655293, 0.42502826, 0.40283515,
       0.39572852, 0.34370988, 0.34049813, 0.32382773, 0.32099169,
       0.31275499, 0.31017684, 0.30953277, 0.30348576, 0.28008661,
       0.27736946, 0.24317994, 0.24314769, 0.23412169, 0.23370882,
       0.22219555, 0.22039886, 0.21542313, 0.21186501, 0.20940184,
       0.20734276, 0.19973878, 0.19716815, 0.18398295, 0.18369904,
       0.18369725, 0.14078944, 0.14070302, 0.14070218, 0.13874243,
       0.13799297, 0.13715783, 0.13399095, 0.13372175, 0.13345383,
       0.13057344, 0.1279474 , 0.1275107 , 0.1270535 , 0.12700496,
       0.12689617, 0.12680186, 0.11906236, 0.11834502, 0.11440003,
       0.10933941, 0.10860771, 0.10552423, 0.10468277, 0.0165529 ])

This array shows how they are quite different and there is no set value that is used to vary the threshold. Unlike my attempt at doing it in 1% unit changes, this list has much bigger and smaller changes in it. Let’s try my generator code with this same set of threshold values:

tpr3 = []
fpr3 = []
#do the calculations for sklearn thresholds
for threshold in thr:
    ss = np.where(pred_probs >= threshold)
    count_tp = 0
    count_fp = 0
    for i in ss[0]:
        if y_test.values[i] == 1:
            count_tp += 1
        else:
            count_fp += 1
    tpr3.append(count_tp/87.0)
    fpr3.append(count_fp/175.0)
    print("threshold="+str(threshold)+", count_tp="+str(count_tp)+", count_fp="+str(count_fp))

And with the TPR and FPR lists calculated for these thresholds, we can visualize them as well and compare against the sklearn coordinates:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr3, tpr3, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr3, tpr3))
plt.plot(fpr, tpr, color='blue', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC test')
plt.legend(loc="lower right")
plt.show()

Giving the figure:

roc5
FIGURE 5. Overlapping ROC curves with shared thresholds.

This time only one line is visible since the two are fully overlapping. So I would conclude that at least now I got also my ROC curve understanding validated. My guess is that sklearn does some nice calculations in the back for the ROC curve coordinates, to identify threshold points where there are visible changes and only providing those. While I would then use the ROC function that sklearn or whatever library I use (e.g., Spark) provides, this understanding I managed to build should help me make better use of their results.
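Looking at the sklearn documentation afterwards, roc_curve does have a drop_intermediate parameter (True by default) that drops suboptimal thresholds which would not show up on a plotted ROC curve, which seems to explain the smaller number of points it returns. Keeping them all should give a longer threshold list, closer to a full sweep:

# keep every threshold instead of letting sklearn drop the suboptimal ones
fpr_all, tpr_all, thr_all = roc_curve(y_test, pred_probs, drop_intermediate=False)
len(thr_all)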

Varying threshold vs accuracy

OK, just because this is an overly long post already, and I still cannot stop, one final piece of exploration. What happens to the accuracy as the threshold is varied? As 0.5 seems to be the default threshold, is it always the best? In cases where we want to minimize false positives or maximize true positives, maybe the threshold optimization is most obvious. But in case of just looking at the accuracy, what is the change? To see, I collected accuracy for every threshold value suggested by sklearn roc_curve(), and also for the exact 0.5 threshold (which was not in the roc_curve list but I assume is the classifier default):

tpr3 = []
fpr3 = []
accuracies = []
accs_ths = []

#do the calculations for sklearn thresholds
for threshold in thr:
    ss = np.where(pred_probs >= threshold)
    count_tp = sum(y_test.values[i] for i in ss[0])
    count_fp = len(ss[0]) - count_tp
    my_tpr = count_tp/87.0
    my_fpr = count_fp/175.0

    ssf = np.where(pred_probs < threshold)
    count_tf = sum(1 for i in ssf[0] if y_test.values[i] == 0)

    tpr3.append(my_tpr)
    fpr3.append(my_fpr)
    acc = (count_tp+count_tf)/y_test.count()
    accuracies.append(acc)
    accs_ths.append((acc, threshold))
    print("threshold="+str(threshold)+", tp="+str(count_tp)+", fp="+str(count_fp), ", tf="+str(count_tf), ", acc="+str(acc))

When the accuracy is graphed against the ROC curve, this looks like:

roc-acc
FIGURE 6. ROC thresholds vs Accuracy.

Eyeballing this, at the very left of FIGURE 6 the accuracy is equal to predicting all non-survivors correct (175/262=66.8%) but all survivors wrong. At the very right the accuracy is equal to predicting all survivors correct but all non-survivors wrong (87/262=33.2%). The sweet spot is somewhere around 83-84% accuracy, with about 65% true positives and maybe 8% false positives. I am just eyeballing this from the graph, so don't take it too literally. The figure above was plotted with:

plt.figure(figsize=(10, 6), dpi=80)
plt.plot(fpr3, tpr3, color='blue', label='ROC curve (area = %0.3f)' % auc(fpr3, tpr3))
plt.plot(fpr3, accuracies, color='coral', label='Accuracy')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC test')
plt.legend(loc="lower right")
plt.show()

And to get a sorted list of accuracies by threshold:

accs_ths.sort(reverse=True)
accs_ths

results in:

  accuracy,           threshold,          tp, tn,  total correct
 (0.8396946564885496, 0.5802742149381693, 58, 162, 220),
 (0.8396946564885496, 0.5667318576914661, 59, 161, 220),
 (0.8358778625954199, 0.5788538683111758, 58, 161, 219),
 (0.8358778625954199, 0.5493707899389861, 60, 159, 219),
 (0.8358778625954199, 0.5246916651561455, 61, 158, 219),
 (0.8320610687022901, 0.6342464019774986, 52, 166, 218),
 (0.8320610687022901, 0.6186895526474085, 53, 165, 218),
 (0.8320610687022901, 0.6017220679887019, 54, 164, 218),
 (0.8320610687022901, 0.5817737394188721, 56, 162, 218),
 (0.8320610687022901, 0.549863313475955, 59, 159, 218),
 (0.8320610687022901, 0.548822920002867, 60, 158, 218),
 (0.8320610687022901, 0.5042661255726298, 62, 156, 218),
 (0.8320610687022901, 0.5000000000000000, 62, 156, 218),
 (0.8282442748091603, 0.6300163468361882, 52, 165, 217),
 (0.8282442748091603, 0.6147908791071057, 53, 164, 217),
 (0.8282442748091603, 0.5158433833082536, 61, 156, 217),
 (0.8282442748091603, 0.4778998233729275, 63, 154, 217),
 (0.8244274809160306, 0.5922584447936917, 54, 162, 216),
 (0.8244274809160306, 0.4801401821434857, 62, 154, 216),
 ...

Here a number of thresholds actually give a higher score than the 0.5 one I used as the reference for default classifier. The highest scoring thresholds being 0.5802 (58.02%) and 0.5667 (56.67%). The 0.5 threshold gets 218 predictions correct, with an accuracy of 0.8320 (matching the one from accuracy_score() at the beginning of this post). Looking at the thresholds and accuracies more generally, the ones slightly above the 0.5 threshold generally seem to do a bit better. Threshold over 0.5 but less than 0.6 seems to score best here. But there is always an exception, of course (0.515 is lower). Overall, the difference between the values is small but interesting. Is it overfitting the test data when I change the threshold and evaluate against the test data? If so, does the same consideration apply for optimizing for precision/recall using the ROC curve? Is there some other reason why the classifier would use 0.5 threshold which is not optimal? Well, my guess is, it might be a minor artefact of over/underfitting. No idea, really.

Area Under the Curve (AUC)

Before I forget. The title of the post was ROC/AUC. So what is area under the curve (AUC)? The curve is the ROC curve, the AUC is the area under the ROC curve. See FIGURE 7, where I used the epic Aseprite to paint the area under the FIGURE 5 ROC curve. Brilliant. AUC refers to the area covered by this part I colored, under the ROC curve. The value of AUC is calculated as the fraction of the overall area. So consider the whole box of FIGURE 7 as summing up to 1 (TPR x FPR = 1 x 1 = 1), and in this case AUC is the part marked in the figure as area = 0.877. AUC is calculated simply by calculating the size of the area under the curve vs the full box (at full size 1).

roc5_auc
FIGURE 7. Area Under the Curve.

I believe the rationale is, the more the colored area covers, or the bigger the AUC value, the better overall performance one could expect from the classifier. As the area grows bigger, the more the classifier is able to separate true positives from false positives at different threshold values.
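As a quick numeric check of that, the same area can be computed with the trapezoid rule directly over the sklearn FPR/TPR coordinates (fpr is in ascending order as returned by sklearn, so numpy's trapz works as-is):

np.trapz(tpr, fpr)
# should match the auc(fpr, tpr) value of 0.877 printed earlier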

Uses for AUC/ROC

To me, AUC seems most useful in evaluating and comparing different machine learning algorithms. Which is also what I recall seeing it being used for. In such cases, the higher the AUC, the better overall performance you would get from the algorithm. You can then boast in your paper about having a higher overall performance metric than some other random choice.

ROC I see as mostly useful for providing extra information and an overview to help evaluate options for TPR vs FPR in different threshold configurations for a classifier. The choice and interpretation depends on the use case, of course. The usual use case in this domain is doing a test for cancer. You want to maximize your TPR so you miss out on fewer people with cancer. You can then look for your optimal location of the ROC curve to climb onto, with regards to the cost vs possibly missed cases. So you would want a high TPR there, as far as you can afford I guess. You might have a higher FPR but such is the tradeoff. In this case, the threshold would likely be lower rather than higher.

It seems harder to find examples of optimizing for low FPR, with the tradeoff being a lower TPR as well. Perhaps one could look into the Kaggle competitions and pick, for example, the topic of targeted advertising. For a lower FPR, you could set a higher threshold rather than a lower one. But your use case could be pretty much anything, I guess.

Like I said, recently I have been looking into Spark and some algorithms seem to only give out the estimation as an AUC metric. Which is a bit odd but I guess there is some clever reason for that. I have not looked too deep into that aspect of Spark yet, probably I am just missing the right magical invocations to get all the other scores.

Some special cases are also discussed, for example, in the Fawcett paper. At some point in the curve, one classifier might have a higher point in ROC space even while having a lower overall AUC value. So some threshold would give a higher value for one classifier, while other thresholds favour the other one of the pair. Similarly, AUC can be higher overall, but a specific classifier still better for a specific use case. Sounds a bit theoretical, but interesting.

Why is it called Receiver Operating Characteristic (ROC)?

You can easily find information on the ROC curve history referencing its origins in World War 2 and radar signal detection theory. The Wikipedia ROC article has some of this history, but is quite short. As usual, Stackexchange gives a good reference. The 1953 article seems paywalled, and I do not have access. This short description describes it as being used to measure the ability of a radio receiver to produce quality readings, enabling the operator to distinguish between false positives and true positives.

Elsewhere I read it originated from Pearl Harbour attack during WW2, where they tried to analyze why the radar operators failed to see incoming attack aircraft. What do I know. The internet is full of one-liner descriptions on this topic, all circling around the same definitions but never going into sufficient details.

Conclusions

Well, the conclusion is that this was a way too long post on a simple term. And trying to read it, it is not even that clearly written. But that is what I got. It helped me think about the topic, and to clarify and verify what it really is. Good luck.

PoW vs PoS. Whut?

Recently I have tried to understand blockchains and how they actually work. It seems the internet is full of high-level descriptions that always stop before actually explaining how the different blockchain variants really work. What is proof-of-work in practice, and what does proof-of-stake actually mean? What actually happens when you deploy a blockchain node in different variants, and how does the blockchain "network" reach "consensus"? What is the blockchain "network"? What is "consensus"? What is "byzantine fault tolerance" and how does it relate to blockchains?

I started with looking into Bitcoin, since it is the most popular and has the most information available. There is a nice book about it even, with reasonable amounts of implementation level details as well. The book is named “Mastering Bitcoin: Programming the Open Blockchain 2nd Edition”. Just in case the Amazon link breaks. I am sure there are many other great books and resources but I really liked that one. The Naivecoin tutorial I discussed before is also good stuff. So generally, resources are there for understanding Bitcoin style PoW blockchains, also at more detailed level. But let’s see what I think, and you can then correct me. I start with PoW and then give some PoS assumptions.

Proof of Work (PoW)

Proof of Work simply refers to running a specific hash function for long enough to find a set of values that produces a hash value matching some given criterion. An example using a generic block structure, mostly imitating Bitcoin block structure:

  • prev_hash: Previous block hash
  • merkle_root: Hash for data in this block (Merkle root)
  • timestamp: Current timestamp
  • diff_target: Difficulty target
  • nonce: A value that is changed to try to find different hashes
  • transaction_list: List of transactions included in this block

I use the above properties as an example for hashing. Additionally, there would be a number of more generic meta-data, such as magic number, block size, version. Such metadata are used in parsing to find expected data size of block (e.g., bytes to read from socket), and to verify the block data is valid (correct magic number, version number, ..). For more details on actual Bitcoin block elements, see e.g., here.

To build a hash for PoW, the above list of data can be all hashed together. So it becomes something like this:

block_hash = SHA256(prev_hash+merkle_root+timestamp+diff_target+nonce+transaction_list)

This is slightly simplified, but serves the purpose. Here SHA256 is a specific hash function, and various other hash functions are often used as well. The resulting block_hash can be used to verify the contents of the block, since to get the same hash value you need the exact same block contents as input to the hash function. This protects against anyone changing an already processed block (e.g., modifying existing transactions in the blockchain), since the attacker would have to re-calculate a new hash for this block, as well as for all blocks coming after it (so the longer/deeper the block is in the chain, the "safer" it is from modification, as it is buried deeper under new blocks).

The blockchain itself is just a long list of blocks, each referencing to the previous block (with prev_hash). The term chain refers to each block linking to previous one this way. This connection is enforced by including the prev_hash in the next block along with block_hash of the block itself. So you cannot just change the order of the blocks in the chain, as the linked prev_hash values would no longer match.

Look, I made a pretty picture:

chain

Here, the block itself contains the block_hash, prev_hash, and block_data. So block_hash is calculated over all the rest of the data in the block (including prev_hash). Prev_hash links the block to the previous block in the chain, as it is the block_hash of the previous block. Block_data contains all the rest of the data, and is of course included in the block_hash, meaning block_hash verifies the integrity of all this data. Because changing block_data (or prev_hash) would change block_hash, and thus invalidate the whole chain (links would no longer be valid as prev_hash would differ from block_hash of previous block). Cute.
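To make that linking rule concrete, here is a small toy sketch (my own simplification, not any real coin's code) of how a node could re-check both the block hashes and the prev_hash links when validating a chain:

import hashlib

def calc_block_hash(block):
    # hash everything in the block except the stored block_hash itself
    data = "{prev_hash}{merkle_root}{timestamp}{diff_target}{nonce}{transactions}".format(**block)
    return hashlib.sha256(data.encode()).hexdigest()

def validate_chain(chain):
    for i, block in enumerate(chain):
        # the stored hash must match the block's own contents
        if block["block_hash"] != calc_block_hash(block):
            return False
        # prev_hash must match the hash of the previous block in the chain
        if i > 0 and block["prev_hash"] != chain[i - 1]["block_hash"]:
            return False
    return True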

Anyway. What is proof of work? Anyone can calculate a block hash using the above hashing formula. But not just any hash will be accepted by the blockchain network. The acceptance criterion is defined in the software designed to run the blockchain network. The most common one is a difficulty-based criterion. The Naivecoin tutorial has a good and practical description of one. The difficulty is calculated using an algorithm with several parameters. A typical example:

  • current_difficulty: Current mining (PoW) difficulty.
  • adjustment_interval: Number of blocks to produce until difficulty should be adjusted.
  • target_interval: The target time that should be spent between generating blocks.

The blockchain always targets to produce blocks with an interval of target_interval between them. For example, Bitcoin uses a target_interval of 10 minutes and an adjustment_interval of 2016 blocks. With a good difficulty value, 2016 blocks should take 2 weeks to generate: 2016*10/60/24 = 14 days, where 2016 is the number of blocks in the adjustment interval, 10 the minutes per block, 60 the minutes in an hour, and 24 the hours in a day. If mining 2016 blocks takes less than 14 days, the difficulty is adjusted higher. If the mining takes more than 14 days, the difficulty is adjusted lower. This way the blockchain network adjusts its parameters over time to keep the rate of coins mined close to the specified rate. Many blockchains use a shorter target_interval than 10 minutes and a shorter adjustment_interval than 14 days, but generally the idea is the same. This is the basic idea, which can be built on by other difficulty adjustment algorithms such as LWMA, which seems to be quite popular with many altcoins currently.
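As a sketch of that adjustment logic (roughly in the spirit of the Naivecoin-style algorithm mentioned above, not Bitcoin's exact retargeting code):

def adjust_difficulty(current_difficulty, actual_time_taken,
                      adjustment_interval=2016, target_interval=10 * 60):
    # time the last adjustment_interval blocks should have taken
    # (e.g. 2016 blocks * 10 minutes = 14 days)
    expected_time = adjustment_interval * target_interval
    if actual_time_taken < expected_time / 2:
        return current_difficulty + 1   # blocks found too fast -> raise difficulty
    if actual_time_taken > expected_time * 2:
        return current_difficulty - 1   # blocks found too slowly -> lower difficulty
    return current_difficulty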

The process of mining in PoW systems is to try hashing different block configurations (thus different hash inputs), until you get one that passes the difficulty check. The nonce parameter is there specifically for this purpose, to let miners try different values to get a hash that passes the difficulty check. Additionally, one can play to some extent with the choice of transactions to include as well as the exact timestamp used. These all change the input to the hash function, and thus the produced hash value. But nonce is the intended main value to change. Additional considerations can also be applied, such as extra_nonce etc.  but the main concept is the same.

What is the difficulty exactly? It is simply a reference to a hash value that needs to be found. With higher difficulty, the target hash value is set to be harder to find. With lower difficulty, it is set to be easier to find. This is based on the fact that a hash value is just a very large number, even if usually represented as a hex-string. Examples:

  • Difficulty target set to 100k: You need to find a hash value that is less than 100k.
  • Difficulty target set to 1M: You need to find a hash value that is less than 1M.

Out of these two, the 100k difficulty target is higher in difficulty, because there are much fewer possible target values (100000). 1M is a much “easier” difficulty target since there are 10x the possible accepted hash output values (100000 vs 1 million). Thus, a higher difficulty target means a lower difficulty (i.e., easier to find a suitable hash and thus create a valid block). Most confusing, I am sure. There can be variants of this, such as using a number of leading zeroes required for the hash output instead of under a specific number, but the concept is still the same.

The hashing algorithm presented earlier (block_hash=…) produces a single value. And we cannot control what value it produces, since hash functions are one-way and no one has been able to figure out a way to reverse them (to directly generate an input that gives a specific hash output). So the only way to find a hash matching the target difficulty is to keep trying different input combinations until we hit one which produces a suitable hash value (below the difficulty target). The “Proof of Work” is then just proof that you have generated huge numbers of these hashes, enough to find one matching the difficulty criteria. Hashing is the work, the proof is producing a hash matching the difficulty target.
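
Putting these pieces together, a brute-force mining loop could look roughly like this (a minimal sketch; calculate_block_hash is the illustrative hashing function sketched further above, and target is the difficulty target as a plain integer):

def mine_block(prev_hash, timestamp, block_data, target):
	# keep incrementing the nonce until the hash, interpreted as an integer,
	# falls below the difficulty target
	nonce = 0
	while True:
		block_hash = calculate_block_hash(prev_hash, timestamp, block_data, nonce)
		if int(block_hash, 16) < target:
			return nonce, block_hash
		nonce += 1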

Peer to Peer (P2P) consensus

Once someone finds a block (produces an accepted hash..), the block needs to be added to the blockchain. This happens when the finding node propagates the new block with the valid block hash to its peers, which propagate it further to their peers, and so on for the whole network. The peers can all check that the data in the block produces a valid hash in relation to the difficulty target, and that the data in the block is verified (prev_hash matches the block_hash of the previous block added to the chain, transactions in the transaction_list are valid, etc.). This of course requires all nodes to be connected in a peer-to-peer fashion.

The peer-to-peer network is established and continuously updates itself over time. When a blockchain node boots up, it typically establishes a connection to the blockchain network via a pre-configured set of seed nodes. These can be static IP addresses stored in the client, or a set of pre-defined DNS server addresses. Once the node has connected this way to its first set of peers, these peers provide it with additional peer nodes to connect to. The node can request more peers from these peers, and so on. Connections to nodes may drop over time, and the node can use the peer lists it receives from others to find new ones.

The initial synchronization of the blockchain state happens by the nodes sending their confirmed blocks, transactions, and other blockchain state information to each other and to any new peer joining the network. Once nodes are synchronized, newly mined blocks are added to the blockchain as soon as some node finds them. New “money” is created by the mining node adding a “coinbase” transaction to any new block it finds (gets an accepted hash for), giving itself the number of coins agreed in the network specification for finding a new block.

So this is how distributed consensus is achieved in PoW networks. But how about PoS networks? Not too different (or so I think), but the way of creating and assigning new blocks has to be different. That is the whole point, of course..

Proof of Stake (PoS)

The above information on Proof of Work is actually relatively simple to find on the internet and in books. The Bitcoin book I linked above is one great source for such information. This is maybe because Bitcoin is such a well-known and long-standing project; it has a huge market for information, and so there is a lot of information about it. For Proof of Stake, there is much less information, and I cannot name a single coin on the PoS side that would be such a “canonical” project. Ethereum is talking about making the switch from PoW to PoS, and some proposals for this in the form of the “Casper” proposals are available. However, these are written in a very theoretical manner, and digress all over the place. Not so great for someone like me, who just wants a good understanding at a bit more detailed level.

So what do I think is Proof of Stake? At a high level it is always described as distributing new coins based on the amount of coins you already have. So each round (staking interval) gives a set of new coins to existing owners, based on their “stake” in the network. The stake would be the number of coins you already have, and are willing to “stake”. And quite often this is referred to as requiring the user's wallet to be online to be “staking”, that is, to be able to receive these rewards. So one might expect that the wallet is connected to other peers in a similar peer-to-peer network as for PoW, but the network somehow agrees to give new coins periodically to someone “staking” instead of using a PoW hashing contest. But how does this staking actually happen? How does a distributed and presumably decentralized network ever agree on who gets the coins?

I found some Reddit post (which I lost the link for) and a couple of StackExchange posts (post1, post2). And the Blackcoin and Peercoin whitepapers. These all seem to describe the same thing, even if with a bit of a different twist. In any case, what they describe makes sense, so I will go with that. There is also the Ouroboros whitepaper, the Tendermint whitepaper, and the Ethereum PoS FAQ. But these drift off in a way too academic direction for me.

The algorithm I get from these is very similar to that of PoW hashing. The main difference is that there is no field in the hashed block data that can be varied as freely as the nonce is in the PoW implementations. Instead, only a selected set of core values in the block is used as the state given as input to the hash function. This set of core values is called the “kernel”. Since the data selected for this input has only a limited number of possible values, and thus hash outputs, it makes no sense for the “miner” to spend a lot of effort and energy on trying different values. Mainly because there are only a few values to try at each point in time. This is how you try to escape the hashing contest of PoW coins, and save the planet by reducing energy consumption.

So in this scenario, I would assume each “staker” takes their set of kernel values, along with a timestamp, hashes these, and if they get a value that fits the difficulty target of the network, they can be selected to “mint” new coins. The timestamp is masked to remove the lowest bits, in order to make it infeasible to try a large number of close-to-each-other timestamps. The number of coins owned is used as a way to weight the hash outcome, increasing the likelihood of “minting” the new coins on each round based on the number of coins staked. A number of previous block hashes is also included in the equation to make it more difficult to try to come up with different hash values for future blocks.
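
As a very rough sketch of how I read this (all the names, the timestamp masking, and the coin weighting here are just my interpretation of the whitepapers linked above, and real implementations differ in what exactly goes into the kernel):

import hashlib

TIMESTAMP_MASK = 0xF  # drop the lowest bits so nearby timestamps hash the same

def check_stake(prev_block_hashes, unspent_output_id, timestamp, coins_staked, target):
	# the "kernel": a small, fixed set of values the staker cannot freely vary
	masked_time = timestamp & ~TIMESTAMP_MASK
	kernel = str(prev_block_hashes) + str(unspent_output_id) + str(masked_time)
	kernel_hash = hashlib.sha256(kernel.encode("utf-8")).hexdigest()
	# weighting by the number of coins staked: more coins means a larger
	# effective target, and thus a higher chance of being allowed to mint the block
	return int(kernel_hash, 16) < target * coins_staked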

From the posts I linked above, I get a more sensible explanation of how the node that creates a block is selected. So the hash you end up with from your kernel data is compared against the network difficulty (similar to PoW), and the difficulty is also adjusted at similar intervals as with PoW. Some of the early coins (e.g., the Peercoin whitepaper) also discuss adding coin age into the equation to favor coins that have not been selected recently. Later on, the Blackcoin whitepaper sees this as a potential problem (even a possible attack vector) and removes age from the equation. And from there I guess it has diverged into all sorts of academic exercises that are simple but made to look complex.

In any case, I guess there are other ways to do different parts of PoS. But the process in general as I described above seems like it would make sense. Please do correct and tell me how it really works.. 🙂

Something I wonder after all this is, why is it always said you need to have your wallet online to “stake”? I don’t see a need, other than to have something around to calculate your kernel hash. You could just grab the kernel input data from the wallet, and keep a separate program online to calculate the PoS hashes. This way there would be much less risk of the wallet being compromised, which is often cited as a risk of PoS staking. I must be missing something again?

 

Predicting issue categories on Github

Practical examples of applying machine learning seem to be a bit difficult to find. So I tried to create one for a presentation I was doing on testing and data analytics. I made a review of works in the area, and just chose one to illustrate the idea. This one tries to predict a target category to assign to an issue report. I used ARM mBed OS as a test target since it has issues available on Github and there were some people who work with it attending the presentation.

This demo “service” I created works by first training a predictive model based on a set of previous issue reports. I downloaded the reports from the issue repository. The amount of data available there was so small that I just downloaded the issues manually using the Github API, which lets me download the data for 100 issues at a time. Automating the download should be pretty easy if needed. The amount of data is small, and there is a large number of categories to predict, so not the best setup for results, but it serves as an example to illustrate the concept.

And no, there is no deep learning involved here, so not quite that trendy. I don’t think it is all that necessary for this purpose or this size of data. But it could of course work better; if you do try it, post the code so we can play with it as well.

The Github issues API allows me to download the issues in batches. For example, to download page 12 of closed issues, with 100 issues per page, the URL to request is https://api.github.com/repos/ARMmbed/mbed-os/issues?state=closed&page=12&per_page=100. The API seems to cut it down to 100 even if I use values bigger than 100. Or I just didn’t quite use it right, whichever. The API docs describe the parameters quite clearly. I downloaded open and closed issues separately, even if I did not use the separation in any meaningful way in the end.
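
Automating the download could look roughly like this (a sketch using the requests library; the page counts and the file naming scheme below are just my assumptions for illustration):

import json
import requests

def download_issues(state, pages):
	#fetch the given number of issue pages (100 issues per page) and save each page as JSON
	for page in range(1, pages + 1):
		url = "https://api.github.com/repos/ARMmbed/mbed-os/issues?state=%s&page=%d&per_page=100" % (state, page)
		issues = requests.get(url).json()
		with open("data/mbed-%s-%d.json" % (state, page), "w", encoding="utf-8") as f:
			json.dump(issues, f)

#the page counts here are only examples
download_issues("closed", 12)
download_issues("open", 3)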

The code here is all in Python. The final classifier/prediction services code is available on my Github repository.

First build a set of stopwords to do some cleaning on the issue descriptions:

	from string import punctuation
	from nltk.corpus import stopwords

	stop_words = set(stopwords.words('english'))
	stop_words = stop_words.union(set(punctuation))
	stop_words.update(["''", "--", "**"])

The above code uses the common NLTK stopwords, a set of punctuation symbols, and a few commonly occurring symbol combinations I found in the data. Since later on I clean it up with another regular expression, probably just the NLTK stopwords would suffice here as well..

To preprocess the issue descriptions before applying machine learning algorithms:

import re
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def preprocess_report(body_o):
	#clean issue body text. leave only alphabetical and numerical characters and some specials such as +.,:/\
	body = re.sub('[^A-Za-z0-9 /\\\_+.,:\n]+', '', body_o)
	# replace URL separators with space so the parts of the url become separate words
	body = re.sub('[/\\\]', ' ', body)
	# finally lemmatize all words for the analysis
	lemmatizer = WordNetLemmatizer()
	# text tokens are the basis for the features
	text_tokens = [lemmatizer.lemmatize(word) for word in word_tokenize(body.lower()) if word not in stop_words]
	return text_tokens

The above code is intended to remove all but standard alphanumeric characters from the text, remove stop words, and tokenize the remaining text into separate words. It also splits URLs into parts as separate words. The lemmatization changes known words into their base forms (e.g., “car” and “cars” become “car”). This just makes it easier for the machine learning algorithm to match words together. Another option is stemming, but lemmatization produces more human-friendly words, so I use that.
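
As a quick example of what this might produce (the example input is made up, and the exact tokens depend on the NLTK data used):

tokens = preprocess_report("The cars crashed when opening /dev/ttyUSB0")
#roughly: ['car', 'crashed', 'opening', 'dev', 'ttyusb0']
print(tokens)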

I stored the downloaded issues as JSON files (as Github API gives) in the data directory. To read all these filenames for processing:

import glob

#names of files containing closed and open issues (at time of download)
closed_files = glob.glob("data/*-closed-*")
open_files = glob.glob("data/*-open-*")

To process those files, I need to pick only the ones with an assigned “component” value. This is the training target label. The algorithm is trained to predict this “component” value from the issue description, so without this label, the piece of data is not useful for training.

import json

#globals collected across all the processed files
all_text_tokens = []
component_labels = []
total = 0

def process_files(files):
	'''
	process the given set of files by collecting issue body text and labels.
	also cleans and lemmatizes the body text

	:param files: names of files to process
	:return: nothing, appends to the global lists above
	'''
	global total

	for filename in files:
		with open(filename, encoding="utf-8") as json_data:
			print(filename)
			all_issues = json.load(json_data)
			for issue in all_issues:
				labels = issue["labels"]
				for label in labels:
					if label["name"].startswith("component:"):
						name = label["name"]
						body_o = issue["body"]
						text_tokens = preprocess_report(body_o)
						all_text_tokens.append(text_tokens)
						#component_labels are the prediction targets
						component_labels.append(name)
						total += 1

process_files(closed_files)
process_files(open_files)
print("total: ", total)

There is a limited number of such labeled data items, as many of the downloaded issues do not have this label assigned. The print at the end of the above code shows the total number of items with the “component” label given, and the number in this dataset is 1078.

Besides removing stop-words and otherwise cleaning up the documents for NLP, combining words sometimes makes sense. Pairs, triplets, and so on are sometimes meaningful. A typical example is the words “new” and “york” in a document, versus “new york”. This would be an example of a bi-gram, combining two words into “new_york”. To do this, I use the gensim package:

import gensim

#https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
# Build the bigram and trigram models
bigram = gensim.models.Phrases(all_text_tokens, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[all_text_tokens], threshold=100)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

#just to see it works
print(trigram_mod[bigram_mod[all_text_tokens[4]]])

#transform identified word pairs and triplets into bigrams and trigrams
text_tokens = [trigram_mod[bigram_mod[text]] for text in all_text_tokens]

#build whole documents from text tokens. some algorithms work on documents not tokens.
texts = [" ".join(tokens) for tokens in text_tokens]

The above code uses thresholds and minimum co-occurrence counts to avoid combining every possible word with every other possible word. So only word pairs and triplets that are commonly found to occur together are combined (replaced) in the documents.

Use the Python data processing library Pandas to turn it into a suitable format for the machine learning algorithms:

from pandas import DataFrame

df = DataFrame()

df["texts"] = texts
df["text_tokens"] = text_tokens
df["component"] = component_labels

print(df.head(2))

First to have a look at the data:

#how many issues are there in our data for each target label (assigned component counts)
value_counts = df["component"].value_counts()
#print how many times each component/target label exists in the training data
print(value_counts)
#remove all targets for which we have fewer than 10 training samples.
#stratified k-fold validation with 5 splits needs at least 5 to have 1 in each split; 10 gives at least 2 per fold, still tiny but it at least runs
indices = df["component"].isin(value_counts[value_counts > 9].index)
#"indices" is a boolean mask: .loc keeps only the rows whose component is in the filtered set
df = df.loc[indices, :]

The above code actually already does a bit more. It also filters the dataset to remove the rows with component values that have fewer than 10 items. So this is the unfiltered list:

component: tools              162
component: hal                128
component: export             124
component: networking         118
component: drivers            110
component: rtos                88
component: filesystem          80
component: tls                 78
component: docs                60
component: usb                 54
component: ble                 38
component: events              14
component: cmsis               10
component: stdlib               4
component: rpc                  4
component: uvisor               2
component: greentea-client      2
component: compiler             2

And after filtering, the last four rows will have been removed. So in the end, the dataset will not have any rows with the labels “rpc”, “uvisor”, “greentea-client”, or “compiler”. This is because I will later use stratified 5-fold cross-validation, which requires a minimum of 5 items of each. Filtering with a minimum of 10 instances per label, it is at least possible to have 2 of the least common “component” in each fold.

In a more realistic case, much more data would be needed to cover all categories, and I would also look at possibly combining some of the different categories. And rebuilding the model every now and then, depending on how much effort it is, how much new data comes in, etc.

To use the “component” values as target labels for machine learning, they need to be numerical (integers). This does the transformation:

from sklearn.preprocessing import LabelEncoder

# encode class values as integers
encoder = LabelEncoder()
encoded_label = encoder.fit_transform(df.component)

Just to see how the mapping of integer ids to labels looks after label encoding:

import numpy as np

unique, counts = np.unique(encoded_label, return_counts=True)
print(unique) #the set of unique encoded labels
print(counts) #the number of instances for each label

The result (first line = id, second line = number of items):

[ 0  1  2  3  4  5  6  7  8  9 10 11 12]
[ 38  10  60 110  14 124  80 128 118  88  78 162  54]

Mapping the labels to integers:

#which textual label/component name matches which integer label
le_name_mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
print(le_name_mapping)

#which integer matches which textual label/component name
le_id_mapping = dict(zip(encoder.transform(encoder.classes_), encoder.classes_))
print(le_id_mapping)

So the first is to print “label: id” pairs, and the second to print “id: label” pairs. The first one looks like this:

'component: ble': 0, 
'component: cmsis': 1, 
'component: docs': 2, 
'component: drivers': 3, 
'component: events': 4, 
'component: export': 5, 
'component: filesystem': 6, 
'component: hal': 7, 
'component: networking': 8, 
'component: rtos': 9, 
'component: tls': 10, 
'component: tools': 11, 
'component: usb': 12

Now, to turn the text into suitable input for a machine learning algorithm, I transform the documents into their TF-IDF representation. Well, if you go all deep learning with LSTM and the like, this may not be necessary. But don’t take my word for it, I am still trying to figure some of that out.

TF-IDF stands for term frequency (TF) – inverse document frequency (IDF). For example, if the word “bob” appears often in a document, it has a high term frequency for that document. Generally, one might consider such a word to describe that document well (or the concepts in the document). However, if the same word also appears commonly in all the documents (in the “corpus”), it is not really specific to that document, and not very representative of that document vs all other documents in the corpus. So IDF is used to modify the TF so that words that appear often in a document but less often in others in the corpus get a higher weight. And if the word appears often across many documents, it gets a lower weight. This is TF-IDF.
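
As a rough illustration of one common formulation (scikit-learn's TfidfVectorizer, used below, applies a smoothed and slightly different variant, so the exact numbers would not match):

import math

def tf_idf(term_count_in_doc, doc_length, num_docs, num_docs_with_term):
	#term frequency: how often the word occurs in this document
	tf = term_count_in_doc / doc_length
	#inverse document frequency: rarer across the corpus means a higher weight
	idf = math.log(num_docs / num_docs_with_term)
	return tf * idf

#"bob" appears 5 times in a 100-word document, and in 10 of the 1000 documents in the corpus
print(tf_idf(5, 100, 1000, 10))  #0.05 * ln(100), about 0.23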

Traditional machine learning approaches also require a more fixed-size set of input features. Since documents are of varying length, this can be a bit of an issue. Well, I believe some deep learning models also require this (e.g., CNN), while others less so (e.g., sequential models such as LSTM). Digressing. TF-IDF also (as far as I understand) results in a fixed-length feature vector for all documents. Or read this on Stack Overflow and make your own mind up.

Anyway, to my understanding, the feature space (the set of all features) after TF-IDF processing becomes the set of all unique words across all documents. Each of these is given a TF-IDF score for each document. For the words that do not exist in a document, the score is 0. And most documents don’t have all words in them, so this results in a very “sparse matrix”, where the zeroes are not actually stored. That’s how you can process a reasonably sized set of documents in memory.

So, in any case, to convert the documents to TF-IDF presentation:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5)

#the features here are the whole documents built earlier (the "texts" column)
features = df["texts"]
#transform all documents into TF-IDF vectors.
#TF-IDF vectors are all the same length, with the same word at the same index, the value being the TF-IDF score for that word in that document
features_transformed = vectorizer.fit_transform(features)

The above code fits the vectorizer to the corpus and then transforms all the documents to their TF-IDF representations. To my understanding (from SO), the fit part counts the word occurrences in the corpus, and the transform part uses these overall counts to transform each document into TF-IDF.

It is possible also to print out all the words the TF-IDF found in the corpus:

#the TFIDF feature names is a long list of all unique words found
print(vectorizer.get_feature_names())
feature_names = np.array(vectorizer.get_feature_names())
print(len(feature_names))

Now to train a classifier to predict the component based on a given document:

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

kfold = StratifiedKFold(n_splits=5) #5-fold cross validation

#the classifier to use, the parameters are selected based on a set i tried before
clf = RandomForestClassifier(n_estimators=50, min_samples_leaf=1, min_samples_split=5)

results = cross_val_score(clf, features_transformed, encoded_label, cv=kfold)

print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

#fit the classifier on the TFIDF transformed word features, train it to predict the component
clf.fit(features_transformed, encoded_label)
probabilities = clf.predict_proba(features_transformed[0])
print(probabilities)

In the above I am using the RandomForest classifier, with a set of parameters previously tuned. I am also using 5-fold cross-validation, meaning the data is split into 5 different parts. The parts are “stratified”, meaning each fold has about the same percentage of each target label as the original set. This is why I removed the labels with fewer than 10 instances in the beginning, to have at least 2 for each class. Which is still super-tiny, but that's what this example is about.

The last part of the code above also runs a prediction on one of the transformed documents just to try it out.

Now, to run predictions on previously unseen documents:

import json
import requests

def predict_component(issue):
	'''
	use this to get a set of predictions for a given issue.

	:param issue: issue id from github.
	:return: list of tuples (component name, probability)
	'''
	#first load text for the given issue from github
	url = "https://api.github.com/repos/ARMmbed/mbed-os/issues/" + str(issue)
	r = requests.get(url)
	url_json = json.loads(r.content)
	print(url_json)
	#process the loaded issue data to format matching what the classifier is trained on
	issue_tokens = preprocess_report(url_json["body"])
	issue_tokens = trigram_mod[bigram_mod[issue_tokens]]
	issue_text = " ".join(issue_tokens)
	features_transformed = vectorizer.transform([issue_text])
	#and predict the probability of each component type
	probabilities = clf.predict_proba(features_transformed)
	result = []
	for idx in range(probabilities.shape[1]):
		name = le_id_mapping[idx]
		prob = (probabilities[0, idx]*100)
		prob_str = "%.2f%%" % prob
		print(name, ":", prob_str)
		result.append((name, prob_str))
	return result

This code takes as a parameter an issue number for the ARM mBed Github repo. It downloads the issue data and preprocesses it similarly to the training data (clean, tokenize, lemmatize, bigram/trigram transform, TF-IDF). This is then used as the set of features to predict the component, based on the model trained earlier.

The “predict_component” method/function can then be called from elsewhere. In my case, I made a simple web page to call it. As noted in the beginning of this post, you can find that webserver code, as well as all the code above on my Github repository.

That’s pretty much it. Not very complicated to put some Python lines one after another, but knowing which lines and in which order is perhaps what takes the time to learn. Having someone else around to do it for you if you are a domain expert (e.g., testing, software engineering or similar in this case) is handy, but it can also be useful to have some idea of what happens, or how the algorithms in general work.

Something I left out in all the above was the code to try out different classifiers and their parameters. So I will just put it below for reference.

First a few helper methods:

import numpy as np
import pandas as pd

def top_tfidf_feats(row, features, top_n=25):
	''' Get top n tfidf values in row and return them with their corresponding feature names.'''
	topn_ids = np.argsort(row)[::-1][:top_n]
	top_feats = [(features[i], row[i]) for i in topn_ids]
	df = pd.DataFrame(top_feats)
	df.columns = ['feature', 'tfidf']
	return df

#this prints it for the first document in the test set
#(features_test_transformed comes from a train/test split of the TF-IDF features, not shown here)
arr = features_test_transformed[0].toarray()
top_tfidf_feats(arr[0], feature_names)

def show_most_informative_features(vectorizer, clf, n=20):
	feature_names = vectorizer.get_feature_names()
	coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
	top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
	for (coef_1, fn_1), (coef_2, fn_2) in top:
		print ("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))

In the above code, “top_tfidf_feats” prints the top words with the highest TF-IDF scores for a document. So in a sense, it prints the words that TF-IDF has determined to most uniquely represent that document.

The “show_most_informative_features” prints the top features that a given classifier has determined to be most descriptive/informative for distinguishing target labels. This only works for certain classifiers, which have such simple coefficients (feature weights). Such as multinomial naive Bayes (MultinomialNB below).

Here is the code to actually try out the classifiers then:

from sklearn.naive_bayes import MultinomialNB

#features_train_transformed / labels_train (and the test variants below) come from
#a train/test split of the TF-IDF features and encoded labels, not shown here
clf = MultinomialNB()
clf.fit(features_train_transformed, labels_train)

from sklearn.metrics import accuracy_score

y_pred = clf.predict(features_test_transformed)
y_true = labels_test
acc_score = accuracy_score(y_true, y_pred)
print("MNB accuracy:"+str(acc_score))

show_most_informative_features(vectorizer, clf)

#try it out on a single document
probabilities = clf.predict_proba(features_test_transformed[0])
print(probabilities)

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

#set of parameters to try
estimators = [10, 20, 30, 40, 50]
min_splits = [5, 10, 20, 30, 40, 50]
min_leafs = [1, 2, 5, 10, 20, 30]

kfold = StratifiedKFold(n_splits=5) #5-fold cross validation

best_acc = 0.0
best_rf = None
for estimator in estimators:
	for min_split in min_splits:
		for min_leaf in min_leafs:
			print("estimators=", estimator, "min_split=", min_split, " min_leaf=", min_leaf)

			clf = RandomForestClassifier(n_estimators=estimator, min_samples_leaf=min_leaf, min_samples_split=min_split)

			results = cross_val_score(clf, features_transformed, encoded_label, cv=kfold)

			print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

			if results.mean() > best_acc:
				best_acc = results.mean()
				best_rf = clf
				print("found better:", best_acc, ", ", best_rf)

print("best classifier:")
print(best_rf)

best_acc = 0
best_rf = None
for estimator in estimators:
	for min_split in min_splits:
		for min_leaf in min_leafs:
			print("estimators=", estimator, "min_split=", min_split, " min_leaf=", min_leaf)

			clf = RandomForestClassifier(n_estimators=estimator, min_samples_leaf=min_leaf, min_samples_split=min_split)
			clf.fit(features_train_transformed, labels_train)

			pred = clf.predict(features_test_transformed)

			accuracy = accuracy_score(labels_test, pred)

			print(accuracy)

			if accuracy > best_acc:
				best_acc = accuracy
				best_rf = clf
				print("found better:", best_acc, ", ", best_rf)
	

In the code above, I use loops to run through the parameters. There is also something called GridSearchCV in scikit-learn, as well as RandomizedSearchCV (for cases where trying all combos is expensive). But I prefer the ability to control the loops, print out whatever I like, and all that.
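
For reference, here is a sketch of doing roughly the same search with scikit-learn's GridSearchCV (not what I used above, just the alternative mentioned):

from sklearn.model_selection import GridSearchCV

param_grid = {
	"n_estimators": estimators,
	"min_samples_split": min_splits,
	"min_samples_leaf": min_leafs,
}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=kfold)
grid.fit(features_transformed, encoded_label)
print(grid.best_params_)
print(grid.best_score_)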

The above code also shows two ways I tried to train/evaluate the RandomForest parameters. The first is with k-fold, the latter with a single train/test split. I picked MultinomialNB and RandomForest because some internet searching gave me the impression they might work reasonably well for unbalanced class sets such as this one. Of course, the final idea is always to try and see what works. This worked quite fine for me. Or so it seems; machine learning seems to always be about iterating stuff and learning and updating as you go. More data could change this all, or maybe finding some mistake, or having more domain or analytics knowledge, finding mismatching results, or anything really.

What “unbalanced” refers to is the number of instances of different components in this dataset: some “components” have many bug reports, while others have far fewer. For many learning algorithms this seems to be an issue. Some searches indicated RandomForest should be fairly robust for this type of data, so this is also one reason I used it.

Running the above code to experiment with the parameters also produced some slightly concerning results. The accuracy for the classifier ranged from 30% to 95% with smallish parameter changes. I would guess that also speaks for the small dataset causing potential issues. Also, re-running the same code would give different classifications for new (unseen) instances. Which is what you might expect when I am not setting the randomization seed. But then I would also expect the accuracy to vary somewhat, which it didn’t. So just don’t take this as more than an example of how you might apply ML for some SW testing related tasks. Take it to highlight the need to always learn more, try more things, get a decent amount of data, evolve models constantly, etc. And post some comments on all the things you think are wrong in this post/code so we can verify the approach of learning and updating all the time :).

In any case, I hope the example is useful for giving an idea of one way how machine learning could be applied in software testing related aspects. Now write me some nice LSTM or whatever is the latest trend in deep learning models, figure out any issues in my code, or whatever, and post some comments. Cheers.