
Attention for Bug Predictions

Previously I wrote about building an issue category predictor using LSTM networks on Keras. That was a two-layer bi-directional LSTM network. A neural network architecture that has been gaining some "attention" recently in NLP is Attention. It is simply an approach to have the network pay more "attention" to specific parts of the input. That's what I think anyway. The way to use Attention layers is to add them on top of other existing layers.

In this post, I look at adding Attention to the network architecture of my previous post, and how this impacts the resulting accuracy and training of the network. Since Keras still does not have an official Attention layer at this time (or I cannot find one anyway), I am using one from CyberZHG’s Github. Thanks for the free code!

Network Configurations

I tried a few different (neural) network architectures with Attention, including the ones from my previous post, with and without Glove word embeddings. In addition to these, I tried adding a dense layer before the final output layer, after the last attention layer. Just because I heard bigger is better :). The maximum model configuration of this network looks like this:

[Figure: attention model architecture]

With a model summary as:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
Input (InputLayer)           (None, 1000)              0         
_________________________________________________________________
embedding (Embedding)        (None, 1000, 300)         6000000   
_________________________________________________________________
lstm-bi1 (Bidirectional)     (None, 1000, 256)         440320    
_________________________________________________________________
drop1 (Dropout)              (None, 1000, 256)         0         
_________________________________________________________________
seq_self_attention_3 (SeqSel (None, 1000, 256)         16449     
_________________________________________________________________
lstm-bi2 (Bidirectional)     (None, 1000, 128)         164864    
_________________________________________________________________
drop2 (Dropout)              (None, 1000, 128)         0         
_________________________________________________________________
seq_weighted_attention_3 (Se (None, 128)               129       
_________________________________________________________________
mid_dense (Dense)            (None, 128)               16512     
_________________________________________________________________
drop3 (Dropout)              (None, 128)               0         
_________________________________________________________________
output (Dense)               (None, 165)               21285     
=================================================================
Total params: 6,659,559
Trainable params: 659,559
Non-trainable params: 6,000,000
_________________________________________________________________

I reduced this maximal configuration into a few different variants to see what the impact would be on the loss and accuracy of the predictions.

I used 6 variants, each with the 2 bi-directional LSTM layers:

  • 2-att-d: each bi-lstm followed by attention. dropout of 0.5 and 0.3 after each bi-lstm
  • 2-att: each bi-lstm followed by attention, no dropout
  • 2-att-d2: each bi-lstm followed by attention, dropout of 0.2 and 0.1 after each bi-lstm
  • 2-att-dd: 2-att-d2 with dense in the end, dropouts 0.2, 0.1, 0.3
  • 1-att-d: 2 bi-directional layers, followed by single attention. dropout 0.2 and 0.1 after each bi-lstm.
  • 1-att: 2 bi-directional layers, followed by single attention. no dropout.

The code for the model definition with all the layers enabled:

    input = Input(shape=(sequence_length,), name="Input")
    embedding = Embedding(input_dim=vocab_size, 
                          weights=[embedding_matrix],
                          output_dim=embedding_dim, 
                          input_length=sequence_length,
                          trainable=embeddings_trainable,
                          name="embedding")(input)
    lstm1_bi1 = Bidirectional(CuDNNLSTM(128, return_sequences=True, name='lstm1'), name="lstm-bi1")(embedding)
    drop1 = Dropout(0.2, name="drop1")(lstm1_bi1)
    attention1 = SeqSelfAttention(attention_width=attention_width)(drop1)
    lstm2_bi2 = Bidirectional(CuDNNLSTM(64, return_sequences=True, name='lstm2'), name="lstm-bi2")(attention1)
    drop2 = Dropout(0.1, name="drop2")(lstm2_bi2)
    attention2 = SeqWeightedAttention()(drop2)
    mid_dense = Dense(128, activation='sigmoid', name='mid_dense')(attention2)
    drop3 = Dropout(0.2, name="drop3")(mid_dense)
    output = Dense(cat_count, activation='sigmoid', name='output')(drop3)
    model = Model(inputs=input, outputs=output)

Results

The 3 tables below summarize the results for the different model configurations, using these embedding versions:

  • Glove initialized embeddings, non-trainable
  • Glove initialized embeddings, trainable
  • Uninitialized embeddings, trainable

Non-trainable glove:

model     epoch  max-accuracy  min-loss
2-att-d   9      0.434         2.3655
2-att     8      0.432         2.4323
2-att     7      0.430         2.3974
2-att-d2  8      0.435         2.3466
2-att-dd  9      0.423         2.4046
1-att     8      0.405         2.3132
1-att-d   8      0.410         2.4817

Trainable glove:

model     epoch  max-accuracy  min-loss
2-att-d   6      0.537         1.9735
2-att     5      0.540         1.9940
2-att     4      0.533         1.9643
2-att-d2  6      0.529         2.0279
2-att-d2  5      0.528         1.9852
2-att-dd  7      0.505         2.2565
1-att     7      0.531         2.0869
1-att     5      0.530         1.9927
1-att-d   6      0.526         2.0180

Trainable uninitialized:

model     epoch  max-accuracy  min-loss
2-att-d   7      0.510         2.1198
2-att-d   6      0.502         2.106
2-att     7      0.522         2.0915
2-att-d2  8      0.504         2.2104
2-att-d2  6      0.490         2.1684
2-att-dd  9      0.463         2.4869
1-att     8      0.510         2.1413
1-att     7      0.505         2.1138
1-att-d   8      0.517         2.1552

The training curves in general look similar to this (picked from one of the best results):

[Figure: training and validation accuracy/loss curves for the attention model]

So not too different from my previous results.

At least with this type of results, it is nice to see a realistic looking training + validation accuracy and loss curve, with training accuracy going up and crossing validation at some point close to where overfitting starts. I have recently done a lot of Kaggle with the intent to learn how to use all these things in practice, and I think Kaggle is a really great place for this. However, the competitions seem to be geared to trick you somehow: the given training vs test sets are usually really weirdly skewed, and the competition is about squeezing out those tiny fractions of accuracy. So compared to that, this is a nice, fresh view of the real world being more sane for ML application. But I digress..

Summary insights from the above:

  • The tables above show how adding Attention does increase the accuracy by about 10%, from a bit below 50% to about 54% in the best case.
  • Besides the impact of adding Attention, the rest of the configuration changes seem like mere fiddling, without much impact on accuracy or loss.
  • 2 Attention layers are slightly better than one
  • Dropout quite consistently has a small negative impact on performance. As I did not try that many configurations, maybe it could be improved by tuning.
  • Final dense layer here mainly has just a negative impact
  • The Kaggle kernels I ran this with have a nasty habit of sometimes cutting the output for some parts of the code. In this case it consistently cut the non-trainable Glove version output at around the 9th epoch, which is why all those in the above tables are listed as best around the 8th or 9th epoch. It might have shown small gains for one or two more epochs still. However, it was plateauing so strongly already that I do not think it is a big issue for these results.
  • Due to training with a smaller batch size taking longer, I had to limit epochs to 10, down from the previous post's 15. On the other hand, Attention seems to converge faster, so not that bad a tradeoff.

Attention: My Final Notes

I used the Attention layer from the Github repository I linked. Very nice work in that Github in many related respects BTW. Definitely worth checking out. This layer seems very useful. However, it seems to be tailored to the Github owner's specific needs and is not documented in much detail. There seem to be several different variants of Attention layers floating around the internet, some only working on previous versions of Keras, others requiring 3D input, others only 2D.

For example, the above Attention layers work on 3D inputs: SeqSelfAttention takes 3D sequences in and outputs 3D sequences, while SeqWeightedAttention takes 3D input and outputs 2D. There is at least one implementation being copy-pasted around in Kaggle kernels that uses 2D inputs and outputs. Some other custom Keras layers seem to have gone stale. Another I found on Github seems promising but has not been updated; one of its issues links to a patched version though. In any case, my goal was not to compare different custom implementations, so I will just wait for the official one and play with this for now.
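To make the 3D vs 2D point concrete, here is a minimal shape check I would expect to work, assuming the keras-self-attention package from the linked Github provides both layers as used above (pip install keras-self-attention):

from keras.layers import Input, LSTM, Bidirectional
from keras.models import Model
from keras_self_attention import SeqSelfAttention, SeqWeightedAttention

# 3D input: (timesteps, features); the batch dimension is added by Keras as None
inp = Input(shape=(1000, 300))
seq = Bidirectional(LSTM(128, return_sequences=True))(inp)  # -> (None, 1000, 256)
self_att = SeqSelfAttention()(seq)                          # 3D in, 3D out: (None, 1000, 256)
weighted = SeqWeightedAttention()(self_att)                 # 3D in, 2D out: (None, 256)
Model(inputs=inp, outputs=weighted).summary()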

As noted, I ran these experiments on Kaggle kernels. At the time they were running on NVidia P100 GPUs, which are intended to be datacenter-scale products. These have 16GB of GPU memory, which at this time is a lot. Using the two attention layers I described above, I managed to exhaust this memory quite easily. This is maybe because I used a rather large sequence length of 1000 timesteps (words). The model summary I printed above shows the Attention layers having only 16449 and 129 parameters to train, so the implementation must otherwise require plenty of space. Not that I understand the details at such depth, but something to consider.

Some of the errors I got while setting up these Attention layers also seemed to indicate it was building a 4D representation, by adding another dimension (of size 1000) on top of the layer it was paying attention to (the bi-LSTM in this case). This sort of makes sense, considering it takes a 3D input (the LSTM sequence output) and pays attention to it across the timesteps. The attention window is just one parameter that could be tuned in this Attention implementation, so a better understanding of this implementation and its tuning parameters/options/impacts would likely be useful, and maybe help with many of my issues.

Overall, as far as I understand, using a smaller number of timesteps is quite common. Likely using fewer would still give very good results, while allowing more freedom to experiment with other parts of the model architecture without running out of memory. The memory issue required me to run with a much smaller batch size of 48, down from 128 and higher before. This again has the effect of slowing training, as a smaller batch size takes longer to process the whole dataset.

"Official" support for Attention has been a long time coming (well, in terms of DL frameworks anyway..), and seems to be quite awaited feature (so the search-engines tell me). The comments I link above (in the Keras issue tracker on Github) also seem to contain various proposals for implementation. Perhaps the biggest issue still being the need to figure out how the Keras team wants to represent Attention to users, and how to make it as easy to use (and I suppose effective) as possible. Still, over these years of people waiting, maybe it would be nice to have something and build on that? Of course, as a typical OSS customer, I expect to have all this for free, so that is my saltmine.. 🙂

Some best practices / easy to understand documentation I would like to see:

  • Tradeoffs in using different types of Attention: 3D, 2D, attention windows, etc.
  • Attention in multi-layer architectures, where does it make the most sense and why (intuitively too)
  • Parameter explanations and tuning experiences / insights (e.g., attention window size)
  • Other general use types in different types of networks

Bug Report Classifier with LSTM on Keras

I previously did a review of applications of machine learning in software testing and network analysis. I was looking at updating that, maybe with some extra focus. As usual, I got distracted. This time into building an actual system to do some of the tasks discussed in those reviews. This post discusses how I built a bug report classifier based on bug report descriptions. Or more generally, they are issues listed in a public Jira, but never mind..

The classifier I built here is based on bi-directional LSTM (long short-term memory) networks using Keras (with Tensorflow). So deep learning, recurrent neural networks, word embeddings. Plenty of trendy things to see here.

Getting some data

The natural place to go looking for this type of data is open source projects and their bug databases. I used the Qt project bug tracker (see, even the address has the word "bug" in it, not "issue"). It seems to be based on the commonly used Jira platform. You can go to the web site, select some project, fill in filters, click on export, and directly get a CSV-formatted output file that can simply be imported into Pandas and thus into Python ML and data analytics libraries. This is what I did.

Since the export interface only allows downloading data for 1000 reports at a time, I scripted it. Using Selenium WebDriver, I automated filling in the download filters one month at a time. This script is stored in my ML-experiments Github, along with a script that combines all the separate downloads into one CSV file. Hopefully I don't move these around the repo too much and keep breaking the links.

Some points in building such a downloader (a rough sketch of the download-wait part follows the list):

  • Disable save dialog in browser via Selenium settings
  • Autosave on
  • Wait for download to complete by scanning partial files or otherwise
  • Rename latest created file according to filtered time
  • Check filesizes, bug creation dates and index continuity to see if something was missing
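As a rough sketch of the download-wait-and-rename part (not the exact script from my repo; the Chrome ".crdownload" partial-file suffix and the paths here are assumptions):

import glob
import os
import time

def wait_for_download(download_dir, timeout=120):
    #Chrome keeps a .crdownload file around while a download is still in progress
    end_time = time.time() + timeout
    while time.time() < end_time:
        if not glob.glob(os.path.join(download_dir, "*.crdownload")):
            csv_files = glob.glob(os.path.join(download_dir, "*.csv"))
            if csv_files:
                #the newest CSV should be the one that just finished
                return max(csv_files, key=os.path.getctime)
        time.sleep(1)
    raise TimeoutError("download did not finish in time")

#rename according to the month filter used for this batch (example name)
latest = wait_for_download("downloads")
os.rename(latest, os.path.join("downloads", "issues-2019-01.csv"))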

Exploring the data

Before running at full speed to build a classifier, it is generally good to explore the data a bit and see what it looks like, what can be learned, etc. Commonly this is called exploratory data analysis (EDA).

First, read the data and set dates to date types to enable date filters:

import pandas as pd

df_bugs = pd.read_csv("bugs/reduced.csv", 
             parse_dates=["Created", "Due Date", "Resolved"])
print(df_bugs.columns)

This gives 493 columns:

Index(['Unnamed: 0', 'Affects Version/s', 'Affects Version/s.1',
       'Affects Version/s.10', 'Affects Version/s.11', 
       'Affects Version/s.12',
       'Affects Version/s.13', 'Affects Version/s.14', 
       'Affects Version/s.15', 'Affects Version/s.16',
       ...
       'Status', 'Summary', 'Time Spent', 'Updated', 'Votes',
       'Work Ratio', 'Σ Original Estimate', 'Σ Remaining Estimate',
       'Σ Time Spent'], dtype='object', length=493)

This is a lot of fields for a bug report. The large count is because Jira seems to dump multiple valued items into multiple columns. The above snippet shows an example of "Affects version" being split to multiple columns. If one bug has at most 20 affected versions, then all exported rows will have 20 columns for "Affects version". So one item per column if there are many. A simple way I used to combine them was by the count of non-null values:

#this dataset has many very sparse columns, 
#where each comment is a different field, etc.
#this just sums all these comment1, comment2, comment3, ... 
#as a count of such items
def sum_columns(old_col_name, new_col_id):
	old_cols = [col for col in all_cols if old_col_name in col]
	olds_df = big_df[old_cols]
	olds_non_null = olds_df.notnull().sum(axis=1)
	big_df.drop(old_cols, axis=1, inplace=True)
	big_df[new_col_id] = olds_non_null

#just showing two here as an example
sum_columns("Affects Version", "affects_count")
sum_columns("Comment", "comment_count")
...
print(df_bugs.columns)
Index(['Unnamed: 0', 'Assignee', 'Created', 'Creator', 
       'Description',
       'Due Date', 'Environment', 'Issue Type', 'Issue id', 
       'Issue key',
       'Last Viewed', 'Original Estimate', 'Parent id', 'Priority',
       'Project description', 'Project key', 'Project lead', 
       'Project name', 'Project type', 'Project url', 
       'Remaining Estimate', 'Reporter',
       'Resolution', 'Resolved', 'Security Level', 'Sprint', 
       'Sprint.1',
       'Status', 'Summary', 'Time Spent', 'Updated', 'Votes', 
       'Work Ratio', 'Σ Original Estimate', 
       'Σ Remaining Estimate', 'Σ Time Spent',
       'outward_count', 'custom_count', 'comment_count', 
       'component_count',
       'labels_count', 'affects_count', 'attachment_count',
       'fix_version_count', 'log_work_count'],
      dtype='object')

So, that would be 45 columns after combining several of the counts. Down from 493, and it makes it easier to find the bugs with the most votes, comments, etc. This enables views such as:

df_bugs.sort_values(by="Votes", ascending=False)
        [["Issue key", "Summary", "Issue Type", 
          "Status", "Votes"]].head(10)

[Figure: top 10 most voted issues]

In a similar way, bug priority counts:

order = ['P0: Blocker', 'P1: Critical', 'P2: Important',
         'P3: Somewhat important','P4: Low','P5: Not important',
         'Not Evaluated']

df_2019["Priority"].value_counts().loc[order]
       .plot(kind='bar', figsize=(10,5))

[Figure: issue priority counts]

In addition, I ran various other summaries and visualizations on it to get a bit more familiar with the data.

The final point was to build a classifier and see how well it does. A classifier needs a classification target. I went with the assigned component. So, my classifier tries to predict the component to assign a bug report to, using only the bug report's natural language description.

To start with, a look at the components. A bug report in this dataset can be assigned to multiple components. Similar to the "Affects version" above.

The distribution looks like this:

df_2019["component_count"].value_counts().sort_index()
1     64499
2      5616
3       596
4        64
5        10
6         5
7         1
8         3
9         1
10        3
11        1

This shows that having an issue assigned to more than 2 components is rare, and more than 3 very rare. For this experiment, I only collected the first two components the bugs were assigned to (if any). Of those, I simply used the first assigned component as the training target label. Some further data could be had by adding training set items with labels for the second and third components as well. Or for all of them if feeling like it. But the first component served well enough for this experiment.
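The comp1 and comp2 columns I use below are not created in the snippets above. A minimal sketch of how they could be picked from the raw export (before the sum_columns() step), assuming the multi-valued columns are named "Component/s", "Component/s.1", and so on, the same way as "Affects Version/s" above:

#hypothetical sketch: keep only the first two component columns per issue
comp_cols = sorted(col for col in all_cols if col.startswith("Component/s"))
big_df["comp1"] = big_df[comp_cols[0]]
big_df["comp2"] = big_df[comp_cols[1]] if len(comp_cols) > 1 else None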

How many unique ones are in those first two?

values = set(df_2019["comp1"].unique())
values.update(df_2019["comp2"].unique())
len(values)
172

So there would be 172 components to predict. And what does their issue count distribution look like?

counts = df_2019["comp1"].value_counts()
counts
1.  QML: Declarative and Javascript Engine     5260
2.  Widgets: Widgets and Dialogs               4547
3.  Documentation                              3352
4.  Quick: Core Declarative QML                2037
5.  Qt3D                                       1928
6.  QPA: Other                                 1898
7.  Build tools: qmake                         1862
8.  WebEngine                                  1842
9.  Packaging & Installer                      1803
10. Build System                               1801
11. Widgets: Itemviews                         1534
12. GUI: Painting                              1480
13. Multimedia                                 1478
14. GUI: OpenGL                                1462
15. Quick: Controls                            1414
16. GUI: Text handling                         1378
17. Core: I/O                                  1265
18. Device Creation                            1255
19. Quick: Controls 2                          1173
20. GUI: Font handling                         1141
...
153. GamePad                                     14
154. KNX                                         12
155. QPA: Direct2D                               11
156. ODF Writer                                   9
157. Network: SPDY                                8
158. GUI: Vulkan                                  8
159. Tools: Qt Configuration Tool                 7
160. QPA: KMS                                     6
161. Extras: X11                                  6
162. PIM: Versit                                  5
163. Cloud Messaging                              5
164. Testing: QtUITest                            5
165. Learning/Course Material                     4
166. PIM: Organizer                               4
167. SerialBus: Other                             3
168. Feedback                                     3
169. Systems: Publish & Subscribe                 2
170. Lottie                                       2
171. CoAP                                         1
172. Device Creation: Device Utilities            1

The above shows that the issue count distribution is very unbalanced.

To summarize the above, there are 172 components, with a very uneven distribution. Imagine trying to predict the correct component from 172 options, given that for some of them there is very limited data available. It would seem very difficult to learn to distinguish the ones with very little training data. I guess this skewed distribution might be due to new components having little data on them. Which, in a more realistic scenario, would merit some additional consideration. Maybe collecting these all into a new category like "Other, manually check" (see the sketch below). And updating the training data constantly, re-training the model as new issues/bugs are added. Well, that is probably a good idea anyway.
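For reference, the "Other" bucket idea could be as simple as something like this (not something I did here, just a sketch):

#sketch: lump components with very few issues into a catch-all category
counts = df_2019["comp1"].value_counts()
rare = counts[counts < 10].index
df_2019["comp1_grouped"] = df_2019["comp1"].where(
    ~df_2019["comp1"].isin(rare), "Other, manually check")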

Besides these components with very few linked issues, there are several in the dataset marked as "Inactive". These would likely also be beneficial to remove from the training set, since we would not expect to see any new issues coming for them. I did not do it, as for my experiment this is fine even without. In any case, this is what it looks like:

df_2019[df_2019["comp1"].str.contains("Inactive")]["comp1"].unique()
array(['(Inactive) Porting from Qt 3 to Qt 4',
       '(Inactive) GUI: QWS Integration (Qt4)', '(Inactive) Phonon',
       '(Inactive) mmfphonon', '(Inactive) Maemo 5', 
       '(Inactive) OpenVG',
       '(Inactive) EGL/Symbian', '(Inactive) QtQuick (version 1)',
       '(Inactive) Smart Installer ', '(Inactive) JsonDB',
       '(Inactive) QtPorts: BB10', '(Inactive) Enginio'], dtype=object)

I will use the "description" column for the features (the words in the description), and the filtered "comp1" column shown above for the target.

Creating an LSTM Model for Classification

This classifier code is available on Github. As shown above, some of the components have very few values (issues to train on). Longish story shorter, I cut out the targets with fewer than 10 values:

min_count = 10
df_2019 = df_2019[df_2019['comp1']
              .isin(counts[counts >= min_count].index)]

This enabled me to do a 3-way train-validation-test set split and still have some data for each 3 splits for each target component. A 3-way stratified split that is. Code:

from sklearn.model_selection import train_test_split

def train_val_test_split(X, y):
    X_train, X_test_val, y_train, y_test_val = \
       train_test_split(X, y, test_size=0.2, 
                        random_state=42, stratify=y)
    X_val, X_test, y_val, y_test = \
       train_test_split(X_test_val, y_test_val, test_size=0.25,
                        random_state=42, stratify=y_test_val)
    return X_train, y_train, X_val, y_val, X_test, y_test

Before using that, I need to get the data to train, that is the X (features) and y (target).

To get the features, tokenize the text. For an RNN the input data needs to be a fixed-length vector (of tokens), so cut the document at seq_length if longer, or pad it to that length if shorter. This uses the Keras tokenizer, which I trust to produce suitable output for Keras..

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

def tokenize_text(vocab_size, texts, seq_length):
    tokenizer = Tokenizer(num_words=vocab_size)
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)

    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))

    X = pad_sequences(sequences, maxlen=seq_length)
    print('Shape of data tensor:', X.shape)

    return texts, X, tokenizer

That produces the X. To produce the y:

from sklearn.preprocessing import LabelEncoder

df_2019.dropna(subset=['comp1', "Description"], inplace=True)
# encode class values as integers 
# so they work as targets for the prediction algorithm
encoder = LabelEncoder()
df_2019["comp1_label"] = encoder.fit_transform(df_2019["comp1"])

The "comp1_label" in above now has the values for the target y variable.

To put these together:

data = df_2019["Description"]
vocab_size = 20000
seq_length = 1000
data, X, tokenizer = tokenize_text(vocab_size, data, seq_length)

y = df_2019["comp1_label"]
X_train, y_train, X_val, y_val, X_test, y_test = 
                            train_val_test_split(X, y)

The 3 sets of y_xxxx variables still need to be converted to Keras format, which is a one-hot encoded 2D-matrix. To do this after the split:

from keras.utils import to_categorical

y_train = to_categorical(y_train)
y_val = to_categorical(y_val)
y_test = to_categorical(y_test)

Word Embeddings

I am using Glove word vectors. In this case the relatively small set trained on 6 billion tokens (words), with 300 dimensions. The vectors are stored in a text file, one word per line along with its vector values: the first item on a line is the word, followed by the 300 vector values for it. The following loads this into the embeddings_index dictionary, keys being words and values the vectors.

import os
import numpy as np

def load_word_vectors(glove_dir):
    print('Indexing word vectors.')

    embeddings_index = {}
    f = open(os.path.join(glove_dir, 'glove.6B.300d.txt'),
             encoding='utf8')
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()

    print('Found %s word vectors.' % len(embeddings_index))
    return embeddings_index

With these loaded, convert the embedding index into the matrix form that the Keras Embedding layer uses. This simply puts the embedding vector for each word at a specific index in the matrix. So if the word "bob" is at index 10 in word_index, the embedding vector for "bob" will be in embedding_matrix[10].

def embedding_index_to_matrix(embeddings_index, vocab_size,
                              embedding_dim, word_index):
    print('Preparing embedding matrix.')

    # prepare embedding matrix
    num_words = min(vocab_size, len(word_index))
    embedding_matrix = np.zeros((num_words, embedding_dim))
    for word, i in word_index.items():
        if i >= vocab_size:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

Build the bi-LSTM model

I use two model versions here. The first one uses the basic LSTM layer from Keras. The second one uses the Cuda optimized CuDNNLSTM layer. I used the CuDNNLSTM layer to train this model on GPU, saved the weights after training, and then loaded the weights into the plain LSTM version. I then used the plain LSTM version to do the predictions on my laptop when developing and demoing this.

Plain LSTM version:

def build_model_lstm(vocab_size, embedding_dim, 
                     embedding_matrix, sequence_length, cat_count):
    input = Input(shape=(sequence_length,), name="Input")
    embedding = Embedding(input_dim=vocab_size, 
                          weights=[embedding_matrix],
                          output_dim=embedding_dim,
                          input_length=sequence_length,
                          trainable=False,
                          name="embedding")(input)
    lstm1_bi1 = Bidirectional(LSTM(128, return_sequences=True,
                      name='lstm1'), name="lstm-bi1")(embedding)
    drop1 = Dropout(0.2, name="drop1")(lstm1_bi1)
    lstm2_bi2 = Bidirectional(LSTM(64, return_sequences=False,
                      name='lstm2'), name="lstm-bi2")(drop1)
    drop2 = Dropout(0.2, name="drop2")(lstm2_bi2)
    output = Dense(cat_count, 
                   activation='sigmoid', name='sigmoid')(drop2)
    model = Model(inputs=input, outputs=output)
    model.compile(optimizer='adam', 
          loss='categorical_crossentropy', metrics=['accuracy'])
    return model

CuDNNLSTM version:

def build_model_lstm_cuda(vocab_size, embedding_dim, 
                 embedding_matrix, sequence_length, cat_count):
    input = Input(shape=(sequence_length,), name="Input")
    embedding = Embedding(input_dim=vocab_size,
                          output_dim=embedding_dim, 
                          weights=[embedding_matrix],
                          input_length=sequence_length,
                          trainable=False,
                          name="embedding")(input)
    lstm1_bi1 = Bidirectional(CuDNNLSTM(128, 
                              return_sequences=True, name='lstm1'),
                              name="lstm-bi1")(embedding)
    drop1 = Dropout(0.2, name="drop1")(lstm1_bi1)
    lstm2_bi2 = Bidirectional(CuDNNLSTM(64, 
                              return_sequences=False, name='lstm2'),
                              name="lstm-bi2")(drop1)
    drop2 = Dropout(0.2, name="drop2")(lstm2_bi2)
    output = Dense(cat_count, 
                   activation='sigmoid', name='sigmoid')(drop2)
    model = Model(inputs=input, outputs=output)
    model.compile(optimizer='adam', 
          loss='categorical_crossentropy', metrics=['accuracy'])
    return model

The structure of the above models visualizes to this:

[Figure: model structure diagram]

The first layer is an embedding layer, and it uses the embedding matrix from the pre-trained Glove vectors. This is followed by the two bi-LSTM layers, each with a Dropout layer behind it. The bi-LSTM layers look at each word along with its context (as I discussed previously). The dropout layers help avoid overfitting. Finally, a dense layer is used to make the prediction between "cat_count" categories. Here cat_count is the number of categories to predict. It is actually categories and not cats, sorry about that.

The "weights=[embedding_matrix]" parameter given to the Embedding layer is what can be used to initialize the pre-trained word-vectors. In this case, those would be the Glove word-vectors. The current Keras Embedding docs say nothing about this parameter, which is a bit weird. Searching for this on the internet also seems to indicate it would be deprecated etc. but it also seems difficult to find a simple replacement. But it works, so I go with that..

In a bit more detail, the model summarizes to this:

model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
Input (InputLayer)           (None, 1000)              0         
_________________________________________________________________
embedding (Embedding)        (None, 1000, 300)         6000000   
_________________________________________________________________
lstm-bi1 (Bidirectional)     (None, 1000, 256)         440320    
_________________________________________________________________
drop1 (Dropout)              (None, 1000, 256)         0         
_________________________________________________________________
lstm-bi2 (Bidirectional)     (None, 128)               164864    
_________________________________________________________________
drop2 (Dropout)              (None, 128)               0         
_________________________________________________________________
sigmoid (Dense)              (None, 165)               21285     
=================================================================
Total params: 6,626,469
Trainable params: 626,469
Non-trainable params: 6,000,000
_________________________________________________________________

This shows how the embedding layer turns the input into a suitable shape for LSTM input as I discussed in my previous post. That is, 1000 timesteps, each with 300 features. Those being the 1000 tokens for each document (issue report description) and 300 dimensional word-vectors for each token.

Another interesting point is the text at the end of the summary: "Non-trainable params: 6,000,000". This matches the number of parameters in the summary for the embedding layer. When the embedding layer is given the parameter "trainable=False", all the parameters in it are fixed. If this is set to True, then all these parameters will be trainable as well.

Training it

Training the model is simple now that everything is set up:

checkpoint_callback = ModelCheckpoint(filepath=
   "./model-weights-issue-pred.{epoch:02d}-{val_loss:.6f}.hdf5",
   monitor='val_loss', verbose=0, save_best_only=True)

model = build_model_lstm_cuda(vocab_size=vocab_size,
                    embedding_dim=embedding_dim,
                    sequence_length=seq_length,
                    embedding_matrix=embedding_matrix,
                    cat_count=len(encoder.classes_))

history = model.fit(X_train, y_train,
          batch_size=128,
          epochs=15,
          validation_data=(X_val, y_val),
          callbacks=[checkpoint_callback])

model.save("issue_model_word_embedding.h5")

score, acc = model.evaluate(x=X_test,
                            y=y_test,
                            batch_size=128)
print('Test loss:', score)
print('Test accuracy:', acc)

Notice that I use the build_model_lstm_cuda() version here. That is to train on the GPU environment, to have some sensible training time.

The callback given will just save the model weights when the validation score improves. In this case it monitors the validation loss getting smaller (no mode="max" as in my previous version).
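For predicting on my laptop, the saved weights can then be loaded into the plain (non-CuDNN) LSTM version, since the two model definitions have identical structure. A small sketch of what I mean (the checkpoint file name here is just an example of what the ModelCheckpoint pattern above would produce):

#build the plain LSTM version and load the GPU-trained weights into it
model = build_model_lstm(vocab_size=vocab_size,
                         embedding_dim=embedding_dim,
                         embedding_matrix=embedding_matrix,
                         sequence_length=seq_length,
                         cat_count=len(encoder.classes_))
model.load_weights("model-weights-issue-pred.08-2.154632.hdf5")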

Predicting

Prediction with a trained model:

# le_id_mapping maps prediction indices back to component names
# (presumably built from the LabelEncoder classes)
def predict(bug_description, seq_length):
    #texts_to_sequences vs text_to_word_sequence?
    sequences = tokenizer.texts_to_sequences([bug_description])

    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))

    X = pad_sequences(sequences, maxlen=seq_length)

    probs = model.predict(X)
    result = []
    for idx in range(probs.shape[1]):
        name = le_id_mapping[idx]
        prob = (probs[0, idx]*100)
        prob_str = "%.2f%%" % prob
        #print(name, ":", prob_str)
        result.append((name, prob))
    return result

Running this on the bug QTBUG-74496 gives the following top predictions:

Quick: Other: 0.3271
(Inactive) QtQuick (version 1): 0.4968
GUI: Drag and Drop: 0.7292
QML: Declarative and Javascript Engine: 2.6450
Quick: Core Declarative QML : 5.8533

The bigger number signifies a higher likelihood given by the classifier. This highlights many of the topics I mentioned above. There is one inactive component in there, which supports the point that it might be better to remove all inactive ones from the training set. The top one suggested (Quick: Core Declarative QML) is not the one assigned to the report at this time, but the second highest is (QML: Declarative and Javascript Engine). They both seem to be associated with the same top-level component (QML), and I do not have the expertise to say why one might be better than the other in this case.

In most of the issue reports I tried this on, it seemed to get the "correct" one as marked on the issue tracker. In the ones that did not match, the suggestion always seemed to make sense (sometimes more than what had been set by whoever sets the value), and commonly in case of a mismatch, the "correct" one was still in the top suggestions. But overall, component granularity might be useful to consider as well in building these types of classifiers and their applications.
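To use this as an assistant in the way described, one could simply sort the output of the predict() helper above and show the top few suggestions (bug_description here being whatever issue text you want to check):

#show the top-5 component suggestions for a given description
result = predict(bug_description, seq_length)
top5 = sorted(result, key=lambda pair: pair[1], reverse=True)[:5]
for name, prob in top5:
    print("%s: %.2f%%" % (name, prob))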

Usefulness of pre-trained word-embeddings

When doing the training, I started with using the Glove embeddings, and sometimes just accidentally left them out and trained without them. This reminded me to do an experiment to see how much effect using the pre-trained embeddings actually has, and how the accuracy etc. gets affected with or without them. So I trained the model with different options for the Embedding layer:

  • fixed Glove vectors (trainable=False)
  • trainable Glove vectors (trainable=True)
  • trainable, non-initialized vectors (trainable=True, no Glove)

The training accuracy/loss curves look like this:

[Figure: training accuracy/loss curves for the three embedding options]

The figure is a bit small but you can hopefully expand it. At least I uploaded it bigger :).

These results can be summarized as:

  • Non-trainable Glove alone improves all the way for the 15 iterations (epochs) and reaches validation accuracy of about 0.4 and loss around 2.55. The improvements got really small there so I did not try further epochs.
  • Trainable, uninitialized (no Glove) got best validation accuracy/loss at epoch 11 for 0.488 accuracy and 2.246 loss. After this it overfits.
  • Trainable with Glove initialization reaches best validation accuracy/loss at epoch 8 for 0.497 accuracy and 2.154 loss. After this it overfits.

Some interesting points I gathered from this:

  • Overall, non-trainable Glove gives quite poor results (but I guess still quite usable) compared to the trainable embeddings in the two other options.
  • Glove initialized but further trained embeddings converge much faster and get better scores.
  • I guess further trained Glove embeddings would be a form of "transfer" learning. Cool, can I put that in my CV now?

My guess is that the bug descriptions have many terms not commonly used in general domains, which makes further training of the general Glove embeddings more effective. When I did some exploratory analysis of the data (e.g., TF-IDF across components), these types of terms were actually quite visible. However, they are intermixed with the general terms, which benefit from Glove, and mixing the two gives the best results. Just my "educated" guess. Another of my guesses is that this would be quite a similar result in other domains as well, with domain-specific terminology.

General Notes (or "Discussion" in academic terms..)

The results from the training and validation showed about 50% accuracy. In a binary classification problem this would not be so great. More like equal to random guessing. But with 160+ targets to choose from, this seems very good to me. The loss is maybe a better metric, but I am not that good at interpreting the loss against 160+ categories. Simply smaller is better, and it should overall signify how far off the predictions for the categories are (but how to measure and interpret the true distance over all categories, when you just give one as the correct target label, you tell me?).

Also, as noted earlier, an issue can be linked to several components. And from my tries of running this with new data and comparing results, the mapping is not always very clear, and there can be multiple "correct" answers. This is also shown in my prediction example above. The results given by Keras predict() actually list the probabilities for each of the 160+ potential targets. So if accuracy just measures the likelihood of getting it exactly right, it misses the ones that are predicted at position 2, 3, and so on. One way I would see of using this would be to provide assistance on selections to an expert analyzing the incoming bug reports. In such a case, having "correct" answers even in the few top predictions would seem useful. Again, as shown in my prediction example above.

With that, the 50% chance of getting it 100% correct for the top prediction actually seems very good. Consider that there are over 160 possible targets to predict. So not nearly as simple as a binary classifier, where 50% would match random guessing. Here this is much better than that.

Besides the LSTM, I also tried a more traditional classifier on this same task. I tried several versions, including multinomial naive Bayes, random forest, and LightGBM. The features I provided were heavily pre-processed word tokens in their TF-IDF format. That classifier was pretty poor, perhaps even uselessly poor. Largely perhaps due to my lack of skills in feature engineering for the domain, and the lack of hyperparameter optimization. But that was still the case. With that background, I was surprised to see good performance from the LSTM version.

Overall, the predictions given by this classifier I built are not perfect, but much closer than I expected to get. Getting the correct one out of 160+ categories most of the time, and often being close to the correct one, based only on the natural language description, was a very nice result for me.

In cases where the match is not perfect, I believe it still provides valuable input. Either the match is given in the other top suggestions, or maybe we can learn something about considering why some of the others are suggested, and is there some real meaning behind it. All the mismatches I found made sense when considering the reported issue from different angles. I would guess the same might hold for other domains and similar classifiers as well.

Many other topics to investigate would include:

  • different types of model architectures (layers, neuron counts, GRU, 1D CNN, …)
  • Attention layers (Keras still does not include support but they are very popular now in NLP)
  • Different dimensions of embeddings
  • Different embeddings initializers (Word2Vec)
  • effects of more preprocessing
  • N-way cross-validation
  • training the final classifier on the whole training data at once, after finishing the model tuning etc.

Learning to LSTM

What is this

This is about time-series prediction/classification with neural networks using Keras. I will not go into the theory or a description of recurrent neural nets or LSTM itself; rather, there are plenty of tutorials out there, and search engines give plenty more. Try some if not already familiar. I just try to focus on what I found confusing after reading those, and how that went.

EDIT: Links Kaggle kernel, Github

Overly long intro

Recently I have been trying to learn to use LSTM (Long Short Term Memory) networks. I picked up the Kaggle VSB Power Line Fault Detection competition. The point of this competition was to classify power line signals as faulty or not faulty. The given data looks like this:

[Figure: raw signal plot from the training data]

Just based on the raw signal data, the challenge in this competition was to classify a signal as faulty/not faulty.

As shown in the figure above, there are 800k values per signal in the data. There are 8712 signals in the training set. These signals are 3-phase power measurements, always 3 signals together forming a single 3-phase "measurement" or observation. The total number of such 3-phase groups is 8712/3 = 2904. The data looks something like this:

train_meta = pd.read_csv("../input/metadata_train.csv")
#There are 3 rows/signals per measurement, so 6 prints first 2
train_meta.head(6)

The shape of the (training) data is 800k rows and 8712 columns, one column for each signal, one row for each signal value measured in 20 millisecond intervals.

df_sig.shape

(800000, 8712)

A few rows for two of the first 3-signal sets (columns 1-3 and 4-6):

The goal is to use this raw data to identify fault patterns. LSTM networks were very popular in this competition, as the data is a set of 8712 time-series instances. So it is a good place to learn how to use LSTM.

I like Kaggle in general for this, as there are good kernels to get started with, and discussion on what works. Of course, I always do poorly in the competitions, but I find it good practice anyway.

Input shape

One thing I found confusing, and based on my internet searches many others did too, is how to shape and form the input for the LSTM network. To train it, and to run the "predictions".

The input shape should be a 3-dimensional array of (number of observations, number of timesteps, number of features). Using the data from above, what would that be?

Each signal measurement would be one observation. In the way the above data is structured, this would be 8712 observations, 800000 timesteps, and 1 feature (the raw value). So the input shape would be (8712, 800000, 1).

This is because we have (a one-line reshape sketch follows the list):

  • 8712 separate signal observations,
  • 800000 measurements for each signal (at 20ms intervals, so a time-series),
  • 1 feature, which is just the raw measurement value for the signal.
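As a sketch of what that naive shaping would look like, given the raw dataframe loaded above (800000 rows, one column per signal):

#transpose so each row is one signal, then add the single-feature axis
X_raw = df_sig.values.T.reshape(8712, 800000, 1)
print(X_raw.shape)  # (8712, 800000, 1)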

Number of timesteps

I initially tried to train the network with this setup. It didn't go very well. I found some links saying the number of timesteps should be much smaller, such as below 200, max around 250-500, 10 steps or less, or up to 1000 timesteps. Not really sure which, but certainly much less than 800k.

One of the most popular kernels in the competition used a 160-timestep version. How do you go from 800k timesteps to 160 timesteps? By "binning" the 800k into 160 buckets. The size of such a bucket is 800000 / 160 = 5000. In Python this looks something like this:

bkt_count = 160
data_size = 800000
bkt_size = int(data_size/bkt_count)
for i in range(0, data_size, bkt_size):
    # cut data to bkt_size (bucket size)
    bkt_data_raw = data_sig[i:i + bkt_size]
    #now here bkt_data_raw needs to be summarized

The above would process the data into a set of 160 buckets, so producing the 160 timesteps.

Features

We can then calculate various summary statistics for these "buckets" as features, for example:

bkt_data_raw = data_sig[i:i + bkt_size]
bkt_avg_raw = bkt_data_raw.mean() #1
bkt_sum_raw = bkt_data_raw.sum() #1
bkt_std_raw = bkt_data_raw.std() #1
bkt_std_top = bkt_avg_raw + bkt_std_raw #1
bkt_std_bot = bkt_avg_raw - bkt_std_raw #1
bkt_percentiles = np.percentile(bkt_data_raw, [0, 1, 25, 50, 75, 99, 100]) #7

The above gives 1+1+1+1+1+7 = 12 features. I had a few more features, and experimented with various others, but this works to illustrate the concept.

So, assume we have the 160 timesteps with 12 summary features each. This would mean the LSTM input shape is now: (8712, 160, 12).

Transforming the input to 3-dimensions

Assume we loop through all the data (800k values) for a signal, and turn each signal into 160 rows with 12 features. How do we turn that into (8712, 160, 12) for the LSTM? The key (for me) is the numpy.reshape() function. We can do something like this:

X = df_train.values.reshape(measures, timesteps, total_features)

This works nicely because underneath most things in Python ML libraries rely on Numpy arrays (or arrays of arrays etc.). Assuming that df_train in the code above has a number of values that is divisible by the given reshape sizes, Numpy will reshape (as the name says..) it into a set of multi-dimensional arrays as requested.

In the above, the number of elements for one signal is 160*12 = 1920, because a signal now has 160 timesteps (buckets), each with 12 features. Since we know the number of rows (measured signals) we expect to have, the total number of values can be calculated as 8712*1920 = 16727040.

There are many ways to build the initial set of features. I started with putting all the features for a signal on a single row. So I had rows with 1920 columns, and 8712 such rows. Reshaping in the end worked fine and produced the correct shape of (8712, 160, 12).

However, with such a shape, it was a bit difficult to apply Pandas operations to the data, such as scaling all the features using sklearn scalers. Pandas operations expect a dataframe to have 2-dimensional data, where each column holds the data for a single feature. For example, the scalers scale one column at a time. With 1920 columns, the data for each feature was in every 12th column.

To address this, I ended up with data in the format of 160 * 8712 = 1393920 rows. Each row has N columns, where N equals the number of features I had, in this case 12. The 160 timesteps for each signal are each on their own row, with the 160 rows for one signal following each other, immediately followed by another 160 rows for the next signal, and so on. The final result in this case is a dataframe with a shape of (1393920, 12). As calculated above, this is the 160 timesteps for all 8712 signals, meaning 160 * 8712 = 1393920 rows.
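A rough sketch of building that long-format dataframe, combining the bucket loop from above with a loop over the signal columns (feature set trimmed to three here for brevity):

import pandas as pd

rows = []
for sig_idx in range(df_sig.shape[1]):        #8712 signal columns
    data_sig = df_sig.iloc[:, sig_idx]
    for i in range(0, data_size, bkt_size):   #160 buckets per signal
        bkt_data_raw = data_sig[i:i + bkt_size]
        #one row per (signal, bucket) with its summary features
        rows.append([bkt_data_raw.mean(),
                     bkt_data_raw.std(),
                     bkt_data_raw.sum()])
df_train = pd.DataFrame(rows, columns=["mean", "std", "sum"])
#with the full 12 features this would have shape (1393920, 12)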

This would look something like this for the first signal:

and for the second signal:

and so on for all 8712 signals, in chunks of 160 rows. So each feature is in its own column, and each timestep is on its own row.

Since all the features are in the same column for all the signals and all timesteps, general transformation tools such as sklearn scalers can be applied very simply:

from sklearn.preprocessing import MinMaxScaler

minmax = MinMaxScaler(feature_range=(-1,1))
df_train_scaled = pd.DataFrame(minmax.fit_transform(df_train))

This scales all the features for all timesteps and all observations (signals here) at once correctly. Well, I think it does anyway, so correct me if wrong :).

The scaled version for the 1st signal:

And to reshape it into the 3-dimensional array shaped (8712, 160, 12):

observations = 8712
timesteps = 160
total_features = 12
X = df_train_scaled.values.reshape(observations, timesteps, total_features)

dataframe.values is a Numpy array, so the Numpy reshape function is available on it.

X would now be suitable to use as input for an LSTM network. The one I have not shown how to build yet. Only a million words and stuff so far but no LSTM and this was to be all about LSTM. Nice. Good job.

Parallelizing pre-processing

On Kaggle, I eventually built my own kernel for this pre-processing, to create the features and scale them, and to save the results for later model experiments. It uses Python multiprocessing to create the buckets and features (a rough sketch follows the list below). I found this to have multiple benefits:

  • The transformation of the raw input data can be done once and used many times to experiment with different LSTM and other ML models and NN architectures. Since pre-processing large datasets takes time, this speeds up the process immensely.

  • Pandas is built to be single-core (as is Python itself..), so you have to do tricks like this or else your multiple cores are just sitting idly and wasted.

  • Kaggle specific: By running preprocessing in a separate kernel, I can run it in parallel in one kernel while experimenting with models in other kernels.

  • Kaggle specific: Kaggle CPU kernels have 4 CPU cores, allowing 2x faster preprocessing than GPU kernels, which have only 2 CPU cores. But you need the GPU kernels to build LSTM models.
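The multiprocessing part, roughly sketched (not the exact kernel code; process_signal() here is a hypothetical helper doing the bucketing and feature extraction for one signal column):

from multiprocessing import Pool

def process_signal(sig_idx):
    data_sig = df_sig.iloc[:, sig_idx]
    #...bucket the 800k values into 160 rows of summary features as shown earlier...
    return build_features(data_sig)  #hypothetical helper

#4 CPU cores available on a Kaggle CPU kernel
with Pool(4) as pool:
    per_signal_features = pool.map(process_signal, range(df_sig.shape[1]))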

In some parallel architectures like PySpark this would be less of a problem, but I do not have access to such systems, so I work with what I have, huh. Maybe someday if I am rich enough or whatever. Plz send food, beer, monies, and epic computer parts, thx.

Building the LSTM model

I made several attempts at building a model. Here is one quite directly based on the popular Kaggle kernel I linked at the beginning:

def create_model4(input_data):
    input_shape = input_data.shape
    inp = Input(shape=(input_shape[1], input_shape[2],), name="input_signal")
    x = Bidirectional(CuDNNLSTM(128, return_sequences=True, name="lstm1"), name="bi1")(inp)
    x = Bidirectional(CuDNNLSTM(64, return_sequences=False, name="lstm2"), name="bi2")(x)
    x = Dense(128, activation="relu", name="dense1")(x)
    x = Dense(64, activation="relu", name="dense2")(x)
    x = Dense(1, activation='sigmoid', name="output")(x)
    model = Model(inputs=inp, outputs=x)
    # matthews_correlation is a custom metric function (from the linked kernel), not a Keras built-in
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[matthews_correlation])
    return model

Few points here:

  • CuDNNLSTM is a specialized LSTM layer optimized for NVidia GPUs. I always thought Keras would automatically choose the optimal GPU implementation based on the backend. Seems not.

  • CuDNN is NVIDIA’s Deep Neural Network library.

  • Bidirectional LSTM refers to a network using information about the "past and future". This is only useful if you actually know the future, as in translating text from one language to another when you know what words will follow the current word. In this case, I know the whole signal, so the LSTM can also look at what values follow the current "bin" value in the following bins (timesteps).

  • The "return_sequences" flag tells the layer to give out a similar 3D array as the original input. So each of the 128 cells in the first layer will not just output on value for the sequence but also output an intermediate value for each timestep. So instead of shape (128), the output would be (160, 128).

  • The second LSTM layer outputs a shape of (64), which is just the final output of the LSTM after processing the timesteps. This is a suitable shape for the following dense layer, since it is just a flat array.

But let's see what the model actually looks like.

Inspecting the model

First, a one-way model to keep it a bit simpler still:

def create_model3(input_data):
    input_shape = input_data.shape
    inp = Input(shape=(input_shape[1], input_shape[2],), name="input_signal")
    x = CuDNNLSTM(128, return_sequences=True, name="lstm1")(inp)
    x = CuDNNLSTM(64, return_sequences=False, name="lstm2")(x)	    
    x = Dense(128, activation="relu", name="dense1")(x)
    x = Dense(64, activation="relu", name="dense2")(x)
    x = Dense(1, activation='sigmoid', name="output")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[matthews_correlation])
    return model

Summarizing the model:

model.summary()

Gives:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_signal (InputLayer)    (None, 160, 12)           0         
_________________________________________________________________
lstm1 (CuDNNLSTM)            (None, 160, 128)          72704     
_________________________________________________________________
lstm2 (CuDNNLSTM)            (None, 64)                49664     
_________________________________________________________________
dense1 (Dense)               (None, 128)               8320      
_________________________________________________________________
dense2 (Dense)               (None, 64)                8256      
_________________________________________________________________
output (Dense)               (None, 1)                 65        
=================================================================
Total params: 139,009
Trainable params: 139,009
Non-trainable params: 0
_________________________________________________________________

And to visualize this (in Jupyter):

from keras.utils import plot_model
plot_model(model, show_shapes=True, to_file="lstm.png")
from IPython.display import Image
Image("lstm.png")

Giving us:

This shows how the LSTM parameters affect the basic structure of the network.

  • input_signal: this layer shows the shape of the input data: 160 timesteps, 12 features. None stands for any number of rows (observations).

  • lstm1: 128 LSTM units, with return_sequences=True. Now the output is (None, 160, 128), where 128 matches the number of LSTM units, and replaces the number of features in the input. So 128 features, each one produced by a single LSTM "unit".

  • lstm2: 64 LSTM units, with return_sequences=False. Output shape is (None, 64). Outputting values at every timestep is disabled (return_sequences=False), so this is just 2 dimensional output. 64 features, one for each LSTM "unit". This for each row in the input as it comes through lstm1.

  • dense1, dense2, output: regular dense network layers.

And the same for a bi-directional LSTM:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_signal (InputLayer)    (None, 160, 60)           0         
_________________________________________________________________
bi1 (Bidirectional)          (None, 160, 256)          194560    
_________________________________________________________________
bi2 (Bidirectional)          (None, 128)               164864    
_________________________________________________________________
dense1 (Dense)               (None, 128)               16512     
_________________________________________________________________
dense2 (Dense)               (None, 64)                8256      
_________________________________________________________________
output (Dense)               (None, 1)                 65        
=================================================================
Total params: 384,257
Trainable params: 384,257
Non-trainable params: 0
_________________________________________________________________

So bi-directional doubles the number of features the LSTM layers produce, as it goes both forward and backward across the timesteps.

Training and predicting

Training and predicting with the LSTM model is no different from others, but a quick look anyway.

Training:

from sklearn.model_selection import StratifiedKFold
from keras import backend as K

splits = list(StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=2019).split(X, y))
preds_val = []
y_val = []
for idx, (train_idx, val_idx) in enumerate(splits):
    K.clear_session()
    print("Beginning fold {}".format(idx+1))
    train_X, train_y, val_X, val_y = X[train_idx], y[train_idx], X[val_idx], y[val_idx]
    model = create_model4(train_X)
    ckpt = ModelCheckpoint('weights_{}.h5'.format(idx), save_best_only=True, save_weights_only=True, verbose=1, monitor='val_matthews_correlation', mode='max')
    model.fit(train_X, train_y, batch_size=128, epochs=50, validation_data=[val_X, val_y], callbacks=[ckpt])
    model.load_weights('weights_{}.h5'.format(idx))

Prediction:

preds = []
for i in range(N_SPLITS):
    model.load_weights('weights_{}.h5'.format(i))
    pred = model.predict(X_test_input, batch_size=300, verbose=1)
    pred_3 = []
    for pred_scalar in pred:
        #one prediction per 3-phase measurement, repeated for each of its 3 signals
        for _ in range(3):
            pred_3.append(pred_scalar)
    preds.append(pred_3)
threshold = 0.5
preds_test = (np.squeeze(np.mean(preds, axis=0)) > threshold).astype(np.int)
submission['target'] = preds_test
submission.to_csv('submission_{}.csv'.format(seed), index=False)
submission.head()

With this, I managed to train the LSTM and have it produce actually meaningful results. Not that I got anywhere in the competition, but it was a good experience for me, since in my first tries I produced some completely broken input data, formatted all wrong. Training with that broken data format delivered zero results, which might be expected since the features were all wrong and the data for them was mixed up. I feel I mostly learned how to use an LSTM, even if I lost out in the Kaggle competition by the smallest fractions of the target metric.

From all this, I wandered off a bit to look at other things, such as what the actual structure of an LSTM network is when built with Keras, how to combine other types of data and layers alongside LSTM in the same network, and what adversarial validation is. Maybe I will manage to write a bit about them next.