Learning to LSTM

What is this

This is about time-series prediction/classification with neural networks using Keras. I will not go into theory or description of recurrent neural nets or LSTM itself, rather there are plenty tutorials out there. Search engines give plenty more. Try some if not already familiar. I just try to focus on what I found confusing after reading those, and how did that go.

EDIT: Links Kaggle kernel, Github

Overly long intro

Recently I have been trying to learn to use LSTM (Long Short Term Memory) networks. I picked up the Kaggle VSB Power Line Fault Detection competition. The point of this competition was to classify power line signals as faulty or not faulty. The given data looks like this:

signal plot

Just based on the raw signal data, the challenge in this competition was to classify the signal as fault/not faulty.

As shown in the figure above, there are 800k values per signal in the data. There are 8712 signals in the training set. These signals are for 3-phase power measures. Always 3 signals together to form a single 3-phase "measurement" or observation. The total number of such 3-phase groups is 8712/3 = 2904. The data looks something like this:

train_meta = pd.read_csv("../input/metadata_train.csv")
#There are 3 rows/signals per measurement, so 6 prints first 2
train_meta.head(6)

The shape of the (training) data is 800k rows and 8712 columns, one column for each signal, one row for each signal value measured in 20 millisecond intervals.

df_sig.shape

(800000, 8712)

Few rows for the two of the first 3-signal sets (columns 1-3 and 4-6):

Goal is to use this raw data to identify fault patterns. LSTM networks were very popular in this competition as the data is a set of 8172 time-series instances. So good place to learn how to use LSTM.

I like Kaggle in general for this, as there are good kernels to get started, and discussion on what works. Of course, I always do poor on the competitions but I find it good practice anyway.

Input shape

One thing I found confusing, and based on my internet searches, many others too, is how to shape and form the input for the LSTM network. To train it, and to run the "predictions".

The input shape should be a 3-dimensional array of (number of observations, number of timesteps, number of features). Using the data from above, what would that be?

Each signal measurement would be one observation. In the way above data is structured, this would be 8712 observations, 800000 timesteps, and 1 feature (the raw value). So input shape would be (8712, 800000, 1).

This is because we have:

  • 8712 separate signal observations,
  • 800000 measurements for each signal (at 20ms intervals, so a time-series),
  • 1 feature, which is just the raw measurement value for the signal.

Number of timesteps

I initially tried to train the network with this setup. Didn’t go very well. Found some links that say the number of timesteps should be much less, such as below 200, max around 250-500, 10 steps or less, or up to 1000 timesteps. Not really sure what, but certainly much less than 800k.

One of the most popular kernels in the competition was using a 160 timestep version. How do you go from 800k timesteps to 160 timesteps? By "binning" the 800k into 160 buckets. The size of such bucket is 800000 / 160 = 5000. In Python this looks something like this:

bkt_count = 160
data_size = 800000
bkt_size = int(data_size/bkt_count)
for i in range(0, data_size, bkt_size):
    # cut data to bkt_size (bucket size)
    bkt_data_raw = data_sig[i:i + bkt_size]
    #now here bkt_data_raw needs to be summarized

The above would process the data into a set of 160 buckets, so producing the 160 timesteps.

Features

We can then calculate various summary statistics for these "buckets" as features, for example:

bkt_data_raw = data_sig[i:i + bkt_size]
bkt_avg_raw = bkt_data_raw.mean() #1
bkt_sum_raw = bkt_data_raw.sum() #1
bkt_std_raw = bkt_data_raw.std() #1
bkt_std_top = bkt_avg_raw + bkt_std_raw #1
bkt_std_bot = bkt_avg_raw - bkt_std_raw #1
bkt_percentiles = np.percentile(bkt_data_raw, [0, 1, 25, 50, 75, 99, 100]) #7

The above gives 1+1+1+1+1+7 = 12 features. I had a few more features, and experimented with various others, but this works to illustrate the concept.

So, assume we have the 160 timesteps with 12 summary features each. This would mean the LSTM input shape is now: (8712, 160, 12).

Transforming the input to 3-dimensions

Assume we loop through all the data (800k values) for a signal, and turn each into 160 rows with 12 features. How do we turn that into (8712, 160, 12) for the LSTM? The key (for me) is in the numpy.reshape() function. We can do something like this:

X = df_train.values.reshape(measures, timesteps, total_features)

This works nicely because underneath most things in Python ML libraries rely on Numpy arrays (or arrays of arrays etc.). Assuming that df_train in the code above has a number of values that is divisible by the given reshape sizes, Numpy will reshape (as the name says..) it into a set of multi-dimensional arrays as requested.

In the above the number of elements for one signal is 160*12=1920. Because a signal now has 160 timesteps (buckets) each with 12 features. Since we know the number of rows (measured signals) we expect to have, the total can be calculated as 8172*1920 = 15690240.

There are many ways to build the initial set of features. I started with putting all the features for a timestep on a single row. So I had rows with 1920 columns. 8172 such rows. Reshaping in the end worked fine and produced the correct shape of (8172,160,12).

However, with such a shape, it was a bit difficult to apply Pandas operations on the data, such as scaling all the features using sklearn scalers. Pandas operations expect a dataframe to have 2-dimensional data, where each column has data for a single feature. For example, the scalers then scale one column at a time. With 1920 columns, each feature data was at every 12th column.

To address this, I ended up with data in the format of 160 * 8172 = 1307520 rows. Each row with N columns, where N equals the number of features I had. In this case 12. The 160 timesteps for each signal were each timestep on a single row, and 160 rows following each other for one signal. Immediately followed by another 160 for the next signal, and so on. Final size in this case being a dataframe with a shape of (1307520,12). As calculated above this is the 160 timesteps for all 8172 rows, meaning 160 * 8172 = 1307520 rows.

This would look something like this for the first signal:

and for the second signal:

and so on for all 8172 signals in chunks of 160 rows.. So each feature in its own column, each timestep on its own row.

Since all the features are in the same column for all the signals and all timesteps, general transformation tools such as sklearn scalers can be applied very simply:

minmax = MinMaxScaler(feature_range=(-1,1))
df_train_scaled = pd.DataFrame(minmax.fit_transform(df_train))

This scales all the features for all timesteps and all observations (signals here) at once correctly. Well, I think it does anyway, so correct me if wrong :).

The scaled version for the 1st signal:

And to reshape it into the 3-dimensional array shaped (8172, 160, 12):

observations = 8172
timesteps = 160
total_features = 12
X = df_train_scaled.values.reshape(observations, timesteps, total_features)

The dataframe.values is a Numpy array, so the Numpy.reshape function is there.

X would now be suitable to use as input for an LSTM network. The one I have not shown how to build yet. Only a million words and stuff so far but no LSTM and this was to be all about LSTM. Nice. Good job.

Parallelizing pre-processing

On Kaggle, I eventually built my own kernel for this pre-processing to create features and scale them, and to save the results for later model experiments. It uses Python multiprocessing to create the buckets and features. I found this to have multiple benefits:

  • The transformation of the raw input data can be done once and used many times to experiment with different LSTM and other ML models and NN architectures. Since pre-processing large datasets takes time, this speeds up the process immensely.

  • Pandas is built to be single-core (as is Python itself..), so you have to do tricks like this or else your multiple cores are just sitting idly and wasted.

  • Kaggle specific: By running preprocessing in a separate kernel, I can run it in parallel in one kernel while experimenting with models in other kernels.

  • Kaggle specific: Kaggle CPU kernels have 4 CPU cores, allowing 2*faster preprocessing than in GPU kernels which have only 2 CPU cores. But you need GPU kernels to build LSTM models.

In some parallel architectures like PySpark this would be less of a problem, but I do not have access to such systems, so I work with what I have, huh. Maybe someday if I am rich enough or whatever. Plz send food, beer, monies, and epic computer parts, thx.

Building the LSTM model

I made several attempts at building a model. Here is one quite directly based on the popular Kaggle kernel I linked at the beginning:

def create_model4(input_data):
    input_shape = input_data.shape
    inp = Input(shape=(input_shape[1], input_shape[2],), name="input_signal")
    x = Bidirectional(CuDNNLSTM(128, return_sequences=True, name="lstm1"), name="bi1")(inp)
    x = Bidirectional(CuDNNLSTM(64, return_sequences=False, name="lstm2"), name="bi2")(x)
    x = Dense(128, activation="relu", name="dense1")(x)
    x = Dense(64, activation="relu", name="dense2")(x)
    x = Dense(1, activation='sigmoid', name="output")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[matthews_correlation])
    return model

Few points here:

  • CuDNNLSTM is a specialized LSTM layer optimized for NVidia GPU’s. I always though Keras would come with automatically chosen optimal GPU implementation and pick it automatically based on backend. Seems not.

  • CuDNN is NVIDIA’s Deep Neural Network library.

  • Bidirectional LSTM refers to a network using information about "past and future". This is only useful if you actually know the future, as in translating text from one language to another and you know what words will follow the current word. In this case, I know the whole signal, so the LSTM can also look at what values follow the current "bin" value in the following bins (timesteps).

  • The "return_sequences" flag tells the layer to give out a similar 3D array as the original input. So each of the 128 cells in the first layer will not just output on value for the sequence but also output an intermediate value for each timestep. So instead of shape (128), the output would be (160, 128).

  • The second LSTM layer outputs a shape of (64), which is just the final output of the LSTM processing the timesteps. This is suitable shape for the following dense layer, since it is just an array.

But let’s see how the model actually looks like.

Inspecting the model

First, a one-way model to keep it a bit simpler still:

def create_model3(input_data):
    input_shape = input_data.shape
    inp = Input(shape=(input_shape[1], input_shape[2],), name="input_signal")
    x = CuDNNLSTM(128, return_sequences=True, name="lstm1")(inp)
    x = CuDNNLSTM(64, return_sequences=False, name="lstm2")(x)	    
    x = Dense(128, activation="relu", name="dense1")(x)
    x = Dense(64, activation="relu", name="dense2")(x)
    x = Dense(1, activation='sigmoid', name="output")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[matthews_correlation])
    return model

Summarizing the model:

model.summary()

Gives:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_signal (InputLayer)    (None, 160, 12)           0         
_________________________________________________________________
lstm1 (CuDNNLSTM)            (None, 160, 128)          72704     
_________________________________________________________________
lstm2 (CuDNNLSTM)            (None, 64)                49664     
_________________________________________________________________
dense1 (Dense)               (None, 128)               8320      
_________________________________________________________________
dense2 (Dense)               (None, 64)                8256      
_________________________________________________________________
output (Dense)               (None, 1)                 65        
=================================================================
Total params: 139,009
Trainable params: 139,009
Non-trainable params: 0
_________________________________________________________________

And to visualize this (in Jupyter):

from keras.utils import plot_model
plot_model(model, show_shapes=True, to_file="lstm.png")
from IPython.display import Image
Image("lstm.png")

Giving us:

This shows how the LSTM parameters affect the basic structure of the network.

  • input_signal: layer shows the shape of the input data: 160 time steps, 12 features. None for any number of rows (observations).

  • lstm1: 128 LSTM units, with return_sequences=True. Now the output is (None, 160, 128), where 128 matches the number of LSTM units, and replaces the number of features in the input. So 128 features, each one produced by a single LSTM "unit".

  • lstm2: 64 LSTM units, with return_sequences=False. Output shape is (None, 64). Outputting values at every timestep is disabled (return_sequences=False), so this is just 2 dimensional output. 64 features, one for each LSTM "unit". This for each row in the input as it comes through lstm1.

  • dense1, dense2, output: regular dense network layers.

And the same for a bi-directional LSTM:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_signal (InputLayer)    (None, 160, 60)           0         
_________________________________________________________________
bi1 (Bidirectional)          (None, 160, 256)          194560    
_________________________________________________________________
bi2 (Bidirectional)          (None, 128)               164864    
_________________________________________________________________
dense1 (Dense)               (None, 128)               16512     
_________________________________________________________________
dense2 (Dense)               (None, 64)                8256      
_________________________________________________________________
output (Dense)               (None, 1)                 65        
=================================================================
Total params: 384,257
Trainable params: 384,257
Non-trainable params: 0
_________________________________________________________________

So bi-directional doubles the number of features the LSTM layers produce, as it goes both forward and backward across the timesteps.

Training and predicting

Training and predicting with the LSTM model is no different from others, but a quick look anyway.

Training:

splits = list(StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=2019).split(X, y))
preds_val = []
y_val = []
for idx, (train_idx, val_idx) in enumerate(splits):
    K.clear_session()
    print("Beginning fold {}".format(idx+1))
    train_X, train_y, val_X, val_y = X[train_idx], y[train_idx], X[val_idx], y[val_idx]
    model = model_lstm(train_X.shape)
    ckpt = ModelCheckpoint('weights_{}.h5'.format(idx), save_best_only=True, save_weights_only=True, verbose=1, monitor='val_matthews_correlation', mode='max')
    model.fit(train_X, train_y, batch_size=128, epochs=50, validation_data=[val_X, val_y], callbacks=[ckpt])
    model.load_weights('weights_{}.h5'.format(seed, idx))

Prediction:

preds = []
for i in range(N_SPLITS):
    model.load_weights('weights_{}.h5'.format(i))
    pred = model.predict(X_test_input, batch_size=300, verbose=1)
    pred_3 = []
    for pred_scalar in pred:
        for i in range(3):
            pred_3.append(pred_scalar)
    preds.append(pred_3)
threshold = 0.5
preds_test = (np.squeeze(np.mean(preds_test, axis=0)) > threshold).astype(np.int)
submission['target'] = preds
submission.to_csv('submission_{}.csv'.format(seed), index=False)
submission.head()

With this, I managed to train the LSTM and have it produce actually meaningful results. Not that I got anywhere in the competition but it was a good experience for me, since my first tries I produced some completely broken input data, formatted wrong and all. The training with that broken data format delivered zero results, which might be expected since the features were all wrong, and data for them was mixed up. I feel I mostly learned to use an LSTM, even if I failed the Kaggle competition for the smallest fractions of target metric.

From all this, I wandered a bit to look at other things, such as what is the actual structure of an LSTM network when build with Keras, how to combine other types of data and layers alongside LSTM in the same network, and what is adversarial validation. Maybe I will manage to write a bit about them next.

3 thoughts on “Learning to LSTM

    1. I guess I could if it is of interest. I will look at cleaning it up a bit and update with link to github when done. Maybe next week…

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s