A noob’s guide to implementing RNN-LSTM using Tensorflow

Monik — Sun, 19 Jun 2016 19:33:09 +0000

The purpose of this tutorial is to help anybody write their first RNN LSTM model without much background in Artificial Neural Networks or Machine Learning. The discussion is not centered around the theory or working of such networks but on writing code for solving a particular problem. We will understand how neural networks let us solve some problems effortlessly, and how they can be applied to a multitude of other problems.

What are RNNs?

Simple multi-layered neural networks are classifiers: given an input, they tag it as belonging to one of many classes. They are trained using the backpropagation algorithm. These networks are great at what they do, but they cannot handle inputs that come in a sequence. For example, for a neural net to identify the nouns in a sentence, having just the word as input is not helpful at all. A lot of information is present in the context of the word, which can only be determined by looking at the words near it. The entire sequence has to be studied to determine the output. This is where Recurrent Neural Networks (RNNs) find their use. You can read more about the utility of RNNs in Andrej Karpathy’s brilliant blog post.

As the RNN traverses the input sequence, the output for every item also becomes a part of the input for the next item. This is the ‘recurrent’ property of the network: the current input comprises the current item in the sequence and the previous output. When this is done over and over, the last output is the result of all the previous inputs and the last input.
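To make the recurrence concrete, here is a minimal sketch of a single-number RNN in plain Python. The weights `w_x`, `w_h` and `b` are made-up values chosen only for illustration (a real network learns them), and the function names are hypothetical, not part of Tensorflow.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent update: mix the current item with the previous output."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the cell one item at a time."""
    h = 0.0  # state before any input has been seen
    outputs = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)  # previous output feeds the next step
        outputs.append(h)
    return outputs

outputs = run_rnn([0, 1, 1, 0, 1])
final_output = outputs[-1]  # influenced by every item in the sequence
```

Changing an early item in the sequence changes `final_output`, which is exactly the property we rely on for sequence classification.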

What is LSTM?

RNNs are very apt for sequence classification problems, and the reason they’re so good at this is that they’re able to retain important data from the previous inputs and use that information to modify the current output. If the sequences are quite long, however, the gradients (values calculated to tune the network) computed during training (backpropagation) either vanish (from multiplying many values between 0 and 1) or explode (from multiplying many large values), causing the network to train very slowly or unstably.
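A rough way to see the problem is to multiply the same factor 20 times, once per step of our sequence. This is only a scalar illustration with made-up factors (0.5 and 1.5); real gradients involve matrices of derivatives, but the effect is similar.

```python
def repeated_product(factor, steps):
    """Multiply `factor` by itself `steps` times, as happens along a long sequence."""
    result = 1.0
    for _ in range(steps):
        result *= factor
    return result

vanishing = repeated_product(0.5, 20)  # shrinks towards zero
exploding = repeated_product(1.5, 20)  # grows very large
print(vanishing, exploding)
```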

Long Short Term Memory is an RNN architecture which addresses the problem of training over long sequences and retaining memory. LSTMs solve the gradient problem by introducing a few more gates that control access to the cell state. You could refer to Colah’s blog post, which is a great place to understand the working of LSTMs. If you didn’t get what is being discussed, that’s fine and you can safely move to the next part.
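If you are curious what "gates that control access to the cell state" looks like, here is a sketch of one step of the standard textbook LSTM, shrunk down to single numbers. The parameter values and the `lstm_step` helper are assumptions made for illustration; Tensorflow’s own LSTMCell works on vectors and has extra details that are not shown here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM. `p` maps a gate name to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # the new information itself

    c = f * c_prev + i * g            # updated cell state (the memory)
    h = o * math.tanh(c)              # output for this step
    return h, c

# Made-up parameters, identical for every gate, just to run the cell.
params = {name: (0.5, 0.5, 0.0)
          for name in ("forget", "input", "output", "candidate")}

h, c = 0.0, 0.0
for bit in [1, 0, 1, 1]:
    h, c = lstm_step(bit, h, c, params)
```

The cell state `c` is updated by addition rather than by repeated multiplication alone, which is the part that helps gradients survive over long sequences.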

The task

Given a binary string (a string with just 0s and 1s) of length 20, we need to determine the count of 1s in the string. For example, “01010010011011100110” has 10 ones. So the input for our program will be a string of length twenty that contains 0s and 1s and the output must be a single number between 0 and 20 which represents the number of ones in the string. Here is a link to the complete gist, in case you just want to jump to the code.

Even an amateur programmer can’t help but giggle at the task definition. It won’t take anybody more than a minute to write this program and get the correct output on every input (0% error).

input_string = "01010010011011100110"  # example input
count = 0
for i in input_string:
    if i == '1':
        count += 1

Anybody in their right mind would wonder, if it is so easy, why the hell can’t a computer figure it out by itself? Computers aren’t that smart without a human instructor. Computers need to be given precise instructions and the ‘thinking’ has to be done by the human issuing the commands. Machines can repeat the most complicated calculations a gazillion times over but they still fail miserably at things humans do painlessly, like recognizing cats in a picture.

What we plan to do is feed the neural network enough input data and tell it the correct output values for those inputs. After that, we will give it input that it has not seen before and see how many of those the program gets right.

Generating the training input data

Each input is a binary string of length twenty. We will represent it as a python list of 0s and 1s. The input to be used for training will contain many such lists.

import numpy as np
from random import shuffle

train_input = ['{0:020b}'.format(i) for i in range(2**20)]
shuffle(train_input)
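# list(...) keeps this working on Python 3, where map returns an iterator
train_input = [list(map(int, i)) for i in train_input]
ti = []
for i in train_input:
    temp_list = []
    for j in i:
        temp_list.append([j])  # wrap each bit so input_dimension is 1
    ti.append(np.array(temp_list))
train_input = ti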

There can be a total of 2²⁰ ≈ 10⁶ combinations of 1s and 0s in a string of length 20. We generate a list of all 2²⁰ numbers, convert each to its binary string and shuffle the entire list. Each binary string is then converted to a list of 0s and 1s. Tensorflow requires the input as a tensor (a multi-dimensional array) of dimensions [batch_size, sequence_length, input_dimension], i.e. a 3d array. In our case, batch_size is something we’ll determine later, but sequence_length is fixed at 20 and input_dimension is 1 (i.e. each individual bit of the string). Each bit will actually be represented as a list containing just that bit. A list of 20 such lists will form a sequence, which we convert to a numpy array. A list of all such sequences is the value of train_input that we’re trying to compute. If you print the first few values of train_input, it would look like

[
 array([[0],[0],[1],[0],[0],[1],[0],[1],[1],[0],[0],[0],[1],[1],[1],[1],[1],[1],[0],[0]]), 
 array([[1],[1],[0],[0],[0],[0],[1],[1],[1],[1],[1],[0],[0],[1],[0],[0],[0],[1],[0],[1]]), 
 .....
]

Don’t worry about the values if they don’t match yours because they will be different as they are in random order.
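If the [batch_size, sequence_length, input_dimension] layout feels abstract, a quick shape check helps. The sketch below rebuilds the input for strings of length 4 instead of 20 (an assumption made only so it runs instantly) and stacks three sequences into a batch.

```python
import numpy as np

SEQ_LEN = 4  # shortened from 20 for illustration

# All binary strings of length SEQ_LEN, e.g. '0101'
bit_strings = ['{0:0{1}b}'.format(i, SEQ_LEN) for i in range(2 ** SEQ_LEN)]

# Each bit becomes a one-element list, each string a (SEQ_LEN, 1) array
sequences = [np.array([[int(ch)] for ch in s]) for s in bit_strings]

batch = np.array(sequences[:3])  # stack three sequences into one batch
print(batch.shape)               # (3, 4, 1)
```

The three numbers are the batch size, the sequence length and the input dimension, in that order.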

Generating the training output data

For every sequence, the result can be anything between 0 and 20, so we have 21 choices per sequence. Very clearly, our task is a sequence classification problem. Each sequence belongs to the class whose number is the same as the count of ones in the sequence. The representation of the output is a list of length 21 with zeros at all positions except a one at the index of the class to which the sequence belongs.

[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
This is a sample output for a sequence which belongs to 4th class i.e has 4 ones

More formally, this is called the one hot encoded representation.

train_output = []

for i in train_input:
    count = 0
    for j in i:
        if j[0] == 1:
            count+=1
    temp_list = ([0]*21)
    temp_list[count]=1
    train_output.append(temp_list)

For every training input sequence, we generate an equivalent one hot encoded output representation.
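The encoding is easy to round-trip, which is a handy sanity check. Here is a small sketch with two hypothetical helpers, `one_hot` and `decode`, applied to a sequence containing four ones.

```python
def one_hot(index, num_classes=21):
    """Return a list of zeros with a single one at `index`."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec

def decode(vec):
    """Recover the class number from a one hot encoded list."""
    return vec.index(max(vec))

bits = [0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # four ones
label = one_hot(sum(bits))
recovered = decode(label)               # back to the count of ones
```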

Generating the test data

For any supervised machine learning task, we need some data as training data to teach our program to identify the correct outputs and some data as test data to check how our program performs on inputs that it hasn’t seen before. Letting test and training data overlap is self-defeating: if you had already practiced the questions that were to come in your exam, you would most definitely ace it. Currently, in our train_input and train_output, we have 2²⁰ (1,048,576) unique examples. We will split those into two sets, one for training and the other for testing. We will take 10,000 examples (a little under 1% of the entire data) as training data and use the remaining 1,038,576 examples as test data.

NUM_EXAMPLES = 10000
test_input = train_input[NUM_EXAMPLES:] 
test_output = train_output[NUM_EXAMPLES:] #everything beyond 10,000

train_input = train_input[:NUM_EXAMPLES]
train_output = train_output[:NUM_EXAMPLES] #till 10,000
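Because every binary string appears exactly once before the split, slicing the shuffled list cannot put the same example in both sets. A small sketch with shorter strings (length 10 and 100 training examples, both chosen just for speed) confirms this.

```python
from random import shuffle

SEQ_LEN = 10  # shortened from 20 for illustration
examples = ['{0:0{1}b}'.format(i, SEQ_LEN) for i in range(2 ** SEQ_LEN)]
shuffle(examples)

num_train = 100
train, test = examples[:num_train], examples[num_train:]

overlap = set(train) & set(test)  # examples present in both sets
print(len(train), len(test), len(overlap))
```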

Designing the model

This is the most important part of the tutorial. Tensorflow and various other libraries (Theano, Torch, PyBrain) provide tools for users to design the model without getting into the nitty-gritty of implementing the neural network, the optimization or the backpropagation algorithm.

Danijar outlines a great way to organize Tensorflow models, which you might want to use later to tidy up your code. For the purpose of this tutorial, we will skip that and focus on writing code that just works.

Import the required packages to begin with. If you haven’t already installed Tensorflow, follow the instructions on this page and then continue.

import tensorflow as tf

After importing Tensorflow, we will define two variables which will hold the input data and the target data.

data = tf.placeholder(tf.float32, [None, 20,1]) 
target = tf.placeholder(tf.float32, [None, 21])

The dimensions for data are [Batch Size, Sequence Length, Input Dimension]. We leave the batch size unknown, to be determined at runtime. Target will hold the training output data, which are the correct results that we desire. We’ve made Tensorflow placeholders, which are exactly what the name suggests: slots that will be supplied with data later.

Now we will create the RNN cell. Tensorflow provides support for LSTM, GRU (slightly different architecture than LSTM) and simple RNN cells. We’re going to use LSTM for this task.

num_hidden = 24
cell = tf.nn.rnn_cell.LSTMCell(num_hidden,state_is_tuple=True)

For each LSTM cell that we initialise, we need to supply a value for the hidden dimension, or as some people like to call it, the number of units in the LSTM cell. The value is up to you: too high a value may lead to overfitting, while a very low value may yield extremely poor results. As many experts have put it, selecting the right parameters is more of an art than a science.

Before we write any more code, it is imperative to understand how Tensorflow computation graphs work. From a hacker perspective, it is enough to think of it as having two phases. The first phase is building the computation graph where you define all the calculations and functions that you will execute during runtime. The second phase is the execution phase where a Tensorflow session is created and the graph that was defined earlier is executed with the data we supply.

val, state = tf.nn.dynamic_rnn(cell, data, dtype=tf.float32)

We unroll the network and pass the data to it and store the output in val. We also get the state at the end of the dynamic run as a return value but we discard it because every time we look at a new sequence, the state becomes irrelevant for us. Please note, writing this line of code doesn’t mean it is executed. We’re still in the first phase of designing the model. Think of these as functions that are stored in variables which will be invoked when we start a session.

val = tf.transpose(val, [1, 0, 2])
last = tf.gather(val, int(val.get_shape()[0]) - 1)

We transpose the output to switch batch size with sequence size. After that we take the values of outputs only at sequence’s last input, which means in a string of 20 we’re only interested in the output we got at the 20th character and the rest of the output for previous characters is irrelevant here.
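The transpose-then-pick step can be mimicked in plain numpy, which makes the shapes easy to inspect. In this sketch `val` is random data standing in for the RNN output (an assumption, since no graph is run here).

```python
import numpy as np

batch_size, seq_len, num_hidden = 3, 20, 24
val = np.random.rand(batch_size, seq_len, num_hidden)  # stand-in for the RNN output

transposed = np.transpose(val, [1, 0, 2])  # (seq_len, batch_size, num_hidden)
last = transposed[seq_len - 1]             # output at the 20th step, per sequence

# The same values can be read directly from the untransposed array.
same = np.array_equal(last, val[:, -1, :])
print(last.shape, same)
```

So `last` holds one vector of 24 values for each sequence in the batch.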

weight = tf.Variable(tf.truncated_normal([num_hidden, int(target.get_shape()[1])]))
bias = tf.Variable(tf.constant(0.1, shape=[target.get_shape()[1]]))

What we want to do is apply the final transformation to the outputs of the LSTM and map it to the 21 output classes. We define weights and biases, and multiply the output with the weights and add the bias values to it. The dimension of the weights will be num_hidden X number_of_classes. Thus on multiplication with the output (last), the resulting dimension will be batch_size X number_of_classes, which is what we are looking for.
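The dimension bookkeeping can be checked with numpy before touching Tensorflow. The arrays below are random stand-ins with the same shapes as in the tutorial; the batch size of 5 is arbitrary.

```python
import numpy as np

batch_size, num_hidden, num_classes = 5, 24, 21

last = np.random.rand(batch_size, num_hidden)       # stand-in for the LSTM output
weight = np.random.randn(num_hidden, num_classes)   # num_hidden X number_of_classes
bias = np.full(num_classes, 0.1)                    # one bias per class

scores = np.matmul(last, weight) + bias             # bias is added to every row
print(scores.shape)
```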

prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)

After multiplying the output with the weights and adding the bias, we will have a matrix with a variety of different values for each class. What we are interested in is the probability score for each class i.e the chance that the sequence belongs to a particular class. We then calculate the softmax activation to give us the probability scores.

What is this function and why are we using it?

This function takes in a vector of values and returns a probability distribution over its indices: the larger a value, the higher the probability assigned to its index. The returned scores sum to one, which is the final output that we need. If you want to learn more about softmax, head over to this link.
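Softmax itself is only a few lines. Here is a plain Python sketch; subtracting the maximum before exponentiating is a common trick to avoid overflow and does not change the result. The input scores are made up.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest score gets the largest probability, and the ordering of the scores is preserved.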

cross_entropy = -tf.reduce_sum(target * tf.log(tf.clip_by_value(prediction,1e-10,1.0)))

The next step is to calculate the loss or, in less technical words, our degree of incorrectness. We calculate the cross entropy loss (more details here) and use that as our cost function. The cost function will help us determine how poorly or how well our predictions stack up against the actual results. This is the function that we are trying to minimize. If you don’t want to delve into the technical details, it is okay to just understand what the cross entropy loss is calculating. The log term helps us measure the degree to which the network got it right or wrong. Say, for example, the target was 1 and the prediction is close to one: our loss would not be much, because -log(x) is almost 0 when x nears 1. For the same target, if the prediction was 0, the cost would increase by a huge amount because -log(x) is very high when x is close to zero. The log term thus penalizes the model heavily when it is terribly wrong and very little when the prediction is close to the target. The last step in model design is to prepare the optimization function.
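To see the loss in numbers, here is a plain Python sketch of the same formula for a single example, including the clipping to the range [1e-10, 1.0] used in the Tensorflow line. The three-class targets and predictions are made up for illustration.

```python
import math

def cross_entropy(target, prediction, eps=1e-10):
    """-sum(target * log(prediction)), with predictions clipped to [eps, 1]."""
    total = 0.0
    for t, p in zip(target, prediction):
        clipped = min(max(p, eps), 1.0)  # avoids log(0)
        total += t * math.log(clipped)
    return -total

target = [0, 1, 0]
good = cross_entropy(target, [0.05, 0.90, 0.05])  # confident and right
bad = cross_entropy(target, [0.90, 0.00, 0.10])   # confident and wrong
print(good, bad)
```

The first loss is small and the second is large, which is the behaviour described above.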

optimizer = tf.train.AdamOptimizer()
minimize = optimizer.minimize(cross_entropy)

Tensorflow has a few optimization functions like RMSPropOptimizer, AdagradOptimizer, etc. We choose AdamOptimizer and we set minimize to the function that shall minimize the cross_entropy loss that we calculated previously.

Calculating the error on test data

mistakes = tf.not_equal(tf.argmax(target, 1), tf.argmax(prediction, 1))
error = tf.reduce_mean(tf.cast(mistakes, tf.float32))

This error is the fraction of sequences in the test dataset that were classified incorrectly. This gives us an idea of the correctness of the model on the test dataset.
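The same calculation in plain Python may make it clearer what the two Tensorflow lines do. The `argmax` and `error_rate` helpers and the tiny three-class data are hypothetical, for illustration only.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda k: values[k])

def error_rate(targets, predictions):
    """Fraction of examples whose predicted class differs from the target class."""
    mistakes = [argmax(t) != argmax(p) for t, p in zip(targets, predictions)]
    return sum(mistakes) / float(len(mistakes))

targets = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
predictions = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.6, 0.3, 0.1], [0.1, 0.5, 0.4]]

rate = error_rate(targets, predictions)  # one of four is wrong
print(rate)
```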

Execution of the graph

We’re done with designing the model. Now the model is to be executed!

init_op = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init_op)

We start a session and initialize all the variables that we’ve defined. After that, we begin our training process.

batch_size = 1000
no_of_batches = int(len(train_input)/batch_size)
epoch = 5000
for i in range(epoch):
    ptr = 0
    for j in range(no_of_batches):
        inp, out = train_input[ptr:ptr+batch_size], train_output[ptr:ptr+batch_size]
        ptr+=batch_size
        sess.run(minimize,{data: inp, target: out})
    print("Epoch - " + str(i))
incorrect = sess.run(error,{data: test_input, target: test_output})
print('Epoch {:2d} error {:3.1f}%'.format(i + 1, 100 * incorrect))
sess.close()

We decide the batch size and divide the training data accordingly. I’ve fixed the batch size at 1000 but you would want to experiment by changing it to see how it impacts your results and training time.

If you are familiar with stochastic gradient descent, this idea will seem fairly simple. Instead of updating the values after running through all the training samples, we break the training set into smaller batches and update after each one. So every few steps, the network weights are adjusted. Stochastic optimization methods often converge much faster than their full-batch counterparts because they make many more updates per pass over the data, but this may not always be the case.
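The batching itself is just list slicing. Below is a sketch of a hypothetical `iterate_batches` helper, along with the arithmetic for how many weight updates our settings produce. The lists of integers are placeholders for the real training data. Note that, unlike the loop above, this helper would also yield a smaller final batch if the data did not divide evenly.

```python
def iterate_batches(inputs, outputs, batch_size):
    """Yield consecutive (input, output) slices of at most batch_size items."""
    for start in range(0, len(inputs), batch_size):
        end = start + batch_size
        yield inputs[start:end], outputs[start:end]

inputs = list(range(10000))   # placeholder for train_input
outputs = list(range(10000))  # placeholder for train_output
batch_size, epochs = 1000, 5000

batches_per_epoch = sum(1 for _ in iterate_batches(inputs, outputs, batch_size))
total_updates = batches_per_epoch * epochs  # one weight update per batch
print(batches_per_epoch, total_updates)
```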

For every batch, we get the input and output data and we run minimize, the optimizer function, to minimize the cost. All the calculation of prediction, cost and backpropagation is done by Tensorflow. We pass the feed_dict to sess.run along with the function. The feed_dict is a way of supplying data to the Tensorflow placeholders for that run. So we pass the input data along with the target (correct) outputs. The functions that we wrote above are now being executed.

That’s all. We’ve made our toy LSTM-RNN that learns to count just by looking at correct examples! This wasn’t very intuitive to me when I trained it for the first time, so I added this line of code below the error calculation that would print the result for a particular example.

    print(sess.run(prediction,{data: [[[1],[0],[0],[1],[1],[0],[1],[1],[1],[0],[1],[0],[0],[1],[1],[0],[1],[1],[1],[0]]]}))

So as the model trains, you will notice how the probability score at the correct index in the list gradually increases. Here’s a link to the complete gist of the code.

Concerns regarding the training data

Many would ask, why use a training data set which is just 1% of all the data? Well, to be able to train it on a CPU with a single core, a larger training set would increase the training time considerably. You could of course adjust the batch size to keep the number of updates the same, but the final decision is always up to the model designer. Despite everything, you will be surprised with the results when you realize that 1% of the data was enough to let the network achieve stellar results!

Tinkering with the model

You can try changing the parameter values to see how it affects the performance and training time. You can also try adding multiple layers to the RNN to make your model more complex and enable it to learn more features. An important feature you can implement is to add the ability to save the model values after every few iterations and retrieve those values to perform predictions in future. You could also change the cell from LSTM to GRU or a simple RNN cell and compare the performance.

Results

Training the model with 10,000 sequences, a batch size of 1,000 and 5000 epochs on a MacbookPro/8GB/2.4Ghz/i5 with no GPU took me about 3-4 hours. And now the answer to the question everybody is waiting for: how well did it perform?

Epoch 5000 error 0.1%

For the final epoch, the error rate is 0.1% across the entire (almost so because our test data is 99% of all possible combinations) dataset! This is pretty close to what somebody with the least programming skills would have been able to achieve (0% error). But, our neural network figured that out by itself! We did not instruct it to perform any of the counting operations.

If you want to speed up the process, you could try reducing the length of the binary string and adjusting the values elsewhere in the code to make it work.

What can you do now?

Now that you’ve implemented your LSTM model, what else is there that you can do? Sequence classification can be applied to a lot of different problems, like handwritten digit recognition or even autonomous car driving! Think of the rows of the image as individual steps or inputs and the entire image to be the sequence. You must classify the image as belonging to one of the classes which could be to halt, accelerate, turn left, turn right or continue at same speed. Training data could be a stopper but hell, you could even generate it yourself. There is so much more waiting to be done!

*The post has been updated to be compatible with Tensorflow version 0.9 and above.
