# Importing tools
from keras.datasets import imdb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding, Conv1D, GlobalMaxPooling1D
For this task we will be using the Large Movie Review Dataset. This dataset contains 50,000 reviews from IMDb, split evenly between positive and negative reviews. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10; neutral reviews are not included. The dataset is divided into training and test sets of 25,000 labeled reviews each. Conveniently, Keras ships this dataset built in, so we can load it directly.
# load the data, keeping only the 5,000 most frequent words
data = imdb.load_data(num_words=5000)
UNPACK THE TRAIN AND TEST SPLITS
(X_train, Y_train), (X_test, Y_test) = data
DISPLAY A SAMPLE
print('========= Review =========')
print(X_train[6])
print('========= Label ==========')
print(Y_train[6])
The data has already been pre-processed: every word in a review has been mapped to an integer that reflects the word's frequency rank. The integer 0 is reserved for padding, 1 for the start marker, and 2 for unknown words, and the ranks themselves are offset by 3, so 4 represents the most frequent word, 5 the second most frequent, and so on. The label is also an integer (0 for negative, 1 for positive).
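To make this concrete, here is a minimal sketch that decodes our sample review back into words. It assumes the default Keras offsets described above; imdb.get_word_index() returns the raw word-to-rank mapping.

# decode a review: get_word_index() maps each word to its frequency rank,
# and the dataset shifts every rank by 3 to make room for the reserved indices
word_index = imdb.get_word_index()
reverse_index = {rank + 3: word for word, rank in word_index.items()}
reverse_index.update({0: '<pad>', 1: '<start>', 2: '<unk>'})
print(' '.join(reverse_index.get(i, '<unk>') for i in X_train[6]))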
In order to feed this data into our network, all input documents must have the same length. Since the reviews vary heavily in length, we will truncate each review to 500 words (by default, pad_sequences drops words from the beginning of longer reviews) and pad shorter reviews with zeros. Keras offers a set of preprocessing routines that can easily do this for us: the pad_sequences function.
X_train = sequence.pad_sequences(X_train, maxlen=500)  # pad/truncate every review to exactly 500 integers
X_test = sequence.pad_sequences(X_test, maxlen=500)    # using pad_sequences from keras.preprocessing.sequence
print(X_train.shape)  # check the shapes: (25000, 500)
print(X_test.shape)   # (25000, 500)
MODEL PARAMETERS
input_dim = 5000: the size of the vocabulary, matching num_words above
output_dim = 32: the dimension of the embedding vectors; we use 32 for this task, but other values are worth trying
input_length = 500: the length of each input sequence after padding
model = Sequential()
model.add(Embedding(5000, 32, input_length=500))  # learn a 32-dimensional vector per word
model.add(Conv1D(128, 3, padding='valid', activation='relu', strides=1))  # 128 filters over 3-word windows
model.add(GlobalMaxPooling1D())  # keep the strongest response of each filter
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))  # probability that the review is positive
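A quick sanity check before training: model.summary() prints each layer with its output shape and parameter count.

model.summary()  # inspect layer shapes and parameter counts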
We will be tuning the parameters below to get better results from our model (it's a pity that I can't show you all the trouble that I went through).
PARAMETERS
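One aid for this kind of experimentation (my own suggestion, not something the run below uses) is Keras's EarlyStopping callback, which halts training once the validation metric stops improving; it would be passed to model.fit via callbacks=[early_stop].

from keras.callbacks import EarlyStopping
# stop training if validation accuracy has not improved for 2 epochs
early_stop = EarlyStopping(monitor='val_acc', patience=2)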
# compile the model and set the training parameters
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
hist = model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=10, batch_size=64)
print(hist.history['acc'])      # training accuracy per epoch
print(hist.history['val_acc'])  # validation accuracy per epoch
# plot the accuracies at each epoch
train_accuracies = hist.history['acc']
val_accuracies = hist.history['val_acc']
plt.figure()
plt.plot(train_accuracies, c='r', label='acc')
plt.plot(val_accuracies, c='b', label='val_acc')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()
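The history object also records the loss at each epoch under the 'loss' and 'val_loss' keys; plotting those the same way is a quick extra check for overfitting.

plt.figure()
plt.plot(hist.history['loss'], c='r', label='loss')
plt.plot(hist.history['val_loss'], c='b', label='val_loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()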
# evaluate the model on the test set; not too bad
scores = model.evaluate(X_test, Y_test)
print('Test accuracy:', scores[1])
prediction = model.predict(X_test)
y_pred = (prediction > 0.5)  # threshold the sigmoid outputs at 0.5
from sklearn.metrics import f1_score, confusion_matrix
print('F1-score: {0}'.format(f1_score(Y_test, y_pred)))  # true labels first, then predictions
print('Confusion matrix:')
print(confusion_matrix(Y_test, y_pred))
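As a final illustration (a sketch of my own, not part of the original run), we can score a brand-new review. The hypothetical encode_review helper mirrors the dataset's encoding: frequency rank plus the offset of 3, unknown or out-of-vocabulary words mapped to 2, a start marker of 1, and zero-padding to 500 tokens.

# encode_review is a hypothetical helper reproducing the dataset's encoding
word_index = imdb.get_word_index()

def encode_review(text, num_words=5000, maxlen=500):
    ids = [1]  # start marker
    for w in text.lower().split():
        rank = word_index.get(w)
        # keep words inside the 5,000-word vocabulary; map the rest to <unk> (2)
        ids.append(rank + 3 if rank is not None and rank + 3 < num_words else 2)
    return sequence.pad_sequences([ids], maxlen=maxlen)

sample = encode_review('this movie was a wonderful surprise')
print(model.predict(sample))  # values near 1 suggest a positive review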