What's in a (sub)word?

In this colab, we'll work with subwords, the smaller pieces that longer words are broken into, and see how that affects our network and the related embeddings.

In [1]:
import tensorflow as tf

from tensorflow.keras.preprocessing.sequence import pad_sequences

Get the original dataset

We'll once again use the dataset containing Amazon and Yelp reviews. This dataset was originally extracted from here.

In [2]:
!wget --no-check-certificate \
    https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P \
    -O /tmp/sentiment.csv
--2020-08-08 19:23:05--  https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P
Resolving drive.google.com (drive.google.com)... 172.217.218.139, 172.217.218.113, 172.217.218.100, ...
Connecting to drive.google.com (drive.google.com)|172.217.218.139|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-08-ak-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/r5thfn1geqg5cnatq2k6oecuqrjdndq2/1596914550000/11118900490791463723/*/13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P [following]
Warning: wildcards not supported in HTTP.
--2020-08-08 19:23:05--  https://doc-08-ak-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/r5thfn1geqg5cnatq2k6oecuqrjdndq2/1596914550000/11118900490791463723/*/13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P
Resolving doc-08-ak-docs.googleusercontent.com (doc-08-ak-docs.googleusercontent.com)... 108.177.127.132, 2a00:1450:4013:c07::84
Connecting to doc-08-ak-docs.googleusercontent.com (doc-08-ak-docs.googleusercontent.com)|108.177.127.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 127831 (125K) [text/csv]
Saving to: ‘/tmp/sentiment.csv’

/tmp/sentiment.csv  100%[===================>] 124.83K  --.-KB/s    in 0.001s  

2020-08-08 19:23:06 (91.0 MB/s) - ‘/tmp/sentiment.csv’ saved [127831/127831]

In [3]:
import pandas as pd

dataset = pd.read_csv('/tmp/sentiment.csv')

# Just extract out sentences and labels first - we will create subwords here
sentences = dataset['text'].tolist()
labels = dataset['sentiment'].tolist()

Create a subwords dataset

We can use the existing Amazon and Yelp reviews dataset with the SubwordTextEncoder functionality from tensorflow_datasets. SubwordTextEncoder.build_from_corpus() will create a tokenizer for us. You could also use this functionality to build subwords from a much larger corpus of text, but we'll stick with our existing dataset here.

The Amazon and Yelp dataset we are using isn't very large, so we'll target a subword vocab_size of only 1,000, and cap each subword at a maximum of 5 characters.

Check out the related documentation here.

In [4]:
import tensorflow_datasets as tfds

vocab_size = 1000
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(sentences, vocab_size, max_subword_length=5)
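If you want a peek at what the tokenizer learned, you can inspect it directly. This is an optional sketch (not part of the original notebook) that assumes the SubwordTextEncoder's vocab_size and subwords attributes:

In [ ]:
# Optional sketch: inspect the learned subword vocabulary
print(tokenizer.vocab_size)      # actual vocabulary size (close to the 1,000 we targeted)
print(tokenizer.subwords[:20])   # the first few learned subwords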
In [5]:
# Check that the tokenizer works appropriately
num = 5
print(sentences[num])
encoded = tokenizer.encode(sentences[num])
print(encoded)
# Separately print out each subword, decoded
for i in encoded:
  print(tokenizer.decode([i]))
I have to jiggle the plug to get it to line up right to get decent volume.
[4, 31, 6, 849, 162, 450, 12, 1, 600, 438, 775, 6, 175, 14, 6, 55, 213, 159, 474, 775, 6, 175, 614, 380, 295, 148, 72, 789]
I 
have 
to 
j
ig
gl
e 
the 
pl
ug
 
to 
get 
it 
to 
li
ne 
up 
right
 
to 
get 
dec
ent 
vo
lu
me
.

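Because SubwordTextEncoder is designed to be reversible, decoding the whole sequence should give back the original sentence. Here's a quick optional sanity check (a sketch, not part of the original notebook):

In [ ]:
# Optional sketch: decoding the full encoded sequence should reproduce the original text
print(tokenizer.decode(encoded))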
Replace sentence data with encoded subwords

Now, we'll re-create the dataset to be used for training by encoding each of the individual sentences. This is equivalent to texts_to_sequences with the Tokenizer we used in earlier exercises.

In [6]:
for i, sentence in enumerate(sentences):
  sentences[i] = tokenizer.encode(sentence)
In [7]:
# Check the sentences are appropriately replaced
print(sentences[1])
[625, 677, 626, 274, 380, 633, 148, 844, 789]

Final pre-processing

Before training, we still need to pad the sequences and split them into training and test sets.

In [8]:
import numpy as np

max_length = 50
trunc_type='post'
padding_type='post'

# Pad all sentences
sentences_padded = pad_sequences(sentences, maxlen=max_length, 
                                 padding=padding_type, truncating=trunc_type)

# Separate out the sentences and labels into training and test sets
training_size = int(len(sentences) * 0.8)

training_sentences = sentences_padded[0:training_size]
testing_sentences = sentences_padded[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

# Make labels into numpy arrays for use with the network later
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)
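As a quick optional check (a sketch, not in the original notebook), you can print the shapes to confirm the 80/20 split and the padded length of 50:

In [ ]:
# Optional sketch: confirm the split sizes and padded sequence length
print(training_sentences.shape, testing_sentences.shape)
print(training_labels_final.shape, testing_labels_final.shape)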

Train a Sentiment Model

In [9]:
embedding_dim = 16
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 50, 16)            16000     
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 6)                 102       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7         
=================================================================
Total params: 16,109
Trainable params: 16,109
Non-trainable params: 0
_________________________________________________________________
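The parameter counts line up with the layer shapes: the Embedding layer holds 1,000 × 16 = 16,000 weights, the first Dense layer has 16 × 6 weights plus 6 biases = 102, and the output Dense layer has 6 × 1 weights plus 1 bias = 7, for 16,109 total.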
In [10]:
num_epochs = 30
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
history = model.fit(training_sentences, training_labels_final, epochs=num_epochs, 
                    validation_data=(testing_sentences, testing_labels_final))
Epoch 1/30
50/50 [==============================] - 0s 7ms/step - loss: 0.6924 - accuracy: 0.5028 - val_loss: 0.6949 - val_accuracy: 0.4286
Epoch 2/30
50/50 [==============================] - 0s 4ms/step - loss: 0.6873 - accuracy: 0.5556 - val_loss: 0.6929 - val_accuracy: 0.4762
Epoch 3/30
50/50 [==============================] - 0s 3ms/step - loss: 0.6808 - accuracy: 0.6033 - val_loss: 0.6872 - val_accuracy: 0.5338
Epoch 4/30
50/50 [==============================] - 0s 4ms/step - loss: 0.6708 - accuracy: 0.6610 - val_loss: 0.6796 - val_accuracy: 0.5689
Epoch 5/30
50/50 [==============================] - 0s 3ms/step - loss: 0.6558 - accuracy: 0.7087 - val_loss: 0.6692 - val_accuracy: 0.5815
Epoch 6/30
50/50 [==============================] - 0s 4ms/step - loss: 0.6356 - accuracy: 0.7476 - val_loss: 0.6459 - val_accuracy: 0.7068
Epoch 7/30
50/50 [==============================] - 0s 4ms/step - loss: 0.6092 - accuracy: 0.7740 - val_loss: 0.6296 - val_accuracy: 0.6992
Epoch 8/30
50/50 [==============================] - 0s 4ms/step - loss: 0.5782 - accuracy: 0.7972 - val_loss: 0.6071 - val_accuracy: 0.7343
Epoch 9/30
50/50 [==============================] - 0s 4ms/step - loss: 0.5465 - accuracy: 0.8249 - val_loss: 0.5954 - val_accuracy: 0.7093
Epoch 10/30
50/50 [==============================] - 0s 3ms/step - loss: 0.5121 - accuracy: 0.8293 - val_loss: 0.5680 - val_accuracy: 0.7719
Epoch 11/30
50/50 [==============================] - 0s 4ms/step - loss: 0.4798 - accuracy: 0.8431 - val_loss: 0.5463 - val_accuracy: 0.7870
Epoch 12/30
50/50 [==============================] - 0s 4ms/step - loss: 0.4488 - accuracy: 0.8600 - val_loss: 0.5430 - val_accuracy: 0.7519
Epoch 13/30
50/50 [==============================] - 0s 4ms/step - loss: 0.4205 - accuracy: 0.8619 - val_loss: 0.5420 - val_accuracy: 0.7293
Epoch 14/30
50/50 [==============================] - 0s 3ms/step - loss: 0.3951 - accuracy: 0.8719 - val_loss: 0.5197 - val_accuracy: 0.7594
Epoch 15/30
50/50 [==============================] - 0s 3ms/step - loss: 0.3717 - accuracy: 0.8738 - val_loss: 0.5077 - val_accuracy: 0.7569
Epoch 16/30
50/50 [==============================] - 0s 3ms/step - loss: 0.3507 - accuracy: 0.8845 - val_loss: 0.5050 - val_accuracy: 0.7544
Epoch 17/30
50/50 [==============================] - 0s 4ms/step - loss: 0.3335 - accuracy: 0.8807 - val_loss: 0.5164 - val_accuracy: 0.7444
Epoch 18/30
50/50 [==============================] - 0s 4ms/step - loss: 0.3154 - accuracy: 0.8945 - val_loss: 0.4998 - val_accuracy: 0.7569
Epoch 19/30
50/50 [==============================] - 0s 4ms/step - loss: 0.3006 - accuracy: 0.9033 - val_loss: 0.4958 - val_accuracy: 0.7519
Epoch 20/30
50/50 [==============================] - 0s 3ms/step - loss: 0.2855 - accuracy: 0.9090 - val_loss: 0.5263 - val_accuracy: 0.7393
Epoch 21/30
50/50 [==============================] - 0s 3ms/step - loss: 0.2754 - accuracy: 0.9109 - val_loss: 0.5227 - val_accuracy: 0.7494
Epoch 22/30
50/50 [==============================] - 0s 4ms/step - loss: 0.2631 - accuracy: 0.9090 - val_loss: 0.5158 - val_accuracy: 0.7494
Epoch 23/30
50/50 [==============================] - 0s 4ms/step - loss: 0.2516 - accuracy: 0.9140 - val_loss: 0.4993 - val_accuracy: 0.7619
Epoch 24/30
50/50 [==============================] - 0s 3ms/step - loss: 0.2421 - accuracy: 0.9222 - val_loss: 0.5272 - val_accuracy: 0.7444
Epoch 25/30
50/50 [==============================] - 0s 3ms/step - loss: 0.2328 - accuracy: 0.9253 - val_loss: 0.5135 - val_accuracy: 0.7519
Epoch 26/30
50/50 [==============================] - 0s 4ms/step - loss: 0.2258 - accuracy: 0.9215 - val_loss: 0.5395 - val_accuracy: 0.7419
Epoch 27/30
50/50 [==============================] - 0s 3ms/step - loss: 0.2153 - accuracy: 0.9303 - val_loss: 0.5309 - val_accuracy: 0.7519
Epoch 28/30
50/50 [==============================] - 0s 4ms/step - loss: 0.2088 - accuracy: 0.9341 - val_loss: 0.5453 - val_accuracy: 0.7419
Epoch 29/30
50/50 [==============================] - 0s 3ms/step - loss: 0.2013 - accuracy: 0.9397 - val_loss: 0.5778 - val_accuracy: 0.7293
Epoch 30/30
50/50 [==============================] - 0s 4ms/step - loss: 0.1938 - accuracy: 0.9341 - val_loss: 0.5421 - val_accuracy: 0.7619
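If you want a single summary number for the held-out reviews after training, model.evaluate reports the compiled loss and accuracy metrics (an optional sketch, not part of the original notebook):

In [ ]:
# Optional sketch: evaluate the trained model on the test set
loss, accuracy = model.evaluate(testing_sentences, testing_labels_final, verbose=0)
print(f"Test loss: {loss:.4f}, test accuracy: {accuracy:.4f}")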

Visualize the Training Graph

We can visualize the training graph below again. Does there appear to be a difference in how validation accuracy and loss are trending compared to the full-word model?

In [11]:
import matplotlib.pyplot as plt


def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()
  
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")

Get files for visualizing the network

Once again, you can visualize the embeddings learned for the subwords by running the code below and then heading to http://projector.tensorflow.org/ to upload and view the data.

Note that the code below has a few small changes to handle the different way text is encoded here compared to the built-in Tokenizer used before.

You may get an error like "Number of tensors (999) do not match the number of lines in metadata (992)." As long as you load the vectors first without error and wait a few seconds after this pops up, you will be able to click outside the file load menu and still view the visualization.

In [12]:
# First get the weights of the embedding layer
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)
(1000, 16)
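Each row of this weight matrix is the 16-dimensional embedding the model learned for one subword id. As an optional sketch (the specific id is arbitrary), you can look one up directly:

In [ ]:
# Optional sketch: look at the embedding vector learned for a single subword id
sample_id = 10                          # an arbitrary id within the 1,000-entry vocabulary
print(tokenizer.decode([sample_id]))    # the subword this id maps to
print(weights[sample_id])               # its 16-dimensional embedding vector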
In [13]:
import io

# Write out the embedding vectors and metadata
out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
for word_num in range(0, vocab_size - 1):
  word = tokenizer.decode([word_num])
  embeddings = weights[word_num]
  out_m.write(word + "\n")
  out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()
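If you hit the tensor/metadata mismatch mentioned above, it can help to count the lines each file actually contains before uploading (an optional sketch; it doesn't change the files, it just shows whether the two files have matching line counts locally):

In [ ]:
# Optional sketch: compare the number of vector rows to metadata lines
with open('vecs.tsv') as f:
  print('vector rows:', sum(1 for _ in f))
with open('meta.tsv') as f:
  print('metadata lines:', sum(1 for _ in f))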
In [14]:
# Download the files
try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download('vecs.tsv')
  files.download('meta.tsv')