In this colab, we'll work with subwords, the pieces that make up larger words, and see how that impacts our network and the related embeddings. For example, a subword tokenizer might break a word like "tokenization" into pieces such as "token" and "ization", so even words it has never seen can be represented from known pieces.
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
!wget --no-check-certificate \
https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P \
-O /tmp/sentiment.csv
import pandas as pd
dataset = pd.read_csv('/tmp/sentiment.csv')
# Just extract out sentences and labels first - we will create subwords here
sentences = dataset['text'].tolist()
labels = dataset['sentiment'].tolist()
We can build a subword tokenizer from the existing Amazon and Yelp reviews dataset using tensorflow_datasets's SubwordTextEncoder functionality. SubwordTextEncoder.build_from_corpus() will create the tokenizer for us. You could also use this functionality to derive subwords from a much larger corpus of text, but we'll just use our existing dataset here.
The Amazon and Yelp dataset we are using isn't very large, so we'll use a subword vocab_size of only 1,000 tokens, as well as cutting off each subword at a maximum of 5 characters. Check out the SubwordTextEncoder documentation for the related details.
import tensorflow_datasets as tfds
vocab_size = 1000
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(sentences, vocab_size, max_subword_length=5)
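One note in case you hit an import error here: in more recent releases of tensorflow_datasets, this class has moved and is available as tfds.deprecated.text.SubwordTextEncoder instead of tfds.features.text.SubwordTextEncoder.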
# Check that the tokenizer works appropriately
num = 5
print(sentences[num])
encoded = tokenizer.encode(sentences[num])
print(encoded)
# Separately print out each subword, decoded
for i in encoded:
    print(tokenizer.decode([i]))
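As an extra sanity check (our addition, not part of the original exercise), subword encoding is designed to be fully invertible, so decoding the whole encoded sequence should reproduce the original sentence exactly:
# Round-trip check: decoding the encoded ids should give back the original text
assert tokenizer.decode(encoded) == sentences[num]
print("Round trip OK")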
Now, we'll re-create the dataset to be used for training by actually encoding each of the individual sentences. This is the equivalent of texts_to_sequences with the Tokenizer we used in earlier exercises.
for i, sentence in enumerate(sentences):
    sentences[i] = tokenizer.encode(sentence)
# Check the sentences are appropriately replaced
print(sentences[1])
Before training, we still need to pad the sequences, as well as split into training and test sets.
import numpy as np
max_length = 50
trunc_type='post'
padding_type='post'
# Pad all sentences
sentences_padded = pad_sequences(sentences, maxlen=max_length,
padding=padding_type, truncating=trunc_type)
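# Optional sanity check (our addition): the padded result should be a 2-D
# array of shape (number of sentences, max_length)
print(sentences_padded.shape)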
# Separate out the sentences and labels into training and test sets
training_size = int(len(sentences) * 0.8)
training_sentences = sentences_padded[0:training_size]
testing_sentences = sentences_padded[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]
# Make labels into numpy arrays for use with the network later
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)
embedding_dim = 16
model = tf.keras.Sequential([
tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
tf.keras.layers.GlobalAveragePooling1D(),
tf.keras.layers.Dense(6, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()
num_epochs = 30
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(training_sentences, training_labels_final, epochs=num_epochs,
validation_data=(testing_sentences, testing_labels_final))
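If you'd like a single final number for the held-out 20% split (a quick addition, not part of the original exercise), model.evaluate reports the loss and accuracy after training:
# Evaluate the trained model on the test split
loss, accuracy = model.evaluate(testing_sentences, testing_labels_final, verbose=0)
print("Test loss:", loss, "Test accuracy:", accuracy)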
We can visualize the training graph below again. Does there appear to be a difference in how validation accuracy and loss are trending compared to the full-word version?
import matplotlib.pyplot as plt
def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_' + string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_' + string])
    plt.show()
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")
Once again, you can visualize the sentiment associated with all of the subwords using the code below and by heading to http://projector.tensorflow.org/ to upload and view the data.
Note that the code below has a few small changes to handle the different way text is encoded here, compared to the built-in Tokenizer we used before.
You may get an error like "Number of tensors (999) do not match the number of lines in metadata (992)." As long as the vectors loaded first without error, wait a few seconds after the message pops up; you can then click outside the file-load menu and still view the visualization.
# First get the weights of the embedding layer
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)
import io
# Write out the embedding vectors and metadata
out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
for word_num in range(0, vocab_size - 1):
    word = tokenizer.decode([word_num])
    embeddings = weights[word_num]
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()
# Download the files
try:
    from google.colab import files
except ImportError:
    pass
else:
    files.download('vecs.tsv')
    files.download('meta.tsv')
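If you'd rather avoid the count mismatch in the projector altogether, one likely cause is subwords that decode to empty strings or contain newline characters, which throws off the line count in meta.tsv. Below is a sketch of a more defensive export loop; the sanitizing choices (replacing whitespace characters, inserting placeholders, skipping the reserved padding id 0) are our own assumptions, not part of the original exercise.
# Sketch: write exactly one sanitized metadata line per embedding vector
out_v = io.open('vecs_clean.tsv', 'w', encoding='utf-8')
out_m = io.open('meta_clean.tsv', 'w', encoding='utf-8')
for word_num in range(1, min(vocab_size, tokenizer.vocab_size)):
    # Replace characters that would break the TSV's line structure
    word = tokenizer.decode([word_num]).replace('\n', ' ').replace('\t', ' ')
    if not word.strip():
        word = '<blank_%d>' % word_num  # placeholder keeps vector/metadata counts aligned
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in weights[word_num]]) + "\n")
out_v.close()
out_m.close()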