You've already done some amazing work with generating new songs, but so far we've seen some issues with repetition and a fair amount of incoherence. By using more data and further tweaking the model, you'll be able to get improved results. We'll once again use the Kaggle Song Lyrics Dataset here.
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Other imports for processing data
import string
import numpy as np
import pandas as pd
As noted above, we'll use the Song Lyrics dataset from Kaggle again; download a copy with:
!wget --no-check-certificate \
https://drive.google.com/uc?id=1LiJFZd41ofrWoBtW-pMYsfz1w8Ny0Bj8 \
-O /tmp/songdata.csv
We've now seen how a model trained on just a small sample of songs often descends into repetition the further it gets in generating new text. Let's switch to using the first 250 songs instead and see whether the output improves. That works out to nearly 10K lines of lyrics, which should be sufficient.
Note that we won't use the full dataset here, as it would take quite a bit of RAM and processing time, but you're welcome to try on your own later. If you do, you'll likely want to limit the Tokenizer to only the most common words, which shrinks both processing time and memory (otherwise the one-hot output layer would span a vocabulary hundreds of thousands of words wide).
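To see what that num_words cap does, here's a minimal sketch with made-up toy lyrics: the Tokenizer still records every word in its word_index, but texts_to_sequences only emits indices below the cap.

from tensorflow.keras.preprocessing.text import Tokenizer

toy = Tokenizer(num_words=3)
toy.fit_on_texts(['la la la da da dee'])
print(toy.word_index)                         # {'la': 1, 'da': 2, 'dee': 3}
print(toy.texts_to_sequences(['la da dee']))  # [[1, 2]] - 'dee' is cut off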
def tokenize_corpus(corpus, num_words=-1):
    # Fit a Tokenizer on the corpus, optionally capping the vocabulary size
    if num_words > -1:
        tokenizer = Tokenizer(num_words=num_words)
    else:
        tokenizer = Tokenizer()
    tokenizer.fit_on_texts(corpus)
    return tokenizer
def create_lyrics_corpus(dataset, field):
    # Remove all punctuation (regex=True is needed on newer pandas versions)
    dataset[field] = dataset[field].str.replace('[{}]'.format(string.punctuation), '', regex=True)
    # Make it lowercase
    dataset[field] = dataset[field].str.lower()
    # Concatenate into one long string, then split into individual lines
    lyrics = dataset[field].str.cat()
    corpus = lyrics.split('\n')
    # Remove any trailing whitespace
    for l in range(len(corpus)):
        corpus[l] = corpus[l].rstrip()
    # Remove any empty lines
    corpus = [l for l in corpus if l != '']
    return corpus
# Read the dataset from csv - this time with 250 songs
dataset = pd.read_csv('/tmp/songdata.csv', dtype=str)[:250]
# Create the corpus using the 'text' column containing lyrics
corpus = create_lyrics_corpus(dataset, 'text')
# Tokenize the corpus
tokenizer = tokenize_corpus(corpus, num_words=2000)
total_words = tokenizer.num_words
# The vocabulary is capped at the 2,000 most common words - far more than before
print(total_words)
sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    # Build every n-gram prefix of the line: [w1, w2], [w1, w2, w3], ...
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        sequences.append(n_gram_sequence)
# Pad sequences for equal input length
max_sequence_len = max([len(seq) for seq in sequences])
sequences = np.array(pad_sequences(sequences, maxlen=max_sequence_len, padding='pre'))
# Split sequences between the "input" sequence and "output" predicted word
input_sequences, labels = sequences[:,:-1], sequences[:,-1]
# One-hot encode the labels
one_hot_labels = tf.keras.utils.to_categorical(labels, num_classes=total_words)
With more data, we'll cut off after 100 epochs to avoid keeping you here all day. You'll also want to change your runtime type to GPU if you haven't already (you'll need to re-run the above cells if you change runtimes).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
model = Sequential()
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(20)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(input_sequences, one_hot_labels, epochs=100, verbose=1)
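If you'd rather not hand-pick the epoch count, one option (an addition of mine, not part of the original notebook) is Keras's EarlyStopping callback. Since we aren't holding out validation data, it monitors training accuracy:

from tensorflow.keras.callbacks import EarlyStopping

# Stop once accuracy hasn't improved for 5 straight epochs
early_stop = EarlyStopping(monitor='accuracy', patience=5, restore_best_weights=True)
history = model.fit(input_sequences, one_hot_labels, epochs=100,
                    verbose=1, callbacks=[early_stop])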
import matplotlib.pyplot as plt
def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.show()
plot_graphs(history, 'accuracy')
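It's also worth confirming that the loss is still falling rather than plateauing:

plot_graphs(history, 'loss')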
This time around, we should be able to get a more interesting output with less repetition.
seed_text = "im feeling chills"
next_words = 100
for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # Always take the single highest-probability class as the next word
    predicted = np.argmax(model.predict(token_list), axis=-1)[0]
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            output_word = word
            break
    seed_text += " " + output_word
print(seed_text)
In running the above, you may notice that the same seed text generates nearly the same output every time. That's because the code always chooses the single top predicted class as the next word. What if you wanted more variance? Rather than taking the argmax of model.predict, we can keep the full array of class probabilities it returns and feed them to np.random.choice, sampling the next word in proportion to its predicted probability and adding a bit of randomness to the output. (Older versions of Keras exposed this distinction as model.predict_classes versus model.predict_proba; both have since been removed, but model.predict gives us the probabilities directly.)
# Test the method with just the first word after the seed text
seed_text = "im feeling chills"
next_words = 100
token_list = tokenizer.texts_to_sequences([seed_text])[0]
token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
predicted_probs = model.predict(token_list)[0]
# Sample an index according to the predicted probability distribution
predicted = np.random.choice([x for x in range(len(predicted_probs))],
                             p=predicted_probs)
# Running this cell multiple times should get you some variance in output
print(predicted)
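To see which word that sampled index corresponds to, you can use the tokenizer's built-in reverse mapping rather than scanning word_index:

# Index 0 is reserved for padding and has no word, hence the fallback
print(tokenizer.index_word.get(predicted, ''))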
# Use this process for the full output generation
seed_text = "im feeling chills"
next_words = 100
for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    predicted_probs = model.predict(token_list)[0]
    # Sample the next word's index rather than always taking the argmax
    predicted = np.random.choice([x for x in range(len(predicted_probs))],
                                 p=predicted_probs)
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            output_word = word
            break
    seed_text += " " + output_word
print(seed_text)
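If you want finer control over how adventurous the sampling is, a common extension (not covered here, so treat this as a sketch) is temperature scaling: rescale the probabilities before sampling, with temperatures below 1 sharpening the distribution and temperatures above 1 flattening it.

def sample_with_temperature(probs, temperature=1.0):
    # Rescale the distribution; the small constant avoids log(0)
    logits = np.log(probs + 1e-9) / temperature
    scaled = np.exp(logits)
    scaled /= np.sum(scaled)
    return np.random.choice(len(scaled), p=scaled)

# e.g. replace the np.random.choice call above with:
# predicted = sample_with_temperature(predicted_probs, temperature=0.8)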