Implementing char-RNN from Scratch in PyTorch, and Generating Fake Book Titles

This week, I implemented a character-level recurrent neural network (or char-rnn for short) in PyTorch, and used it to generate fake book titles. The code, training data, and pre-trained models can be found on my GitHub repo.

 
Heart in the Dark
Me the Bean
Be the Life
Yours
 

Model Overview

 
Diagram of the char-rnn network architecture. Source.


 

The char-rnn language model is a recurrent neural network that makes predictions on the character level. In contrast, many language models operate on the word level.

Making character-level predictions can be a bit more chaotic, but might be better for making up fake words (e.g. Harry Potter spells, band names, fake slang, fake cities, fantasy terms, etc.). Word-level language models might have an advantage for generating longer pieces of text, like summaries or fiction, as they don’t need to figure out how to spell, in a sense.
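To make the difference concrete, here is a tiny sketch of how the same title looks to a word-level model versus a character-level model. It is only an illustration using plain Python string operations, not the tokenization code from the repo.

```python
title = "Heart in the Dark"

# Word-level: each token is a whole word, so sequences are short
# but the vocabulary (every distinct word) is large.
word_tokens = title.split()

# Character-level: each token is a single character (spaces included),
# so the vocabulary is tiny but the model has to learn spelling.
char_tokens = list(title)

print(word_tokens)       # ['Heart', 'in', 'the', 'Dark']
print(len(char_tokens))  # 17
```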

There do exist character-word hybrid approaches. For example, the GPT-2 model uses byte pair encoding, an approach that interpolates between the word-level for common sequences and the character-level for rare sequences.
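As a rough illustration of the idea behind byte pair encoding, the sketch below performs a single merge step on a toy corpus: it finds the most frequent adjacent pair of symbols and fuses it into one symbol. This is a simplification that works on characters; GPT-2's tokenizer operates on bytes and applies many thousands of learned merges. The function names and the toy corpus are made up for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Toy corpus: every word starts out as a list of single characters.
words = [list(w) for w in ["dark", "dark", "darn", "bark"]]
pair = most_frequent_pair(words)
words = merge_pair(words, pair)

print(pair)      # ('a', 'r')
print(words[0])  # ['d', 'ar', 'k']
```

Repeating the merge step grows common sequences into word-like units, while rare sequences stay split into small pieces.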

This particular char-rnn implementation is set up to handle multiple categories of text. In this use case, it is able to make predictions for different book genres, e.g. Romance, Fantasy, Young Adult, etc.
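One common way to condition a char-rnn on a category is to concatenate a one-hot category vector with the current character and the hidden state at every step. The sketch below shows a single step of such a network. It is a minimal illustration under that assumption, not necessarily the exact architecture in the repo; the class name and the sizes are hypothetical.

```python
import torch
import torch.nn as nn

class CategoryCharRNN(nn.Module):
    """One step of a char-RNN whose input is conditioned on a genre one-hot."""

    def __init__(self, n_categories, n_chars, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        combined = n_categories + n_chars + hidden_size
        self.i2h = nn.Linear(combined, hidden_size)  # produces the next hidden state
        self.i2o = nn.Linear(combined, n_chars)      # scores for the next character

    def forward(self, category, char, hidden):
        # Genre, current character, and previous state are fed in together.
        combined = torch.cat((category, char, hidden), dim=1)
        hidden = torch.tanh(self.i2h(combined))
        log_probs = torch.log_softmax(self.i2o(combined), dim=1)
        return log_probs, hidden

    def init_hidden(self):
        return torch.zeros(1, self.hidden_size)

n_categories, n_chars = 30, 60  # hypothetical: 30 genres, 60 distinct characters
model = CategoryCharRNN(n_categories, n_chars, hidden_size=128)

category = torch.zeros(1, n_categories)
category[0, 3] = 1.0            # one-hot for an arbitrary genre
char = torch.zeros(1, n_chars)
char[0, 7] = 1.0                # one-hot for an arbitrary character

log_probs, hidden = model(category, char, model.init_hidden())
print(log_probs.shape, hidden.shape)  # torch.Size([1, 60]) torch.Size([1, 128])
```

To generate a title, the step is repeated: sample a character from `log_probs`, feed it back in as the next `char`, and stop at an end-of-title marker.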

Training Data

The training data used for this model is a modified version of a Goodreads data scrape of 20K book titles. I transformed the CSV file into separate text files for the top 30 genres. The resulting split dataset can be found in my GitHub repo.
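The split itself can be sketched roughly as follows. This is not the script from the repo: the column names `title` and `genre` are assumptions about the CSV layout, and the sample rows are placeholders.

```python
import csv
import io
from collections import defaultdict

def group_titles_by_genre(csv_file, top_n=30):
    """Group titles by genre and keep only the `top_n` largest genres."""
    titles = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # "genre" and "title" are assumed column names.
        titles[row["genre"].strip()].append(row["title"].strip())
    top = sorted(titles, key=lambda g: len(titles[g]), reverse=True)[:top_n]
    return {genre: titles[genre] for genre in top}

# Placeholder data standing in for the real CSV file.
sample = io.StringIO(
    "title,genre\n"
    "Example Title One,Fantasy\n"
    "Example Title Two,Romance\n"
    "Example Title Three,Fantasy\n"
)
by_genre = group_titles_by_genre(sample, top_n=1)
print(by_genre)  # {'Fantasy': ['Example Title One', 'Example Title Three']}

# Each genre could then be written to its own file, one title per line:
# for genre, names in by_genre.items():
#     with open(f"{genre}.txt", "w", encoding="utf-8") as f:
#         f.write("\n".join(names))
```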

Training this model took about 20 minutes on an NVIDIA GeForce GTX 1080 Ti GPU. Generating samples only takes a few seconds.

Results

The following results are a selected sampling of outputs. Note that I’m mainly including examples that consist of real words, with a few exceptions.

Romance

Heart in the Dark
Years of the Dark
You the Book
The Stove to the Story

Fantasy

Growing the Dark
Book of the Dark
Red Sande

Fiction

In the Bead Store
Jen the Bead
King the Bean

Historical

A to the Bean
Other and Story

Science Fiction

Darke Sers
Voringe
In the Beantire

Mystery

Bed Singe
Kiss of the Dark
Red Story

Classics

A Mander of the Suckers
Gorden the Story of Merica

Childrens

Dark Book of the Story of the Sures of the Surating
Late
Story of the Bean

Paranormal

A Store of the Store
Red Store
Stariss and Storiss
Wind Store

New Adult

Live Me Life
Growing Me
In the Bean
Me the Bean

Poetry

Yours
Me

Erotica

Volle the Story of Men
King of the Dark
Dork of the Dark
Work of the Dark
Bed Storys of the Dark
Your Mind

Biography

Be the Life
On Anger and Of Mand Anger

Comically, there are many book titles that revolve around beans, beads, stores, and darkness. While I did notice some subtle differences between genres, they don’t appear to be particularly drastic overall.

joke2punchline, punchline2joke: Using a Seq2Seq Neural Network to "Translate" Between Jokes and Punchlines

 
> what do you call an unpredictable chef ?
< ouch .
 

After implementing the seq2seq model, an encoder-decoder network with attention, I wanted to get it to translate between jokes and punchlines. The scripts, pre-trained models, and training data can be found on my GitHub repo.

Model Overview

The underlying model is a PyTorch implementation of a sequence-to-sequence (seq2seq) network, an encoder-decoder network with an attention mechanism. In principle, seq2seq can map any text sequence to any other text sequence. A more practical application would be translating English to French, or vice versa. For this project, I trained the seq2seq model on question-answer format jokes, so that it can output a punchline given a joke, or a joke given a punchline.

Results

Overall, the results were somewhat nonsensical, as one might expect. These results are curated by me based on whether or not they made me, at minimum, smile. Yes, I do laugh at my own jokes.

For the following examples, > represents the text input, < represents the model output. I’ve selected examples where the joke or punchline is not directly from the training set, i.e. excluding any output that is simply being regurgitated from the original dataset.

Joke2Punchline

For the following examples, the first line is a fake joke I wrote up using words within the model’s joke vocabulary and fed into the model (>), and the second line is the punchline outputted by the model (<).

 
> what do you call an unpredictable chef ?
< ouch .

> what do you call a pile of pillowcases ?
< screw music

> why was the sun hospitalized ?
< because he was sitting on me .

> what do you call an unhappy star ?
< stay here !

> what do you call an unhappy star ?
< days numbered !

> what is a strawberry s favorite month ?
< a cat !

> who s there ?
< in the dictionary .

> what is red and bad for your teeth ?
< a a gummy bear

> what treasure can you find on a thursday ?
< the lettuce !

> when is a sun a moon ?
< a barber driver

> how many bananas can the moon eat ?
< fish and the train .

> what do you call an upside down egg ?
< the dough

> why was the sun unhappy ?
< because he wanted to tearable time paste !

> what did the skeleton say when they died the wrong year ?
< it march

> how many snails does it take to get to the moon ?
< to the hot hot shakespeare !

> why was the moon crying ?
< because he was on the deck !

> where do sheep go to school ?
< they take the mile bison of course !

> how many emotions does the sun have ?
< he got cents
 

Punchline2Joke

For the following examples, I fed the model fake punchlines, written using words within the model’s punchline vocabulary, and the model outputted a joke that would result in the input punchline. The first line is the fake punchline I fed into the model (>), and the second line is the joke outputted by the model (<).

 
> two parents
< what has four wheels and flies over the world ?

> watermelon concentrate
< when do you stop at green and go at the orange juice factory ?

> cool space
< what do you call an alligator in a vest with a scoop of ice cream ?

> meteor milk
< what do you call a cow that is crossing ?

> one two three four
< what did the buffalo say to the bartender ?

> jalapeno ketchup
< what do you call a boy with no socks on ?

> ice cream salad !
< what did the fish say to the younger chimney ?

> the impossible !
< what did the worker say when he swam into the wall ?

> both !
< what do you call a ghosts mom and dad ?

> pasta party
< what do you call the sound a dog makes ?

> salad party
< what did the buffalo say to the patella ?

> dreams party
< what do you call the sound with a fever ?

> a thesaurus and a dictionary
< what kind of shorts do all spies wear ?

 

Considerations

Training Data

To train the model, I needed a dataset of clean jokes in question-answer text format.

While I did find a dataset of question-answer format jokes, the jokes are scraped from Reddit’s r/jokes subreddit. Going through the file, I did not like most of the jokes at all, as most of them were highly problematic. They were often racist, sexist, queerphobic, etc., and I would rather compile my own than feed bad data into my model.

One option would be to filter this dataset using a set of “bad” keywords, but trying to filter a heavily biased dataset was less appealing to me than creating a new set entirely. An alternative could be to write a scraper for r/cleanjokes, keeping only question-answer format jokes, but I didn’t want to invest too much time or energy in this toy project, and I personally am not a fan of using Reddit for training data in general.

I ended up compiling my own small dataset of clean jokes in the question-answer format, consisting of a little over 500 jokes total. A major trade-off was that the model’s vocabulary is relatively limited, but I enjoyed the jokes much more and felt much better about the data I was feeding into the model.

Teacher Forcing

For the joke2punchline and punchline2joke models, the teacher forcing ratio was set to 0.5. I’d be curious to adjust this parameter and see the results. I would expect a lower ratio to result in more nonsensical output, whereas a higher ratio would likely result in more outputs that are directly from the training set.
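For readers unfamiliar with the term, here is a minimal sketch of what the teacher forcing ratio controls during training. It assumes one coin flip per training sequence: with teacher forcing, the decoder is fed the ground-truth previous token; without it, the decoder is fed its own previous prediction. The `decoder_step` function and the `toy_step` stand-in are hypothetical placeholders for a real decoder.

```python
import random

def run_decoder(target_tokens, decoder_step, teacher_forcing_ratio=0.5):
    """Decode one sequence, choosing once whether to use teacher forcing."""
    use_teacher_forcing = random.random() < teacher_forcing_ratio
    decoder_input = "<SOS>"
    outputs = []
    for target in target_tokens:
        prediction = decoder_step(decoder_input)
        outputs.append(prediction)
        # Teacher forcing feeds the ground truth; otherwise feed the model's own guess.
        decoder_input = target if use_teacher_forcing else prediction
    return outputs, use_teacher_forcing

def toy_step(token):
    """Stand-in for a real decoder: just marks the token it was given."""
    return token + "'"

forced, _ = run_decoder(["a", "b", "c"], toy_step, teacher_forcing_ratio=1.0)
free, _ = run_decoder(["a", "b", "c"], toy_step, teacher_forcing_ratio=0.0)
print(forced)  # ["<SOS>'", "a'", "b'"]
print(free)    # ["<SOS>'", "<SOS>''", "<SOS>'''"]
```

With a ratio of 1.0 the decoder always sees the training targets as input, and with 0.0 it only ever sees its own outputs, so any early mistake carries through the rest of the sequence.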

I think an ideal setup would be to lower the teacher forcing ratio in addition to having a much larger training set.

Possible Extensions

I do think it would be fun to generate jokes and punchlines using an RNN or LSTM before feeding them into these models, such that there is less human intervention (i.e. writing fake jokes/punchlines manually).

I also think the model would be way more fun to play with if I could train it with a much larger dataset, i.e. 10K+ jokes.

Implementing a Seq2Seq Neural Network with Attention for Machine Translation from Scratch using PyTorch

Continuing with PyTorch implementation projects, last week I used this PyTorch tutorial to implement a sequence-to-sequence (seq2seq) network, an encoder-decoder network with an attention mechanism, and applied it to a French-to-English translation task (and vice versa). The script, pre-trained model, and training data can be found on my GitHub repo.

In the following example, the first line (>) is the French input, the second line (=) is the English ground truth, and the third line (<) is the resulting English translation output from the model.

 
> je n appartiens pas a ce monde .
= i m not from this world .
< i m not from this world .
 

Model Overview

In this particular PyTorch implementation, the network comprises 3 main components:

  • an encoder, which encodes the input text into a vector representation. For this project, the encoder is a recurrent neural network using gated recurrent units (GRUs). For each input word, the encoder outputs a vector and a hidden state, and passes the hidden state along to the next input word.

  • attention, a set of weights that is used during decoding. Attention weights are calculated using a simple feed-forward layer with softmax.

  • a decoder, which takes the encoder output and attention weights to generate a prediction for the next word. In this project, the decoder is a recurrent neural network using GRUs that starts off using the encoder’s last hidden state, which can be interpreted as a context vector for the input, and a start-of-sentence token. For each next word, the decoder uses the attention weights of the current token and the current hidden state to make a prediction with softmax.
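The three components above can be sketched in PyTorch roughly as follows. This is a condensed illustration in the style of the PyTorch tutorial, not a copy of the code in the repo; `MAX_LENGTH`, `SOS_TOKEN`, the vocabulary sizes, and the word indices are assumed values chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_LENGTH = 10  # assumed maximum source sentence length

class EncoderRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, word_index, hidden):
        # One word at a time: shape (seq_len=1, batch=1, hidden_size).
        embedded = self.embedding(word_index).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, vocab_size, max_length=MAX_LENGTH):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.attn = nn.Linear(hidden_size * 2, max_length)
        self.attn_combine = nn.Linear(hidden_size * 2, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, word_index, hidden, encoder_outputs):
        embedded = self.embedding(word_index).view(1, 1, -1)
        # Feed-forward layer + softmax: one weight per source position.
        attn_input = torch.cat((embedded[0], hidden[0]), dim=1)
        attn_weights = F.softmax(self.attn(attn_input), dim=1)
        # Weighted sum of the encoder outputs (the attended context).
        context = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))
        combined = torch.cat((embedded[0], context[0]), dim=1)
        gru_input = F.relu(self.attn_combine(combined)).unsqueeze(0)
        output, hidden = self.gru(gru_input, hidden)
        log_probs = F.log_softmax(self.out(output[0]), dim=1)
        return log_probs, hidden, attn_weights

hidden_size, src_vocab, tgt_vocab = 16, 20, 25  # toy sizes
encoder = EncoderRNN(src_vocab, hidden_size)
decoder = AttnDecoderRNN(hidden_size, tgt_vocab)

source = torch.tensor([4, 7, 2])  # hypothetical word indices for a 3-word sentence
encoder_outputs = torch.zeros(MAX_LENGTH, hidden_size)
hidden = torch.zeros(1, 1, hidden_size)
for i, word in enumerate(source):
    output, hidden = encoder(word, hidden)
    encoder_outputs[i] = output[0, 0]

SOS_TOKEN = 0  # assumed index of the start-of-sentence token
# The decoder starts from the encoder's last hidden state and the SOS token.
log_probs, hidden, attn_weights = decoder(torch.tensor([SOS_TOKEN]), hidden, encoder_outputs)
print(log_probs.shape, attn_weights.shape)  # torch.Size([1, 25]) torch.Size([1, 10])
```

Decoding continues by picking the most likely word from `log_probs`, feeding it back in as the next input, and stopping at an end-of-sentence token.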

The training data comes from the Tatoeba Project and consists of language pairs within a text file. While this model uses the French-to-English data file, the file can easily be replaced with any other language pair file from the collection. As a caveat, I have not tested this on languages that may use different encodings (e.g. Traditional Chinese, Arabic, etc.).
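The examples in this post are lowercased, stripped of accents, and have punctuation split off, which suggests a normalization step along the following lines. This is a sketch of that kind of preprocessing, not the exact code from the repo. It assumes each line of the data file holds an English sentence and a French sentence separated by a tab, and `parse_pair` is a name made up for this example.

```python
import re
import unicodedata

def unicode_to_ascii(text):
    """Strip accents by dropping combining marks after NFD decomposition."""
    return "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if unicodedata.category(ch) != "Mn"
    )

def normalize(text):
    text = unicode_to_ascii(text.lower().strip())
    text = re.sub(r"([.!?])", r" \1", text)  # split off sentence punctuation
    text = re.sub(r"[^a-z.!?]+", " ", text)  # everything else becomes a space
    return text.strip()

def parse_pair(line):
    """Assumes 'english<TAB>french' per line; returns (french, english)."""
    english, french = line.rstrip("\n").split("\t")[:2]
    return normalize(french), normalize(english)

line = "I'm not from this world.\tJe n'appartiens pas à ce monde.\n"
print(parse_pair(line))
# ('je n appartiens pas a ce monde .', 'i m not from this world .')
```

Note that this regular expression keeps only the letters a to z, which is one reason a script with a different alphabet would need its own normalization.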

Results

In the following examples, the first line (>) is the French input, the second line (=) is the English ground truth, and the third line (<) is the resulting English translation output from the model.

Overall, the results are fairly decent considering the small size of the training set, a 9MB text file, or about 1,400 language pairs.

 
> je suis impatient de la prochaine fois .
= i m looking forward to the next time .
< i m looking forward to the next time .

> je n appartiens pas a ce monde .
= i m not from this world .
< i m not from this world .

> il enseigne depuis ans .
= he s been teaching for years .
< he is been for for years .

> tu es sauve .
= you re safe .
< you re safe .

> je ne suis souvent qu a moitie reveille .
= i m often only half awake .
< i m still only sure .

> nous sommes reconnaissantes .
= we re grateful .
< we re contented .

> j en ai marre de garder des secrets .
= i m tired of keeping secrets .
< i m tired of hearing tom s .

> vous etes a nouveau de retour .
= you re back again .
< you re back again .

> il n est pas marie .
= he s not married .
< he s not married .

> je suis responsable des courses .
= i m in charge of shopping .
< i m very one .
 

Potential Extensions

Overall, this project serves mainly as a toy example and could easily be extended for better performance.

  • Training the model on other languages would be relatively straightforward as it would mainly be a matter of switching out the text file used for training data.

  • The embeddings can be replaced with other word embedding approaches, e.g. word2vec or GloVe.

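  • There are various approaches for calculating attention weights. This implementation scores each position with a feed-forward layer and normalizes the scores with softmax. Experimenting with other scoring functions, such as cosine similarity (which considers only the angle between two vectors) or the dot product (which considers both the angle and the magnitudes), could potentially produce different results.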

  • Experimenting with longer training times, bigger datasets, and parameter tuning would likely yield better results.
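To illustrate the attention bullet above, here is a small sketch of how dot-product and cosine scores could be computed between a decoder state and a set of encoder outputs, then turned into attention weights. The tensors are random stand-ins with made-up sizes, so this shows only the mechanics, not results from the model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
decoder_hidden = torch.randn(1, 8)   # current decoder state (toy size)
encoder_outputs = torch.randn(5, 8)  # one vector per source word

# Dot product: sensitive to both the angle and the magnitude of the vectors.
dot_scores = encoder_outputs @ decoder_hidden.squeeze(0)

# Cosine similarity: depends only on the angle, so scores stay within [-1, 1].
cos_scores = F.cosine_similarity(encoder_outputs, decoder_hidden, dim=1)

# Either set of scores is normalized into weights that sum to 1.
dot_weights = F.softmax(dot_scores, dim=0)
cos_weights = F.softmax(cos_scores, dim=0)
print(dot_weights.shape, cos_weights.shape)  # torch.Size([5]) torch.Size([5])
```

Because cosine scores are bounded, their softmax tends to be flatter than the dot-product version unless the scores are scaled first.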

Personally, I’m interested in running this network on translating jokes2punchlines and punchlines2jokes. My next steps are to acquire or compile a dataset of jokes with a question-answer format to train a seq2seq model. :-)

Text Generation with GPT-2, OpenAI's Recently Released Language Model

Venus, planet of love Was destroyed by global warming, while the other suns have been ravaged by the tides of time. There are no suns and there are no tides, except the sun itself. A few suns exist now in Mars (the moon is now in the orbit of Sol and Jupiter), but they are gone in the future. (It is implied that we can go back home to Earth) Jupiter is the only other planet that is not the source; that planet would be the nearest known red planet to us. So is Earth.

The Earth's only visible source of energy is the sun itself. (In Greek it means "sun" or "heaven.") According to the Old Testament story, Jupiter was so cold that it was able to cause the death of children when they died in a ship. The only real star in the solar system that is capable of causing death is the sun, which must be one of the most powerful stars in the universe. Only the moon can cause death from its star at once, and Venus must be at least one of the most powerful star systems in the entire galaxy (more details here). Earth was never seen as an "open" planet.

Earlier this month, OpenAI released a new text generation model, called GPT-2. GPT-2 stands for “Generative Pre-Training 2”: generative, because we are generating text; pre-training, because instead of training the model for any one specific task, we’re using unsupervised “pre-training” such that the general model can perform on a variety of tasks; and 2, because it’s the second model using this approach, following the first GPT model.

TLDR: The model is pretty good at generating fiction and fantasy, but it’s bad at math and at telling jokes. Skip to the end for my favorite excerpts.

Model Overview

The GPT-2 model uses conditional probability language modeling with a Transformer neural network architecture that relies on self-attention mechanisms (inspired by attention mechanisms from image processing tasks) in lieu of recurrence or convolution. (Side note: interesting to see how advancements in neural networks for image and language processing co-evolve.)
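As a rough picture of what self-attention does, the sketch below implements a single head of scaled dot-product self-attention with a causal mask, so that each position can only look at itself and earlier positions. GPT-2 itself stacks many layers of multi-head attention with additional components; the class name and sizes here are made up for the example.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head causal self-attention over one sequence."""

    def __init__(self, d_model):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x has shape (seq_len, d_model); every position produces q, k, and v.
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.T / math.sqrt(x.size(-1))
        # Causal mask: position i may only attend to positions <= i.
        n = x.size(0)
        mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        return weights @ v, weights

torch.manual_seed(0)
attention = SelfAttention(d_model=8)
tokens = torch.randn(4, 8)  # 4 token embeddings of size 8 (random stand-ins)
out, weights = attention(tokens)
print(out.shape, weights.shape)  # torch.Size([4, 8]) torch.Size([4, 4])
```

Unlike a recurrent network, nothing here is computed step by step: all positions are processed at once, and the mask alone enforces the left-to-right order needed for language modeling.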

The model is trained on about 8 million documents, or about 40 GB of text, from web pages. The dataset, scraped for this model, is called WebText, and is the result of scraping outbound links from Reddit with at least 3 karma. (Some thoughts on this later; see the section on “Training Data”.)

In the original GPT model, the unsupervised pre-training was used as an initial step, followed by a supervised fine-tuning step for various tasks, such as question answering. GPT-2, however, is assessed using only the pre-training step, without the supervised fine-tuning. In other words, the model performs well in a zero-shot setting.

First Impressions

When I first saw the blog post, I was both very impressed and also highly skeptical of the results.



Computational Creativity

I gave a presentation this week about some applications of artificial neural networks in computational creativity. It consists of an overview and discussion of 3 different papers:

  1. A Computational Model of Poetic Creativity with Neural Network as Measure of Adaptive Fitness

  2. A Neural Algorithm of Artistic Style

  3. What Happens Next? Event Prediction Using a Compositional Neural Network Model (part of the What-If Machine project)


Here are the slides: