Implementing a Seq2Seq Neural Network with Attention for Machine Translation from Scratch using PyTorch

Continuing with my PyTorch implementation projects, last week I followed this PyTorch tutorial to implement a sequence-to-sequence (seq2seq) model: an encoder-decoder network with an attention mechanism, applied to a French-to-English translation task (and vice versa). The script, pre-trained model, and training data can be found on my GitHub repo.

In the following example, the first line (>) is the French input, the second line (=) is the English ground truth, and the third line (<) is the resulting English translation output from the model.

 
> je n appartiens pas a ce monde .
= i m not from this world .
< i m not from this world .
 

Model Overview

In this particular PyTorch implementation, the network comprises three main components:

  • an encoder, which encodes the input text into a vector representation. For this project, the encoder is a recurrent neural network using gated recurrent units (GRUs). For each input word, the encoder outputs a vector and a hidden state, and passes the hidden state on to the next input word.

  • attention, a set of weights used during decoding to determine how much each encoder output contributes to the current prediction. The attention weights are calculated using a simple feed-forward layer followed by a softmax.

  • a decoder, which takes the encoder outputs and attention weights to generate a prediction for the next word. In this project, the decoder is a recurrent neural network using GRUs. It starts from a start-of-sentence token and the encoder’s last hidden state, which can be interpreted as a context vector for the input. At each step, the decoder combines the attention-weighted encoder outputs with the current token and the current hidden state to predict the next word with a softmax.
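The three components above can be sketched in PyTorch roughly as follows. This is a minimal sketch in the spirit of the description, not the script from the repo: the class names, `MAX_LENGTH`, `SOS_TOKEN`, the layer sizes, and the word indices are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_LENGTH = 10  # assumed maximum sentence length (hypothetical value)
SOS_TOKEN = 0    # assumed index of the start-of-sentence token


class EncoderRNN(nn.Module):
    """Embeds one word index at a time and feeds it through a GRU."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, word_idx, hidden):
        embedded = self.embedding(word_idx).view(1, 1, -1)  # (seq=1, batch=1, hidden)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden


class AttnDecoderRNN(nn.Module):
    """GRU decoder whose attention weights come from a linear layer + softmax."""

    def __init__(self, hidden_size, vocab_size, max_length=MAX_LENGTH):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.attn = nn.Linear(hidden_size * 2, max_length)
        self.attn_combine = nn.Linear(hidden_size * 2, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, word_idx, hidden, encoder_outputs):
        embedded = self.embedding(word_idx).view(1, 1, -1)
        # One weight per encoder position, computed from [current token; hidden state].
        scores = self.attn(torch.cat((embedded[0], hidden[0]), dim=1))
        attn_weights = F.softmax(scores, dim=1)  # (1, max_length)
        # Weighted sum of the encoder outputs.
        context = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))
        combined = self.attn_combine(torch.cat((embedded[0], context[0]), dim=1))
        output, hidden = self.gru(F.relu(combined).unsqueeze(0), hidden)
        log_probs = F.log_softmax(self.out(output[0]), dim=1)  # (1, vocab_size)
        return log_probs, hidden, attn_weights


# Usage: encode a made-up sentence, then run a single decoding step.
hidden_size, src_vocab, tgt_vocab = 16, 20, 25
encoder = EncoderRNN(src_vocab, hidden_size)
decoder = AttnDecoderRNN(hidden_size, tgt_vocab)

sentence = torch.tensor([4, 7, 2, 1])  # hypothetical word indices
encoder_outputs = torch.zeros(MAX_LENGTH, hidden_size)
hidden = torch.zeros(1, 1, hidden_size)
for i, idx in enumerate(sentence):
    output, hidden = encoder(idx, hidden)
    encoder_outputs[i] = output[0, 0]

# The decoder starts from the SOS token and the encoder's last hidden state.
decoder_input = torch.tensor([SOS_TOKEN])
log_probs, hidden, attn_weights = decoder(decoder_input, hidden, encoder_outputs)
next_word = log_probs.argmax(dim=1)  # greedy choice for the next word
```

In a full decoding loop, `next_word` would be fed back in as the next `decoder_input` until an end-of-sentence token is produced. Fixing the attention layer's output size to `MAX_LENGTH` keeps the sketch simple, at the cost of limiting the sentence length the model can handle.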

The training data comes from the Tatoeba Project and consists of language pairs stored in a text file. While this model uses the French-to-English data file, the file can easily be replaced with any other language pair file from the collection. As a caveat, I have not tested this on languages written in non-Latin scripts (e.g. Traditional Chinese, Arabic, etc.).
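For illustration, here is a rough sketch of how such a file could be read into sentence pairs. It assumes one tab-separated pair per line with English first, and a normalization step (lowercasing, stripping accents, dropping everything except letters and basic punctuation) that would be consistent with how the example sentences in this post look. The function names are hypothetical and the actual script may differ.

```python
import re
import unicodedata


def unicode_to_ascii(text):
    """Strip accents by decomposing characters and dropping combining marks."""
    return "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if unicodedata.category(ch) != "Mn"
    )


def normalize(text):
    """Lowercase, strip accents, and keep only letters and basic punctuation."""
    text = unicode_to_ascii(text.lower().strip())
    text = re.sub(r"([.!?])", r" \1", text)   # separate punctuation from words
    text = re.sub(r"[^a-z.!?]+", " ", text)   # everything else becomes a space
    return text.strip()


def parse_pairs(lines, reverse=False):
    """Turn tab-separated lines into [source, target] pairs.

    Only the first two fields are used; any extra columns are ignored.
    With reverse=True the two languages are swapped.
    """
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pair = [normalize(fields[0]), normalize(fields[1])]
        pairs.append(pair[::-1] if reverse else pair)
    return pairs


# Usage with an in-memory sample; an open file handle works the same way.
sample = ["I'm not from this world.\tJe n'appartiens pas à ce monde.\n"]
pairs = parse_pairs(sample, reverse=True)
print(pairs[0])
# → ['je n appartiens pas a ce monde .', 'i m not from this world .']
```

A normalization like this only keeps the letters a to z, which is one reason a language written in a non-Latin script would need a different preprocessing step.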

Results

In the following examples, the first line (>) is the French input, the second line (=) is the English ground truth, and the third line (<) is the resulting English translation output from the model.

Overall, the results are fairly decent considering the small size of the training set: a 9 MB text file, or about 1,400 language pairs.

 
> je suis impatient de la prochaine fois .
= i m looking forward to the next time .
< i m looking forward to the next time .

> je n appartiens pas a ce monde .
= i m not from this world .
< i m not from this world .

> il enseigne depuis ans .
= he s been teaching for years .
< he is been for for years .

> tu es sauve .
= you re safe .
< you re safe .

> je ne suis souvent qu a moitie reveille .
= i m often only half awake .
< i m still only sure .

> nous sommes reconnaissantes .
= we re grateful .
< we re contented .

> j en ai marre de garder des secrets .
= i m tired of keeping secrets .
< i m tired of hearing tom s .

> vous etes a nouveau de retour .
= you re back again .
< you re back again .

> il n est pas marie .
= he s not married .
< he s not married .

> je suis responsable des courses .
= i m in charge of shopping .
< i m very one .
 

Potential Extensions

Overall, this project serves mainly as a toy example and could easily be extended for better performance.

  • Training the model on other languages would be relatively straightforward as it would mainly be a matter of switching out the text file used for training data.

  • The embeddings can be replaced with other word embedding approaches, e.g. word2vec or GloVe.

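  • There are various approaches for calculating attention weights. This implementation scores each position with a feed-forward layer and normalizes the scores with softmax. Experimenting with other scoring functions, such as cosine similarity (which depends only on the angle between two vectors) or the dot product (which depends on both the angle and the magnitudes of the two vectors), could potentially produce different results.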

  • Experimenting with longer training times, bigger datasets, and parameter tuning would likely yield better results.
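To make the difference between the dot product and cosine scoring concrete, here is a small framework-free sketch. The vectors are made up, and the helper names are hypothetical. In both cases a softmax still turns the raw scores into attention weights.

```python
import math


def dot(u, v):
    """Dot product: sensitive to both angle and magnitude."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity: sensitive to angle only (undefined for zero vectors)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, score_fn):
    """Score the query against every key, then normalize with softmax."""
    return softmax([score_fn(query, k) for k in keys])


# Made-up decoder state and encoder outputs. The first two encoder
# outputs point in the same direction but differ in length.
decoder_hidden = [1.0, 0.0]
encoder_outputs = [[1.0, 0.0], [3.0, 0.0], [0.0, 1.0]]

dot_weights = attention_weights(decoder_hidden, encoder_outputs, dot)
cos_weights = attention_weights(decoder_hidden, encoder_outputs, cosine)
```

With the dot product, the longer second vector receives the most weight. With cosine similarity, the first two vectors receive equal weight because they point in the same direction.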

Personally, I’m interested in running this network on translating jokes2punchlines and punchlines2jokes. My next steps are to acquire or compile a dataset of jokes with a question-answer format to train a seq2seq model. :-)