Table of Contents:

  1. Introduction
  2. Some Basics
  3. Techniques used
  4. Data Preparation and Data Preprocessing
  5. Hyper-parameter selection and Model building
  6. Model Inference
  7. Room for Improvement
  8. Resources & References

1. Introduction

Put simply, the embedding of a word is a dense, lower-dimensional vector representation of that word, as opposed to its sparse, higher-dimensional representation. Words with similar meanings, e.g. “Joyful” and “Cheerful”, and other closely related words, e.g. “Money” and “Bank”, end up with nearby vectors when projected into the lower dimension.

The transformation from words to vectors is called word embedding.

So the underlying concept in creating a mini word embedding boils down to training a simple Auto-Encoder on some text data.

2. Some Basics

Before we proceed to create our mini word embedding, it’s good to brush up on the basic concepts of word embedding that the deep learning community has given us so far.

The popular word embedding models out there are as follows:

  1. Word2Vec (Google)
  2. GloVe (Stanford University)

They are trained on huge text corpora, such as Wikipedia or a scrape of the web, often 6 billion words or more (the higher dimension), and project each word into dense embeddings with as few as 100, 200, or 300 dimensions (the lower dimension).

Here in our model, we project the words into dense embeddings with just 2 dimensions.

3. Techniques used

Word2Vec uses one of 2 primary techniques to accomplish the task (GloVe instead learns from global word co-occurrence counts).

1. Continuous-Bag-of-Words (CBOW).

2. Skip-Gram.

1. CBOW: CBOW attempts to guess the output (target word) from its neighboring words (context words). The window size is a hyper-parameter here.

Example →

Sentence: cats and mice are buddies

Target Word (Output): mice (let’s say)

Context Words (Inputs): cats and _ are buddies

2. Skip-Gram: Skip-Gram does the reverse: from a single input word, it guesses each of the neighboring words. We will be implementing this in this post.

Example →

Sentence: cats and mice are buddies

Target Word (Output): and, mice …

Context Word (Input): cats, cats …
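
To make the difference concrete, here is a minimal sketch that builds one CBOW example and the skip-gram pairs for the sentence above. The function names are hypothetical, the CBOW context is assumed to span the whole sentence (as in the example), and the skip-gram window is assumed to reach 2 words on each side.

```python
def cbow_example(tokens, target_index):
    """Return (context_words, target_word) for one CBOW training example."""
    context = tokens[:target_index] + tokens[target_index + 1:]
    return context, tokens[target_index]


def skipgram_pairs(tokens, input_index, window_size=2):
    """Return (input_word, target_word) pairs for one skip-gram input word."""
    lo = max(0, input_index - window_size)
    hi = min(len(tokens), input_index + window_size + 1)
    return [(tokens[input_index], tokens[j]) for j in range(lo, hi) if j != input_index]


sentence = "cats and mice are buddies".split()

# CBOW: the surrounding words are the inputs, "mice" is the output.
print(cbow_example(sentence, 2))    # (['cats', 'and', 'are', 'buddies'], 'mice')

# Skip-gram: "cats" is the input, each neighbor inside the window is an output.
print(skipgram_pairs(sentence, 0))  # [('cats', 'and'), ('cats', 'mice')]
```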


CBOW Vs Skip-gram

4. Data Preparation and Data Preprocessing

Here comes the fun part. As I stated before, the state-of-the-art models above were trained on a large amount of text data. Since we are interested in a mini version, let’s choose a small dataset.

And to make things exciting, I have chosen a Tom and Jerry cartoon play as our data corpus.

Tom and Jerry — Play

Our mini dataset looks like this,

We will be using the data above, so let’s start the pre-processing steps.

1. First, we need to map each unique word to an integer and later map that integer to a one-hot encoding.

Data — Preprocess
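
A minimal sketch of this mapping, assuming lowercasing and whitespace tokenization. The two sentences are a hypothetical stand-in for the corpus, and `build_vocab` and `one_hot` are illustrative names, not the notebook's own.

```python
def build_vocab(sentences):
    """Map each unique word to an integer, in order of first appearance."""
    word_to_int = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in word_to_int:
                word_to_int[word] = len(word_to_int)
    return word_to_int


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


# Hypothetical stand-in sentences, not the actual dataset.
sentences = ["cats and mice are buddies", "mice and cats are pals"]
word_to_int = build_vocab(sentences)
int_to_word = {i: w for w, i in word_to_int.items()}

print(word_to_int)
# {'cats': 0, 'and': 1, 'mice': 2, 'are': 3, 'buddies': 4, 'pals': 5}
print(one_hot(word_to_int["mice"], len(word_to_int)))
# [0, 0, 1, 0, 0, 0]
```

With the real corpus the vectors would have length 30, one slot per unique word.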

2. Once we have the integer and one-hot mapping for every word, we can create batches for training.

Since we have limited data and are implementing a mini word embedding, we shall use the skip-gram model with a window size of 2 (the 2 adjacent words are treated as targets) and predict the target word given the context word (input).

Refer to the picture below to understand our skip-gram model.

Our training batch

Sample Data Format

The code implementation for the above batch preparation is shown below.
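
As a rough sketch of the batch preparation (not the notebook's exact code), the function below pairs every word with each neighbor inside the window. It assumes the window reaches 2 words on both sides and that inputs are stored as one-hot rows with integer target labels; `make_skipgram_batches` is a hypothetical name.

```python
def make_skipgram_batches(sentences, word_to_int, window_size=2):
    """Return (inputs, targets): one-hot input rows and integer target labels."""
    vocab_size = len(word_to_int)
    inputs, targets = [], []
    for sentence in sentences:
        ids = [word_to_int[w] for w in sentence.lower().split()]
        for i, center in enumerate(ids):
            lo = max(0, i - window_size)
            hi = min(len(ids), i + window_size + 1)
            for j in range(lo, hi):
                if j == i:
                    continue  # a word is never its own target
                row = [0.0] * vocab_size
                row[center] = 1.0
                inputs.append(row)
                targets.append(ids[j])
    return inputs, targets


sentences = ["cats and mice are buddies"]
word_to_int = {w: i for i, w in enumerate(sentences[0].split())}
inputs, targets = make_skipgram_batches(sentences, word_to_int)

print(len(inputs))            # 14 pairs for a 5-word sentence with window 2
print(inputs[0], targets[0])  # [1.0, 0.0, 0.0, 0.0, 0.0] 1   ("cats" -> "and")
```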

5. Hyper-parameter selection and Model building

Now that we are done creating our batches, let’s build a simple Auto-Encoder-type model for training. In simple words, it’s a neural network that compresses a higher-dimensional input into a lower dimension and later decompresses it back to the higher dimension.

The lower-dimensional layer is therefore forced to capture the important features of the input, and in our case its output serves as our word embedding.

Auto-Encoder Design

In the design above, I have modified the network’s forward function to return both the output of the final layer (30D) and the output of the middle layer (2D), which is our word embedding.

To design the neural network I will be using the PyTorch framework.

Hyper-Parameter Selection :

1. input_size = 30 (Input as well as Output Dimension)

2. hidden_size = 2 (Hidden Layer dimension)

3. learning_rate = 0.01 (lr for weight optimization)

4. num_epochs = 5000 (How many times to train the model on the entire data)

With the above specifications, I have designed the model in PyTorch.

See the code implementation below.
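
A minimal PyTorch sketch of such a network, using the sizes listed above. The class name, the use of two plain linear layers, and the absence of activation functions are assumptions for illustration; the notebook's model may differ.

```python
import torch
import torch.nn as nn


class MiniWordEmbedding(nn.Module):
    """30 -> 2 -> 30 auto-encoder-style network that also returns the 2D code."""

    def __init__(self, input_size=30, hidden_size=2):
        super().__init__()
        self.encoder = nn.Linear(input_size, hidden_size)  # 30D one-hot -> 2D embedding
        self.decoder = nn.Linear(hidden_size, input_size)  # 2D embedding -> 30D scores

    def forward(self, x):
        embedding = self.encoder(x)
        logits = self.decoder(embedding)
        return logits, embedding  # final-layer output and middle-layer output


model = MiniWordEmbedding()
one_hot_batch = torch.eye(30)[:4]      # four one-hot rows as a dummy batch
logits, embedding = model(one_hot_batch)
print(logits.shape, embedding.shape)   # torch.Size([4, 30]) torch.Size([4, 2])
```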

Now let’s start the training process.
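
The training loop could look roughly like the sketch below. The cross-entropy loss, the Adam optimizer, and the stand-in batch (each word paired with the next word id) are assumptions made for illustration; only the four hyper-parameter values come from the list above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size = 30, 2          # values from the hyper-parameter list
learning_rate, num_epochs = 0.01, 5000

# Hypothetical stand-in batch: word i is paired with word i + 1 as its target.
inputs = torch.eye(input_size)                          # 30 one-hot input rows
targets = (torch.arange(input_size) + 1) % input_size   # integer target labels

encoder = nn.Linear(input_size, hidden_size)
decoder = nn.Linear(hidden_size, input_size)
params = list(encoder.parameters()) + list(decoder.parameters())

criterion = nn.CrossEntropyLoss()                        # assumed loss
optimizer = torch.optim.Adam(params, lr=learning_rate)   # assumed optimizer

losses = []
for epoch in range(num_epochs):
    optimizer.zero_grad()
    logits = decoder(encoder(inputs))
    loss = criterion(logits, targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")

# After training, the encoder output is the 2D embedding of each word.
with torch.no_grad():
    embeddings = encoder(inputs)
print(embeddings.shape)                  # torch.Size([30, 2])
```

Plotting `losses` gives the loss curve, and the rows of `embeddings` are the 2D points to scatter-plot per word.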


The loss graph looks good, and our model neither overfits nor underfits. (You will have to train for more epochs if you have a larger training set.) Now let’s pass all our inputs through the model, get the 2D word embedding (the lower-dimensional representation) of each input word, and plot them to see whether our model has learned the semantic meaning in our data corpus.

6. Model Inference

Here we shall pass every word in our corpus through the model and extract the 2D latent representation it has learned (the word embedding).

Untrained Model’s Output

Trained Model’s Output

And we did it. Just as we expected, the words “Mice” and “Cat” are very close in the embedding space; the model learned this from the data corpus, where they very frequently occur one after another.

The word groups “Buddies, Pals, and Chums”, “lives, sleep & house” and “catches, chases” are also close together in the embedding space.

The reason is that the words in each group are semantically related. For instance, “Buddies, Pals, and Chums” generally carry the same meaning (friends/partners), and our model captured it.

Similarly, the words tom (cat) and jerry (mice) frequently occur together, so the model infers a relationship between them and projects them close together in the latent space.

This is exactly what happens inside the word2vec model on a larger scale, except that it has a different architecture (CBOW or Skip-gram with different window sizes and multiple target words) and is trained on a much larger volume of data.

7. Room for Improvement

1. Changing Model’s Architecture :

This model cannot capture features from a high-volume data corpus, so we would need to change its architecture to handle that task.

In the above implementation, we used a single target word per input word for prediction. This can be extended as in the “New Architecture” figure, where the same neural network predicts multiple target words for a given input word, which lets the model capture more nuances in the dataset.

Old Architecture

New Architecture
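
One possible reading of the multi-target setup, sketched in plain Python: instead of one target label per example, each input word gets a multi-hot target vector marking every word in its window. The function name and the 2-words-each-side window are assumptions; targets of this shape would typically be trained with a per-label loss such as PyTorch's `nn.BCEWithLogitsLoss`.

```python
def multi_hot_targets(ids, vocab_size, window_size=2):
    """For each position, return (input_id, multi-hot vector of its context words)."""
    examples = []
    for i, center in enumerate(ids):
        target = [0.0] * vocab_size
        lo = max(0, i - window_size)
        hi = min(len(ids), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                target[ids[j]] = 1.0  # mark every neighbor as a correct output
        examples.append((center, target))
    return examples


tokens = "cats and mice are buddies".split()
word_to_int = {w: i for i, w in enumerate(tokens)}
ids = [word_to_int[w] for w in tokens]
examples = multi_hot_targets(ids, len(word_to_int))

print(examples[0])  # (0, [0.0, 1.0, 1.0, 0.0, 0.0])   "cats" -> {"and", "mice"}
print(examples[2])  # (2, [1.0, 1.0, 0.0, 1.0, 1.0])   "mice" -> all four neighbors
```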

Since I want this post to be simple and distinct from the other word embedding articles available on Medium, I used a single-target-word predictor model. If you understand the underlying concept firmly, you can now easily follow those other articles.

2. Sub-Sampling:

In the above implementation, since we had very little data (vocab size < 30), we converted each word into a one-hot encoding of length 30, with the value 1 at the position of the corresponding word and 0 everywhere else. Now imagine we had 6 billion words to train on: applying the same concept as-is would not be practical. One way to reduce the problem is sub-sampling, which removes the rare and the very frequent words.

In the corpus, words such as “of, the, and, for, etc.” (stop-words) don’t provide much context to the nearby words. Since we are interested in the semantic relationships between words, we can safely discard them. This removes noise from the data and in turn gives us greater accuracy, faster training, and better representations. This process is called subsampling.

Identifying and removing each such word manually is a tedious task, so we turn to probability: we calculate a discard probability for every word and set a threshold to decide whether to keep or discard it.

Probability of a word

p(Wi) → Probability that the word Wi is discarded

(close to 1 → discard it; lower, e.g. 0.3 → keep it)

t → Threshold parameter (say 0.001)

f(Wi) → Frequency of a word (Wi) in the total dataset

Frequency of a word = (Number of times that word appears in the document/Total number of words in the document).

For illustration, say the total number of words in the document is 60, and the words “the”, “cat”, and “floccinaucinihilipilification” occur 12, 5, and 1 times respectively.

P(“the”) discarding probability

P("cat") discarding probability

P("floccinaucinihilipilification") discarding probability
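
Assuming the discard probability is the subsampling formula from the original word2vec paper, p(Wi) = 1 - sqrt(t / f(Wi)), the three example values can be worked out with a few lines of Python. The function name is hypothetical, and clipping at 0 for extremely rare words is an added safeguard.

```python
import math


def discard_probability(count, total_words, t=0.001):
    """p(Wi) = 1 - sqrt(t / f(Wi)), clipped at 0 for very rare words."""
    frequency = count / total_words  # f(Wi)
    return max(0.0, 1.0 - math.sqrt(t / frequency))


total_words = 60
counts = {"the": 12, "cat": 5, "floccinaucinihilipilification": 1}
for word, count in counts.items():
    print(f"{word}: {discard_probability(count, total_words):.3f}")
# the: 0.929
# cat: 0.890
# floccinaucinihilipilification: 0.755
```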

So from the values above, we can sensibly choose to keep the words whose discard probability lies between 0.80 and 0.90. See the plot for the overall values.

3. Negative Sampling

In a real-world implementation where we deal with a large text corpus, consider the output layer: say there are 10,000 one-hot encoded labels. On every training example we change the weights for all 10,000 labels by a very small amount, even though only one of them is the true label (..0010000..), which makes training very inefficient. One workaround is to update only a small subset of the weights: those of the correct label and of a small number of incorrect labels. This technique is called negative sampling, and it is used when training the word2vec model.
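
A small sketch of the selection step, assuming negatives are drawn in proportion to word count raised to the power 0.75 (the distribution used in the word2vec paper). The function name and the word counts are hypothetical, and drawing distinct negatives is a simplification.

```python
import random


def sample_negatives(positive_id, word_counts, k=5, power=0.75, rng=random):
    """Draw k distinct word ids other than positive_id, weighted by count ** power."""
    candidates = [i for i in range(len(word_counts)) if i != positive_id]
    weights = [word_counts[i] ** power for i in candidates]
    negatives = set()
    while len(negatives) < k:  # requires k <= len(candidates)
        negatives.add(rng.choices(candidates, weights=weights, k=1)[0])
    return sorted(negatives)


rng = random.Random(0)
word_counts = [12, 5, 1, 7, 3, 9, 2, 4, 6, 8]  # hypothetical counts, 10-word vocabulary
positive_id = 2
negatives = sample_negatives(positive_id, word_counts, k=3, rng=rng)

# Only these output rows receive a weight update for this training example.
rows_to_update = [positive_id] + negatives
print(len(rows_to_update), "of", len(word_counts), "output rows updated")
```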

8. Resources & References

I hope I was able to provide some visual understanding of our mini word embedding. Let me know your thoughts in the comment section.

Check out the notebooks that contain the entire code implementation, and feel free to break it.

The complete code implementation is available at:

  1. Github
  2. Google Colab
  3. Kaggle

Until then, see you next time.