In the last posts, we investigated LSA and LDA for classifying documents. Their outputs are embeddings of terms and documents, with which we can classify terms and documents using clustering techniques. LSA and LDA both belong to the count-based family of Vector Space Models (VSMs). VSMs have a long, rich history in NLP, but all methods depend in some way on the Distributional Hypothesis, which states that words that appear in the same contexts share semantic meaning. This post introduces a predictive family of VSMs: Word2Vec. Predictive models directly try to predict a word from its neighbors in terms of learned small, dense embedding vectors (vector representations of words).
One-hot Vector
For all sentences, we'd like to represent each word of a sentence as a vector. The simplest way is the one-hot vector: a vector with a single 1 at the word's index and 0 everywhere else. More specifically, for thousands of sentences, we tokenize each sentence into words and put all the words into a set, which gives us the vocabulary. Visualized as follows:
Note that the vocabulary has an order; once set, the index of a word does not change. Take the sentence 'I love salad'. 'I' is at the first place of the vocabulary, so the initial embedding (one-hot) vector of 'I' is $[1, 0, 0, \dots, 0]^T$. Similarly, the one-hot vectors of 'love' and 'salad' each have a single 1 at their own vocabulary index. The length of each vector is the vocabulary size $V$. So the embedding of the sentence is a $V \times 3$ matrix, where the 3 is the length of the sentence.
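As a concrete sketch, with a tiny made-up vocabulary (so the indices are only illustrative), the one-hot encoding of 'I love salad' could be built like this:

import numpy as np

# hypothetical toy vocabulary; in practice it is built from thousands of sentences
vocab = ['I', 'love', 'salad', 'pizza', 'we']
word2idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

def one_hot(word):
    """Return a length-V vector with a single 1 at the word's index."""
    vec = np.zeros(V, dtype=np.int32)
    vec[word2idx[word]] = 1
    return vec

sentence = 'I love salad'.split()
# stack the one-hot columns: a V x 3 matrix, 3 being the sentence length
sentence_matrix = np.stack([one_hot(w) for w in sentence], axis=1)
print(sentence_matrix.shape)   # (5, 3)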
In fact, when we actually code the model, we don't need to translate every word into this one-hot vector; that costs time and memory. When an embedding matrix $W$ of dimension $4 \times V$ is right-multiplied by the one-hot vector of a word, say the $i$-th word in the vocabulary, we just get the $i$-th column of $W$. Thus, we normally embed the sentence 'I love salad' simply as the list of its three word indices in the vocabulary. The 4 in the dimension $4 \times V$ is the hidden_size (how many neurons we use in the hidden layer of the neural network), which we will talk about later.
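A minimal NumPy check of this equivalence (the matrix values below are random placeholders, not learned weights):

import numpy as np

V, hidden_size = 5, 4
W = np.random.rand(hidden_size, V)       # hypothetical 4 x V embedding matrix

i = 0                                    # index of 'I' in the toy vocabulary above
x = np.zeros(V); x[i] = 1                # one-hot vector of 'I'

# right-multiplying by the one-hot vector just selects the i-th column of W
assert np.allclose(W @ x, W[:, i])

# so in code we only store the word indices
sentence_indices = [0, 1, 2]             # 'I love salad' under the toy vocabulary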
Word2Vec
Word2Vec is a shallow, two-layer neural network that analyzes raw text and produces a vector space of words. There are two architectures: continuous bag-of-words (CBOW) and continuous skip-gram. In the CBOW architecture, the model predicts the current word from a window of surrounding context words, while in the skip-gram architecture, the model uses the current word to predict the surrounding window of context words (Word2Vec_wiki). The distinction can be visualized as below:
Note, the skip-gram shown above is the 0-skip-2-gram model, where only the two direct neighbors on each side are considered. We could also produce 1-skip-2-grams, which would additionally include ("lunch", "salad"), ("lunch", "would"), ("as", "I"), ("as", "have"), etc., besides the 0-skip pairs above (Word2Vec_tutorial).
Now, imagine a model where we want to predict the surrounding words given a center word. In the sentence 'I love salad', the word 'I' produces the pairs ('I', 'love') and ('I', 'salad'). Across thousands of sentences, we may get tens of thousands of pairs or more. In fact, 'I love salad' alone produces 6 pairs under the 0-skip-2-gram model. This can be treated as a classification problem, with a word as input and the vocabulary-size set of target words as classes. The values in the output vector then represent the probability that each word appears near the input word. In this model, we set up one hidden layer with 4 neurons (hidden size is 4). A larger hidden size generally works better, but I prefer 4 for the convenience of this post.
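To make the 6 pairs concrete, here is a small sketch that simply enumerates all skip-gram pairs within a window of 2 (no sampling; the implementation later in this post samples pairs instead):

def skip_gram_pairs(words, window=2):
    """Enumerate (center, context) pairs within the given window."""
    pairs = []
    for i, center in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

print(skip_gram_pairs('I love salad'.split()))
# [('I', 'love'), ('I', 'salad'), ('love', 'I'), ('love', 'salad'),
#  ('salad', 'I'), ('salad', 'love')]   -> 6 pairs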
There is a parameter matrix and an activation function that map the inputs to the hidden states. Assume the embedding matrix $W$ is the $4 \times V$ one I gave in the last section and the activation function producing the output is the softmax. For a one-hot input $x$, the hidden state is $h = Wx$, i.e. the column of $W$ corresponding to the input word, so the first neuron's state is
$h_1 = \sum_{k=1}^{V} W_{1k}\, x_k .$
The calculation from the hidden layer to the output layer has the same form, with a second weight matrix $W'$ followed by the softmax, which gives a vector of probabilities over the vocabulary:
$\hat{y} = \mathrm{softmax}(W' h).$
All the pairs are used to train the model to minimize the loss between the output vectors and the actual targets. Ultimately, we get an optimized embedding matrix $W$. Then, using this embedding matrix, we can predict the words around a given center word. The vector picked out of $W$ by the center word is in fact the embedding of that word, since it is unique for each word and captures the word's semantics.
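A minimal NumPy sketch of this forward pass (random matrices stand in for the learned weights; V = 5 and hidden size 4 are just the toy values used above):

import numpy as np

V, hidden_size = 5, 4
W  = np.random.rand(hidden_size, V)   # input-to-hidden embedding matrix (to be learned)
W2 = np.random.rand(V, hidden_size)   # hidden-to-output matrix (to be learned)

x = np.zeros(V); x[0] = 1             # one-hot vector of the center word 'I'

h = W @ x                             # hidden state = the column of W for 'I'
logits = W2 @ h
y_hat = np.exp(logits) / np.exp(logits).sum()   # softmax: probabilities over context words
print(y_hat.round(3), y_hat.sum())    # probabilities sum to 1.0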
This is Word2Vec, which can be used to embed words and represent the relationships among them. It is indeed just a shallow neural network.
Implementation in Python and TensorFlow
- Batch data
In this post, I write the code in Python and build the graph with TensorFlow. First, I construct a class in "batch.py", with a method that returns one batch of (center, target) pairs at a time. This makes it more convenient to feed the data into TensorFlow.
The corresponding github repo is here!!!
"""This file is called batch.py, which will be imported in word2vec.py"""
import numpy as np
class batch:
def __init__(self, centers, targets, batch_size):
self.centers = centers
self.targets = targets
self.batch_size = batch_size
def next_(self, batch_group):
batch_index = self.batch_size * batch_group
s = batch_index % len(self.centers)
e = (s + self.batch_size)%len(self.centers)
if s < e :
centers_bp = np.reshape(self.centers[s: e],[self.batch_size])
targets_bp = np.reshape(self.targets[s: e], [self.batch_size, 1])
else:
centers_bp = np.reshape(self.centers[:e] + self.centers[s:],[self.batch_size])
targets_bp = np.reshape(self.targets[:e] + self.targets[s:], [self.batch_size, 1])
return centers_bp, targets_bp
if __name__ == '__main__':
batch_ = batch()
batch_.next_()
- Word2Vec
Now we can focus on the Word2Vec implementation. Note that 'data' in the input list of the 'word2vec' function is a list of sentences, with each sentence represented as a list of word indices, like [[1,2,3], [2,3,4], ...]. This code is based on Google's TensorFlow tutorial on Word2Vec, simplified to the extent that I need. More specifically, I split the batching and the skip-pair generation into two separate pieces, which makes them easier for me to understand. I strongly suggest you read the official code on your own; it is super helpful (TF-tutorial).
The corresponding github repo is here!!!
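To be explicit about the expected input format, here is a toy illustration (the corpus and the resulting indices are made up, only to show the shape of the 'data' argument):

# hypothetical corpus and vocabulary, only to illustrate the 'data' format
corpus = ['I love salad', 'we love pizza']
vocab = sorted({w for s in corpus for w in s.split()})
word2idx = {w: i for i, w in enumerate(vocab)}

data = [[word2idx[w] for w in s.split()] for s in corpus]
print(data)   # [[0, 1, 3], [4, 1, 2]] -- a list of sentences as index lists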
# skip_window: how far we look from the center word, on each side, for context labels.
# num_skips: how many (center, context) pairs we produce for each center word.
# batch_size: how many pairs are processed at each step.
# vocab_size: remember we initially embed each word as a one-hot vector; its dimension is vocab_size.
# hidden_size: how many neurons we set up in the hidden layer.
# n_epochs: how many training iterations we'd like to run.
# num_sampled: number of negative samples; we use noise-contrastive estimation (NCE) to ease the computation.
# I am using TensorFlow 1.0 and Python 3.5
import collections
import math
import random

import numpy as np
import tensorflow as tf

from batch import batch  # remember the batch.py we defined above?

# define word2vec function
def word2vec(data, num_skips, skip_window,
             batch_size, hidden_size, vocab_size,
             learning_rate, n_epochs, num_sampled):

    # generate (center, label) pairs for one sentence; here 'data' -> one sentence (a list of word indices)
    def generate_skip(data):
        data_index = 0
        size = 2 * skip_window * len(data)
        assert num_skips <= 2 * skip_window
        centers = np.ndarray(shape=(size), dtype=np.int32)
        labels = np.ndarray(shape=(size, 1), dtype=np.int32)
        span = 2 * skip_window + 1  # [structure: skip_window, target, skip_window]
        buffer = collections.deque(maxlen=span)
        for _ in range(span):
            buffer.append(data[data_index])
            data_index = (data_index + 1) % len(data)
        for i in range(size // num_skips):
            target = skip_window  # the center word sits in the middle of the buffer
            targets_to_avoid = [skip_window]
            for j in range(num_skips):
                # sample a context position in the window that has not been used yet
                while target in targets_to_avoid:
                    target = random.randint(0, span - 1)
                targets_to_avoid.append(target)
                centers[i * num_skips + j] = buffer[skip_window]
                labels[i * num_skips + j, 0] = buffer[target]
            buffer.append(data[data_index])
            data_index = (data_index + 1) % len(data)  # move one word forward
        return centers, labels

    # produce skip pairs for each sentence, then concatenate
    centers = []
    targets = []
    for sequence in data:
        centers_element, targets_element = generate_skip(sequence)
        centers.append(centers_element)
        targets.append(targets_element)
    centers = list(np.concatenate(centers))
    targets = list(np.concatenate(targets))

    """TensorFlow infrastructure"""
    # Step 1: define the placeholders for input and output
    with tf.name_scope("batch_data"):
        center_words = tf.placeholder(tf.int32, shape=[batch_size], name='center_words')
        target_words = tf.placeholder(tf.int32, shape=[batch_size, 1], name='target_words')

    # Step 2: define weights. In word2vec, the embedding matrix is the weight matrix
    with tf.name_scope("embed"):
        embed_matrix = tf.Variable(tf.random_uniform([vocab_size, hidden_size], -1.0, 1.0),
                                   name="embed_matrix")

    # Step 3 + 4: define inference + loss function
    with tf.name_scope("loss"):
        # look up the rows of embed_matrix for the center words (equivalent to the one-hot multiplication)
        embed = tf.nn.embedding_lookup(embed_matrix, center_words, name="embed")
        nce_weight = tf.Variable(tf.truncated_normal([vocab_size, hidden_size],
                                                     stddev=1.0 / math.sqrt(hidden_size)),
                                 name="nce_weight")
        nce_bias = tf.Variable(tf.zeros([vocab_size]), name="bias")
        loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, biases=nce_bias,
                                             labels=target_words, inputs=embed,
                                             num_sampled=num_sampled, num_classes=vocab_size,
                                             name="loss"))

    # Step 5: define optimizer
    with tf.name_scope("optimizer"):
        optimizer = tf.train.AdagradOptimizer(learning_rate).minimize(loss)

    # Step 6: run the graph
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print('Initialized, now begin Word2Vec!')
        writer = tf.summary.FileWriter('./my_graph', sess.graph)
        batch_ = batch(centers, targets, batch_size)
        for current_epoch in range(n_epochs):
            centers_batch, targets_batch = batch_.next_(current_epoch)
            feed = {center_words: centers_batch, target_words: targets_batch}
            loss_, _ = sess.run([loss, optimizer], feed_dict=feed)
            if (current_epoch + 1) % 10000 == 0:
                print('The loss for {} iteration is {}'.format(current_epoch + 1, loss_))
        embedding = sess.run(embed_matrix)
        writer.close()
    return embedding
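As a usage sketch (all hyperparameter values below are illustrative choices of mine, not the tutorial's settings), training on the toy 'data' from the earlier illustration would look something like this:

# toy corpus from the earlier illustration: sentences as lists of word indices
data = [[0, 1, 3], [4, 1, 2]]
vocab_size = 5

embedding = word2vec(data,
                     num_skips=2, skip_window=1,
                     batch_size=8, hidden_size=4, vocab_size=vocab_size,
                     learning_rate=1.0, n_epochs=10000, num_sampled=2)
print(embedding.shape)   # (vocab_size, hidden_size) = (5, 4)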
Now we have the embed_matrix, of shape vocab_size $\times$ hidden_size, i.e. the transpose of the $4 \times V$ matrix $W$ above. To get the meaningful embedding of a word, you can simply multiply its one-hot vector by this matrix, or equivalently read off the corresponding row. For example, for the word "I" in the first place of the vocabulary, its final embedding is just the first row of embed_matrix, which is the first column of $W$ as we discussed above.
Alright, that's Word2Vec. Quite fun, and great practice for getting started with TensorFlow!
Reference
- Noise-contrastive estimation: http://proceedings.mlr.press/v9/gutmann10a/gutmann10a.pdf
- Word2Vec TensorFlow Tutorial: https://github.com/tensorflow/tensorflow/tree/r1.2/tensorflow/examples/tutorials/word2vec