In the last posts, we investigated LSA and LDA for classifying documents. Their outputs are embeddings of terms and documents, with which we can classify terms and documents using clustering techniques. LSA and LDA both belong to the count-based family of Vector Space Models (VSMs). VSMs have a long, rich history in NLP, but all methods depend in some way on the Distributional Hypothesis, which states that words that appear in the same contexts share semantic meaning. This post introduces a predictive family of VSMs: Word2Vec. Predictive models directly try to predict a word from its neighbors in terms of learned small, dense embedding vectors (vector representations of words).
One-hot Vector
For all sentences, we'd like to represent each word of a sentence as a vector. The simplest way is the one-hot vector: a vector with a single 1 at the word's index and 0 everywhere else. More specifically, for thousands of sentences, we tokenize each sentence into words and put all the words into a set, which gives us the vocabulary. Visualized as follows:
Note that the vocabulary has an order; once set, the index of a word does not change. Take the sentence 'I love salad'. 'I' is at the first place of the vocabulary, so the initial embedding (one-hot) vector of 'I' is $[1, 0, 0, \dots, 0]^T$. Similarly, the one-hot vectors of 'love' and 'salad' each have a single 1 at their own vocabulary index. The length of each vector is the vocabulary size $V$. So the embedding of the sentence is a $V \times 3$ matrix, where the 3 is the length of the sentence.
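As a concrete sketch, with a tiny made-up vocabulary (so the indices are only illustrative), the one-hot encoding of 'I love salad' could be built like this:

import numpy as np

# hypothetical toy vocabulary; in practice it is built from thousands of sentences
vocab = ['I', 'love', 'salad', 'pizza', 'we']
word2idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

def one_hot(word):
    """Return a length-V vector with a single 1 at the word's index."""
    vec = np.zeros(V, dtype=np.int32)
    vec[word2idx[word]] = 1
    return vec

sentence = 'I love salad'.split()
# stack the one-hot columns: a V x 3 matrix, 3 being the sentence length
sentence_matrix = np.stack([one_hot(w) for w in sentence], axis=1)
print(sentence_matrix.shape)   # (5, 3)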
In fact, when we actually code the model, we don't need to translate every word into this one-hot vector; that costs time and memory. When an embedding matrix $W$ of dimension $4 \times V$ is right-multiplied by the one-hot vector of a word, say the $i$-th word in the vocabulary, we just get the $i$-th column of $W$. Thus, we normally embed the sentence 'I love salad' simply as the list of its three word indices in the vocabulary. The 4 in the dimension $4 \times V$ is the hidden_size (how many neurons we use in the hidden layer of the neural network), which we will talk about later.
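A minimal NumPy check of this equivalence (the matrix values below are random placeholders, not learned weights):

import numpy as np

V, hidden_size = 5, 4
W = np.random.rand(hidden_size, V)       # hypothetical 4 x V embedding matrix

i = 0                                    # index of 'I' in the toy vocabulary above
x = np.zeros(V); x[i] = 1                # one-hot vector of 'I'

# right-multiplying by the one-hot vector just selects the i-th column of W
assert np.allclose(W @ x, W[:, i])

# so in code we only store the word indices
sentence_indices = [0, 1, 2]             # 'I love salad' under the toy vocabulary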
Word2Vec
Word2Vec is a shallow, two-layer neural network that analyzes raw text and produces a vector space of words. There are two architectures: continuous bag-of-words (CBOW) and continuous skip-gram. In the CBOW architecture, the model predicts the current word from a window of surrounding context words, while in the skip-gram architecture, the model uses the current word to predict the surrounding window of context words (Word2Vec_wiki). The distinction can be visualized as below:
Note, the skip-gram shown above is the 0-skip-2-gram model, where only the two direct neighbors on each side are considered. We could also produce 1-skip-2-grams, which would additionally include ("lunch", "salad"), ("lunch", "would"), ("as", "I"), ("as", "have"), etc., besides the 0-skip pairs above (Word2Vec_tutorial).
Now, imagine a model where we want to predict the surrounding words given a center word. In the sentence 'I love salad', the word 'I' produces the pairs ('I', 'love') and ('I', 'salad'). Across thousands of sentences, we may get tens of thousands of pairs or more. In fact, 'I love salad' alone produces 6 pairs under the 0-skip-2-gram model. This can be treated as a classification problem, with a word as input and the vocabulary-size set of target words as classes. The values in the output vector then represent the probability that each word appears near the input word. In this model, we set up one hidden layer with 4 neurons (hidden size is 4). A larger hidden size generally works better, but I prefer 4 for the convenience of this post.
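To make the 6 pairs concrete, here is a small sketch that simply enumerates all skip-gram pairs within a window of 2 (no sampling; the implementation later in this post samples pairs instead):

def skip_gram_pairs(words, window=2):
    """Enumerate (center, context) pairs within the given window."""
    pairs = []
    for i, center in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

print(skip_gram_pairs('I love salad'.split()))
# [('I', 'love'), ('I', 'salad'), ('love', 'I'), ('love', 'salad'),
#  ('salad', 'I'), ('salad', 'love')]   -> 6 pairs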
There is a parameter matrix and an activation function that map the inputs to the hidden states. Assume the embedding matrix $W$ is the $4 \times V$ one I gave in the last section and the activation function producing the output is the softmax. For a one-hot input $x$, the hidden state is $h = Wx$, i.e. the column of $W$ corresponding to the input word, so the first neuron's state is
$h_1 = \sum_{k=1}^{V} W_{1k}\, x_k .$
The calculation from the hidden layer to the output layer has the same form, with a second weight matrix $W'$ followed by the softmax, which gives a vector of probabilities over the vocabulary:
$\hat{y} = \mathrm{softmax}(W' h).$
All the pairs are used to train the model to minimize the loss between the output vectors and the actual targets. Ultimately, we get an optimized embedding matrix $W$. Then, using this embedding matrix, we can predict the words around a given center word. The vector picked out of $W$ by the center word is in fact the embedding of that word, since it is unique for each word and captures the word's semantics.
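A minimal NumPy sketch of this forward pass (random matrices stand in for the learned weights; V = 5 and hidden size 4 are just the toy values used above):

import numpy as np

V, hidden_size = 5, 4
W  = np.random.rand(hidden_size, V)   # input-to-hidden embedding matrix (to be learned)
W2 = np.random.rand(V, hidden_size)   # hidden-to-output matrix (to be learned)

x = np.zeros(V); x[0] = 1             # one-hot vector of the center word 'I'

h = W @ x                             # hidden state = the column of W for 'I'
logits = W2 @ h
y_hat = np.exp(logits) / np.exp(logits).sum()   # softmax: probabilities over context words
print(y_hat.round(3), y_hat.sum())    # probabilities sum to 1.0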
This is Word2Vec, which can be used to embed words and represent the relationships among them. It is indeed just a shallow neural network.
Implementation in Python and TensorFlow
- Batch data
In this post, I write the code in Python and build the graph with TensorFlow. First, I construct a class in "batch.py", with a method that returns one batch of (center, target) pairs at a time. This makes it more convenient to feed the data into TensorFlow.
The corresponding github repo is here!!!
"""This file is called batch.py, which will be imported in word2vec.py"""
import numpy as np
class batch:
def __init__(self, centers, targets, batch_size):
self.centers = centers
self.targets = targets
self.batch_size = batch_size
def next_(self, batch_group):
batch_index = self.batch_size * batch_group
s = batch_index % len(self.centers)
e = (s + self.batch_size)%len(self.centers)
if s < e :
centers_bp = np.reshape(self.centers[s: e],[self.batch_size])
targets_bp = np.reshape(self.targets[s: e], [self.batch_size, 1])
else:
centers_bp = np.reshape(self.centers[:e] + self.centers[s:],[self.batch_size])
targets_bp = np.reshape(self.targets[:e] + self.targets[s:], [self.batch_size, 1])
return centers_bp, targets_bp
if __name__ == '__main__':
batch_ = batch()
batch_.next_()
- Word2Vec
Now we can focus on the Word2Vec implementation. Note that 'data' in the input list of the 'word2vec' function is a list of sentences, with each sentence represented as a list of word indices, like [[1,2,3], [2,3,4], ...]. This code is based on Google's TensorFlow tutorial on Word2Vec, simplified to the extent that I need. More specifically, I split the batching and the skip-pair generation into two separate pieces, which makes them easier for me to understand. I strongly suggest you read the official code on your own; it is super helpful (TF-tutorial).
The corresponding github repo is here!!!
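To be explicit about the expected input format, here is a toy illustration (the corpus and the resulting indices are made up, only to show the shape of the 'data' argument):

# hypothetical corpus and vocabulary, only to illustrate the 'data' format
corpus = ['I love salad', 'we love pizza']
vocab = sorted({w for s in corpus for w in s.split()})
word2idx = {w: i for i, w in enumerate(vocab)}

data = [[word2idx[w] for w in s.split()] for s in corpus]
print(data)   # [[0, 1, 3], [4, 1, 2]] -- a list of sentences as index lists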
# skip_window: how far we look from the center word, on each side, for context labels.
# num_skips: how many (center, context) pairs we produce for each center word.
# batch_size: how many pairs are processed at each step.
# vocab_size: remember we initially embed each word as a one-hot vector; its dimension is vocab_size.
# hidden_size: how many neurons we set up in the hidden layer.
# n_epochs: how many training iterations we'd like to run.
# num_sampled: number of negative samples; we use noise-contrastive estimation (NCE) to ease the computation.
# I am using TensorFlow 1.0 and Python 3.5
import collections
import math
import random

import numpy as np
import tensorflow as tf

from batch import batch  # remember the batch.py we defined above?

# define word2vec function
def word2vec(data, num_skips, skip_window,
             batch_size, hidden_size, vocab_size,
             learning_rate, n_epochs, num_sampled):

    # generate (center, label) pairs for one sentence; here 'data' -> one sentence (a list of word indices)
    def generate_skip(data):
        data_index = 0
        size = 2 * skip_window * len(data)
        assert num_skips <= 2 * skip_window
        centers = np.ndarray(shape=(size), dtype=np.int32)
        labels = np.ndarray(shape=(size, 1), dtype=np.int32)
        span = 2 * skip_window + 1  # [structure: skip_window, target, skip_window]
        buffer = collections.deque(maxlen=span)
        for _ in range(span):
            buffer.append(data[data_index])
            data_index = (data_index + 1) % len(data)
        for i in range(size // num_skips):
            target = skip_window  # the center word sits in the middle of the buffer
            targets_to_avoid = [skip_window]
            for j in range(num_skips):
                # sample a context position in the window that has not been used yet
                while target in targets_to_avoid:
                    target = random.randint(0, span - 1)
                targets_to_avoid.append(target)
                centers[i * num_skips + j] = buffer[skip_window]
                labels[i * num_skips + j, 0] = buffer[target]
            buffer.append(data[data_index])
            data_index = (data_index + 1) % len(data)  # move one word forward
        return centers, labels

    # produce skip pairs for each sentence, then concatenate
    centers = []
    targets = []
    for sequence in data:
        centers_element, targets_element = generate_skip(sequence)
        centers.append(centers_element)
        targets.append(targets_element)
    centers = list(np.concatenate(centers))
    targets = list(np.concatenate(targets))

    """TensorFlow infrastructure"""
    # Step 1: define the placeholders for input and output
    with tf.name_scope("batch_data"):
        center_words = tf.placeholder(tf.int32, shape=[batch_size], name='center_words')
        target_words = tf.placeholder(tf.int32, shape=[batch_size, 1], name='target_words')

    # Step 2: define weights. In word2vec, the embedding matrix is the weight matrix
    with tf.name_scope("embed"):
        embed_matrix = tf.Variable(tf.random_uniform([vocab_size, hidden_size], -1.0, 1.0),
                                   name="embed_matrix")

    # Step 3 + 4: define inference + loss function
    with tf.name_scope("loss"):
        # look up the rows of embed_matrix for the center words (equivalent to the one-hot multiplication)
        embed = tf.nn.embedding_lookup(embed_matrix, center_words, name="embed")
        nce_weight = tf.Variable(tf.truncated_normal([vocab_size, hidden_size],
                                                     stddev=1.0 / math.sqrt(hidden_size)),
                                 name="nce_weight")
        nce_bias = tf.Variable(tf.zeros([vocab_size]), name="bias")
        loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, biases=nce_bias,
                                             labels=target_words, inputs=embed,
                                             num_sampled=num_sampled, num_classes=vocab_size,
                                             name="loss"))

    # Step 5: define optimizer
    with tf.name_scope("optimizer"):
        optimizer = tf.train.AdagradOptimizer(learning_rate).minimize(loss)

    # Step 6: run the graph
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print('Initialized, now begin Word2Vec!')
        writer = tf.summary.FileWriter('./my_graph', sess.graph)
        batch_ = batch(centers, targets, batch_size)
        for current_epoch in range(n_epochs):
            centers_batch, targets_batch = batch_.next_(current_epoch)
            feed = {center_words: centers_batch, target_words: targets_batch}
            loss_, _ = sess.run([loss, optimizer], feed_dict=feed)
            if (current_epoch + 1) % 10000 == 0:
                print('The loss for {} iteration is {}'.format(current_epoch + 1, loss_))
        embedding = sess.run(embed_matrix)
        writer.close()
    return embedding
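As a usage sketch (all hyperparameter values below are illustrative choices of mine, not the tutorial's settings), training on the toy 'data' from the earlier illustration would look something like this:

# toy corpus from the earlier illustration: sentences as lists of word indices
data = [[0, 1, 3], [4, 1, 2]]
vocab_size = 5

embedding = word2vec(data,
                     num_skips=2, skip_window=1,
                     batch_size=8, hidden_size=4, vocab_size=vocab_size,
                     learning_rate=1.0, n_epochs=10000, num_sampled=2)
print(embedding.shape)   # (vocab_size, hidden_size) = (5, 4)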
Now we have the embed_matrix, of shape vocab_size $\times$ hidden_size, i.e. the transpose of the $4 \times V$ matrix $W$ above. To get the meaningful embedding of a word, you can simply multiply its one-hot vector by this matrix, or equivalently read off the corresponding row. For example, for the word "I" in the first place of the vocabulary, its final embedding is just the first row of embed_matrix, which is the first column of $W$ as we discussed above.
Alright, that's Word2Vec. Quite fun, and great practice for getting started with TensorFlow!
Reference
- Noise-contrastive estimation: http://proceedings.mlr.press/v9/gutmann10a/gutmann10a.pdf
- Word2Vec TensorFlow Tutorial: https://github.com/tensorflow/tensorflow/tree/r1.2/tensorflow/examples/tutorials/word2vec