
Reproducing Word2Vec

2024-12-14 21:01

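1. Task

The original papers are Distributed Representations of Words and Phrases and their Compositionality (2013) and Efficient Estimation of Word Representations in Vector Space (2013). Our goal is to train a word embedding matrix from each word and the context words to its left and right. [Note: the core idea of Word2Vec is not that its neural network architecture is especially good, but that the word embedding matrix it learns is good. For the concept of a word embedding matrix, see the previous post on NNLM.]

Word2Vec has two training modes, a fill-in-the-blank (cloze) style and a free-association style, known in the literature as CBOW and Skip-gram respectively.

Figure: the two Word2Vec modes

In CBOW, the "?" is the center word and the words on either side of it are the left and right context words; the model predicts the center word from the context. In Skip-gram, "mom" is the center word and the "?" on either side of it are the context words; the model predicts the context from the center word.

Define the sliding window size as how far the window extends to the left and to the right of the center word, with no extension allowed past the sentence boundary. Denote it window_size. For instance, in the CBOW example with "?" as the center word and window_size = 1, the window contains the 3 words "and ? got".

In other words, for window_size = 4 the center word can start no earlier than "created" and cannot reach "heavens", otherwise the window would run past the right boundary.

Let C be the word embedding matrix, randomly initialized, with m dimensions per word embedding and n distinct words in total. Below I train this matrix C in both the Skip-gram and CBOW modes.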
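
The no-out-of-bounds rule can be sketched in a few lines of Python. The helper name `valid_centers` and the sample sentence are assumptions made for illustration; the sentence in the original figure is not reproduced in the text.

```python
def valid_centers(words, window_size):
    """Return the indices whose full window [i - window_size, i + window_size] stays in bounds."""
    return list(range(window_size, len(words) - window_size))


# Hypothetical sample sentence, chosen only for illustration.
words = "In the beginning God created the heavens and the earth".split()

window_size = 4
for i in valid_centers(words, window_size):
    window = words[i - window_size : i + window_size + 1]
    print(f"center = {words[i]!r}, window = {window}")
```

With this sample sentence and window_size = 4, only two positions qualify as center words; a smaller window_size leaves more positions available.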

$C=\left[\begin{matrix} & {\rm dim}_1 & {\rm dim}_2 & \cdots & {\rm dim}_m \\ {\rm word}_1 & 0.28376 & 0.2718 & \cdots & 0.1234 \\ {\rm word}_2 & 0.9813 & 0.14711 & \cdots & -0.783 \\ {\rm word}_3 & 0.0814 & -0.456912 & \cdots & 0.033 \\ {\rm word}_4 & -0.415 & 0.783 & \cdots & -0.84512 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ {\rm word}_{n} & 0.9876 & 0.7326 & \cdots & -0.496 \\ \end{matrix}\right]_{n\times m}$

2. Skip-gram training mode

Suppose we have a string s whose content is capitals and countries.

s = "Athens Greece Beijing China Athens Greece Berlin Germany"

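Set the word vector dimension m = 2 and the sliding window size window_size = 1, and build a vocabulary index table in alphabetical order, which gives n = 6. We will end up with a 6 x 2 word embedding matrix C.

Word	Athens	Beijing	Berlin	China	Germany	Greece
Index	0	1	2	3	4	5

We then train the word embedding matrix. The ideal outcome is this: in a well-trained embedding matrix, the relationship from the Athens vector to the Greece vector should be equivalent to that from the Beijing vector to the China vector, and to that from the Berlin vector to the Germany vector.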
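
The vocabulary index table can be built directly from s. This is a minimal sketch; the names `vocab`, `word_to_index` and `index_to_word` are chosen here for illustration.

```python
s = "Athens Greece Beijing China Athens Greece Berlin Germany"

# Sort the distinct words alphabetically, then number them from 0.
vocab = sorted(set(s.split()))
word_to_index = {word: index for index, word in enumerate(vocab)}
index_to_word = {index: word for word, index in word_to_index.items()}

n = len(vocab)  # vocabulary size
print(n)
print(word_to_index)
```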

$C=\left[\begin{matrix} & {\rm dim}_1 & {\rm dim}_2 \\ {\rm Athens} & c_{11} & c_{12} \\ {\rm Beijing} & c_{21} & c_{22} \\ {\rm Berlin} & c_{31} & c_{32} \\ {\rm China} & c_{41} & c_{42} \\ {\rm Germany} & c_{51} & c_{52} \\ {\rm Greece} & c_{61} & c_{62} \\ \end{matrix}\right]_{6\times 2}$

$<[c_{11},c_{12}],\ [c_{61},c_{62}]>\Longleftrightarrow<[c_{21},c_{22}],\ [c_{41},c_{42}]>\Longleftrightarrow<[c_{31},c_{32}],\ [c_{51},c_{52}]>$

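Since window_size = 1, each window spans 3 words. Sliding it along s one word at a time yields the 6 substrings below; the center word of each window (marked in red in the original) is the middle word. Pairing each center word with its context words gives 12 tuples, each written as (center, context).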

"Athens Greece Beijing" ==> (Greece, Athens), (Greece, Beijing)
"Greece Beijing China" ==> (Beijing, Greece), (Beijing, China)
"Beijing China Athens" ==> (China, Beijing), (China, Athens)
"China Athens Greece" ==> (Athens, China), (Athens, Greece)
"Athens Greece Berlin" ==> (Greece, Athens), (Greece, Berlin)
"Greece Berlin Germany" ==> (Berlin, Greece), (Berlin, Germany)

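Now look at the center word "Greece". Among the 12 tuples, Greece as the center word is paired with Athens 2 times, and with Beijing and Berlin 1 time each. If the order within a pair is ignored, Greece and Athens are paired 3 times, Greece and Beijing 2 times, and Greece and Berlin 2 times.

The implication is that if s were much longer and covered all of human natural language, the probability of Greece being paired with Athens would be even higher. This is the core idea behind how Word2Vec trains the word embedding matrix: probability.

Similarly, with window_size = 2, sliding the window one word at a time yields 4 substrings and 16 (center, context) tuples.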
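
The counts for "Greece" can be checked with a short script. This is only a sketch: the 12 tuples are copied from the listing above, and the variable names are illustrative.

```python
from collections import Counter

# The 12 (center, context) tuples for window_size = 1.
pairs = [
    ("Greece", "Athens"), ("Greece", "Beijing"),
    ("Beijing", "Greece"), ("Beijing", "China"),
    ("China", "Beijing"), ("China", "Athens"),
    ("Athens", "China"), ("Athens", "Greece"),
    ("Greece", "Athens"), ("Greece", "Berlin"),
    ("Berlin", "Greece"), ("Berlin", "Germany"),
]

# Ordered counts: context words seen when "Greece" is the center word.
as_center = Counter(context for center, context in pairs if center == "Greece")

# Unordered counts: (a, b) and (b, a) are treated as the same pairing.
unordered = Counter(frozenset(pair) for pair in pairs if "Greece" in pair)

# Empirical estimate of P(context | center = "Greece").
total = sum(as_center.values())
prob = {word: count / total for word, count in as_center.items()}

print(as_center)
print(prob)
```

On this tiny corpus, Athens already takes half of the probability mass for the center word Greece, which is the signal the training procedure is meant to capture.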

"Athens Greece Beijing China Athens" ==> (Beijing, Athens), (Beijing, Greece), (Beijing, China), (Beijing, Athens)
"Greece Beijing China Athens Greece" ==> (China, Greece), (China, Beijing), (China, Athens), (China, Greece)
"Beijing China Athens Greece Berlin" ==> (Athens, Beijing), (Athens, China), (Athens, Greece), (Athens, Berlin)
"China Athens Greece Berlin Germany" ==> (Greece, China), (Greece, Athens), (Greece, Berlin), (Greece, Germany)

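The tuple listings for both window sizes can be generated mechanically. This is a minimal sketch; the function name `skipgram_pairs` is an assumption of this example, not something from the papers or from PyTorch.

```python
def skipgram_pairs(words, window_size):
    """Slide a full window over `words` and pair each center word with every context word."""
    pairs = []
    for i in range(window_size, len(words) - window_size):
        for j in range(i - window_size, i + window_size + 1):
            if j != i:  # skip the center word itself
                pairs.append((words[i], words[j]))
    return pairs


s = "Athens Greece Beijing China Athens Greece Berlin Germany"
words = s.split()

print(len(skipgram_pairs(words, 1)))
print(len(skipgram_pairs(words, 2)))
print(skipgram_pairs(words, 1)[:2])
```

Only positions whose whole window fits inside the sentence are used as center words, matching the no-out-of-bounds rule from section 1.
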
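For convenience, the explanation continues with window_size = 1. Replace the words in the 12 tuples with their indices from the vocabulary index table to make them numeric, and put them in one big list L.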

L = [(5, 0), (5, 1), (1, 5), (1, 3), (3, 1), (3, 0), (0, 3), (0, 5), (5, 0), (5, 2), (2, 5), (2, 4)]

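Each element of L is a tuple whose first element is the center word index and whose second element is the context word index. Take the first element of every tuple in L, One-Hot encode it, and stack the results into a matrix x with n = 6 columns.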

$[5,5,1,1,3,3,0,0,5,5,2,2]_{1\times12}\implies x=\left[\begin{matrix} 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ \end{matrix}\right]_{12\times6}$

Then take the second element of every tuple in L to form a vector, denoted target.

${\rm target}=[0,1,5,3,1,0,3,5,0,2,5,4]_{1\times12}$
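
The split of L into x and target can be sketched in plain Python, which keeps every step visible. The variable `centers` is introduced here for illustration.

```python
L = [(5, 0), (5, 1), (1, 5), (1, 3), (3, 1), (3, 0),
     (0, 3), (0, 5), (5, 0), (5, 2), (2, 5), (2, 4)]
n = 6  # vocabulary size

centers = [center for center, _ in L]   # first element of every tuple
target = [context for _, context in L]  # second element of every tuple

# One row per tuple, with a single 1 in the column of the center word.
x = [[1 if column == center else 0 for column in range(n)] for center in centers]

for row in x:
    print(row)
print(target)
```

In PyTorch the same one-hot matrix can be obtained with `torch.nn.functional.one_hot`, followed by a cast to float before it is multiplied by C.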

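In a real training task there are far more than 12 tuples, usually thousands upon thousands, and the vocabulary size n is far larger than 6, so the matrix x and the vector target become enormous. CPU or GPU memory cannot hold such a large x or target at once, so training in mini-batches is recommended, for example by drawing a different set of batch_size tuples at each step.

Word2Vec does not care much about the network architecture; it only cares about a good word embedding matrix C. So here I build a very simple feed-forward (BP) neural network to learn C.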
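
Mini-batching can be sketched as follows. The generator name `iterate_batches` and its `seed` parameter are assumptions of this example, and the 12 tuples stand in for a much larger list.

```python
import random


def iterate_batches(pairs, batch_size, seed=0):
    """Yield shuffled mini-batches of (center, context) index tuples."""
    order = list(pairs)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(order)  # a different seed per epoch gives different batches
    for start in range(0, len(order), batch_size):
        yield order[start : start + batch_size]


L = [(5, 0), (5, 1), (1, 5), (1, 3), (3, 1), (3, 0),
     (0, 3), (0, 5), (5, 0), (5, 2), (2, 5), (2, 4)]

batches = list(iterate_batches(L, batch_size=5))
print([len(batch) for batch in batches])
```

Each batch is then turned into its own small x and target. In a PyTorch project, `torch.utils.data.DataLoader` with `shuffle=True` plays the same role.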

import torch

class SkipGram(torch.nn.Module):
    def __init__(self, n: int, m: int, *args, **kwargs) -> None:
        """
        Args:
            - n: size of the vocabulary index table
            - m: word vector dimension, i.e. the number of attributes per word
        """
        super().__init__(*args, **kwargs)
        self.C = torch.nn.Parameter(torch.rand(n, m))
        self.linear1 = torch.nn.Linear(m, 3 * m, bias=False)
        self.linear2 = torch.nn.Linear(3 * m, m ** 2, bias=False)
        self.linear3 = torch.nn.Linear(m ** 2, n, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x is the One-Hot matrix as a float Tensor; multiplying it by the parameter matrix self.C selects one row of C per sample
        return self.linear3(self.linear2(self.linear1(torch.matmul(x, self.C))))

if __name__ == '__main__':
    n = 6   # size of the vocabulary index table
    m = 2   # number of dimensions used to represent a word
    epoch = 10000

    x = torch.Tensor([
        [0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 1],
        [0, 1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [1, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 1],
        [0, 0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0, 0]
    ])
    target = torch.LongTensor([0, 1, 5, 3, 1, 0, 3, 5, 0, 2, 5, 4])

    model = SkipGram(n=n, m=m)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(epoch):
        optimizer.zero_grad()
        y = model(x)
        # Predict the context word from the center word, compare with the true context in target, and use the loss to update the embedding matrix C.
        loss = criterion(y, target)
        if step % 25 == 0:
            print(f'[{step} / {epoch}]\tloss = {loss.item():.7f}')
        loss.backward()
        optimizer.step()

    C, linear1, linear2, linear3 = model.parameters()
    C = C.tolist()
    print(f'\nC = {C}')
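
In formula form, the loop above minimizes the average cross-entropy over the N = 12 tuples. The symbols below are notation introduced here for illustration: $C_{c_i}$ is the row of C for the center word $c_i$ of the i-th tuple, $o_i$ is its context word, and $W_1, W_2, W_3$ are the weights of the three linear layers (PyTorch stores them as out × in, hence the transposes).

$z^{(i)}=C_{c_i}W_1^{\top}W_2^{\top}W_3^{\top},\qquad P(o_i\mid c_i)=\frac{\exp\left(z^{(i)}_{o_i}\right)}{\sum_{j=1}^{n}\exp\left(z^{(i)}_{j}\right)},\qquad {\rm loss}=-\frac{1}{N}\sum_{i=1}^{N}\log P(o_i\mid c_i)$

Because no activation is used between the layers, the product $W_1^{\top}W_2^{\top}W_3^{\top}$ acts as a single linear map from the m-dimensional embedding to the n scores.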

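Another way to write this in PyTorch uses an embedding layer (see the word feature layer activation vector in the previous post on NNLM) and does not perform One-Hot encoding explicitly.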

import torch

class SkipGram(torch.nn.Module):
    def __init__(self, n: int, m: int, *args, **kwargs) -> None:
        """
        Args:
            - n: size of the vocabulary index table
            - m: word vector dimension, i.e. the number of attributes per word
        """
        super().__init__(*args, **kwargs)
        self.C = torch.nn.Embedding(n, m)
        self.linear1 = torch.nn.Linear(m, 3 * m, bias=False)
        self.linear2 = torch.nn.Linear(3 * m, m ** 2, bias=False)
        self.linear3 = torch.nn.Linear(m ** 2, n, bias=False)

    def forward(self, x: torch.LongTensor) -> torch.Tensor:
        # No explicit One-Hot encoding: x is an integer LongTensor of word indices passed straight to the embedding layer self.C
        return self.linear3(self.linear2(self.linear1(self.C(x))))

if __name__ == '__main__':
    n = 6   # size of the vocabulary index table
    m = 2   # number of dimensions used to represent a word
    epoch = 10000

    x = torch.LongTensor([5, 5, 1, 1, 3, 3, 0, 0, 5, 5, 2, 2])
    target = torch.LongTensor([0, 1, 5, 3, 1, 0, 3, 5, 0, 2, 5, 4])

    model = SkipGram(n=n, m=m)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(epoch):
        optimizer.zero_grad()
        y = model(x)
        # Predict the context word from the center word, compare with the true context in target, and use the loss to update the embedding matrix C.
        loss = criterion(y, target)
        if step % 25 == 0:
            print(f'[{step} / {epoch}]\tloss = {loss.item():.7f}')
        loss.backward()
        optimizer.step()

    C, linear1, linear2, linear3 = model.parameters()
    C = C.tolist()
    print(f'\nC = {C}')
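
The two versions compute the same thing because multiplying a One-Hot row by C simply picks out one row of C. Here is a minimal plain-Python check; the matrix values are made up for illustration and the helper names are assumptions of this example.

```python
# Hypothetical 6 x 2 embedding matrix; the values are made up for illustration.
C = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0], [1.1, 1.2]]


def one_hot(index, n):
    """Return a length-n row with a single 1.0 at position `index`."""
    return [1.0 if column == index else 0.0 for column in range(n)]


def vector_times_matrix(vector, matrix):
    """Compute the row vector `vector @ matrix` using plain Python lists."""
    columns = len(matrix[0])
    return [sum(v * row[k] for v, row in zip(vector, matrix)) for k in range(columns)]


index = 5  # Greece
via_matmul = vector_times_matrix(one_hot(index, len(C)), C)
via_lookup = C[index]
print(via_matmul, via_lookup)
```

The lookup form avoids building the 12 x 6 One-Hot matrix, which matters once n grows to a realistic vocabulary size.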

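Both versions work, and either one yields the final word embedding matrix C.

At this point the loss has essentially converged and no longer changes, so the trained matrix C has reached a steady state.

Since the word vector dimension is m = 2, we plot the 6 word vectors of the trained embedding matrix C, in index order, in a 2D coordinate system.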

import matplotlib.pyplot as plt

plt.scatter(C[0][0], C[0][1], marker=',', s=100, label='Athens')
plt.scatter(C[1][0], C[1][1], marker='o', s=100, label='Beijing')
plt.scatter(C[2][0], C[2][1], marker='x', s=100, label='Berlin')
plt.scatter(C[3][0], C[3][1], marker='*', s=100, label='China')
plt.scatter(C[4][0], C[4][1], marker='D', s=100, label='Germany')
plt.scatter(C[5][0], C[5][1], marker='>', s=100, label='Greece')

plt.plot([C[1][0], C[3][0]], [C[1][1], C[3][1]], linestyle='--')
plt.plot([C[0][0], C[5][0]], [C[0][1], C[5][1]], linestyle='-.')
plt.plot([C[2][0], C[4][0]], [C[2][1], C[4][1]], linestyle='-')

plt.legend()
plt.show()

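Figure: Greece to Athens ≈ China to Beijing ≈ Germany to Berlin

Looking at the plot, the distances between each capital's vector and its country's vector are roughly the same, so the embedding matrix from this run is good and can be used for downstream tasks. Note that if you copy my code you may get a poor result; that is quite normal, since there are only 12 paired tuples.

3. CBOW training mode

Whereas the skip-gram model predicts the context words in the sliding window from the center word, the continuous bag-of-words model does exactly the opposite. Like a cloze exercise in English homework, CBOW predicts the center word from the context words in the sliding window.

Again the capital-country example: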
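
Instead of judging the plot by eye, the capital-to-country relationship can be checked numerically. In the sketch below the embedding values are made up for illustration and are not results from this post.

```python
import math

# Hypothetical trained embeddings (made-up numbers, not results from the post).
C = {
    "Athens": [0.9, 0.1], "Greece": [0.1, 0.9],
    "Beijing": [0.8, -0.3], "China": [0.0, 0.5],
    "Berlin": [1.0, 0.4], "Germany": [0.2, 1.2],
}
pairs = [("Athens", "Greece"), ("Beijing", "China"), ("Berlin", "Germany")]

# Offset vector (country minus capital) and Euclidean distance for each pair.
offsets = [[b - a for a, b in zip(C[capital], C[country])] for capital, country in pairs]
distances = [math.dist(C[capital], C[country]) for capital, country in pairs]

for (capital, country), offset, distance in zip(pairs, offsets, distances):
    rounded = [round(value, 3) for value in offset]
    print(f"{capital} -> {country}: offset = {rounded}, distance = {distance:.4f}")
```

Equal distances alone do not guarantee that the three offsets point the same way, so comparing the offset vectors is the stricter test.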

s = "Athens Greece Beijing China Athens Greece Berlin Germany"

This time set window_size = 2. Sliding the window one word at a time yields 4 substrings and 4 large tuples, each pairing a group of context words with a center word.

"Athens Greece Beijing China Athens" ==> ([Athens, Greece, China, Athens], Beijing)
"Greece Beijing China Athens Greece" ==> ([Greece, Beijing, Athens, Greece], China)
"Beijing China Athens Greece Berlin" ==> ([Beijing, China, Greece, Berlin], Athens)
"China Athens Greece Berlin Germany" ==> ([China, Athens, Berlin, Germany], Greece)

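Skip-gram predicts the context words (black in the original) from the center word (red in the original), while CBOW predicts the center word from the group of context words. Converting the tuples to indices with the vocabulary index table gives the list L: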

L = [([0, 5, 3, 0], 1), ([5, 1, 0, 5], 3), ([1, 3, 5, 2], 0), ([3, 0, 2, 4], 5)]
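
The same list can be produced from s programmatically. This is a minimal sketch; the function name `cbow_samples` is an assumption of this example.

```python
def cbow_samples(words, window_size, word_to_index):
    """Return ([context indices], center index) for every full window in `words`."""
    samples = []
    for i in range(window_size, len(words) - window_size):
        # Left context followed by right context; the center word is left out.
        context = words[i - window_size : i] + words[i + 1 : i + window_size + 1]
        samples.append(([word_to_index[w] for w in context], word_to_index[words[i]]))
    return samples


s = "Athens Greece Beijing China Athens Greece Berlin Germany"
words = s.split()
word_to_index = {word: index for index, word in enumerate(sorted(set(words)))}

L = cbow_samples(words, window_size=2, word_to_index=word_to_index)
print(L)
```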

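For Skip-gram we built a plain BP neural network to train the embedding matrix. For CBOW let's try something a little different and, following the official PyTorch tutorial, design the architecture below: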

import torch

class CBOW(torch.nn.Module):
    def __init__(self, n: int, m: int, window_size: int, *args, **kwargs) -> None:
        """
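        Args:
            - n: size of the vocabulary index table
            - m: word vector dimension
            - window_size: number of context words on each side of the center word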
        """
        super().__init__(*args, **kwargs)
        self.C = torch.nn.Embedding(n, m)
        self.hidden_layer = torch.nn.Linear(m * window_size * 2, 128, bias=False)
        self.output_layer = torch.nn.Linear(128, n, bias=False)

    def forward(self, x: torch.LongTensor) -> torch.Tensor:
        """
        Design based on the official PyTorch tutorial: https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#getting-dense-word-embeddings
        """
        y = self.C(x).view(1, -1)
        y = torch.nn.functional.relu(self.hidden_layer(y))
        y = torch.nn.functional.log_softmax(self.output_layer(y), dim=1)
        return y

Next comes the most important step in Word2Vec, training the word embedding matrix C:

import matplotlib.pyplot as plt

n = 6
m = 2
window_size = 2
epoch = 10000

L = [
    ([0, 5, 3, 0], 1),
    ([5, 1, 0, 5], 3),
    ([1, 3, 5, 2], 0),
    ([3, 0, 2, 4], 5)
]

model = CBOW(n, m, window_size)
# forward() already returns log-probabilities (log_softmax), so NLLLoss is the matching criterion.
criterion = torch.nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(epoch):
    loss = 0
    for x, target in L:
        x = torch.LongTensor(x)
        target = torch.LongTensor([target])
        y = model(x)
        # Predict the center word from the context words, compare with the true center word in target, and use the loss to update the embedding matrix C.
        loss = loss + criterion(y, target)
    if step % 25 == 0:
        print(f'[{step} / {epoch}]\tloss = {loss.item():.7f}')
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

C, hidden, output = model.parameters()
C = C.tolist()
print(f'\nC = {C}')

plt.scatter(C[0][0], C[0][1], marker=',', s=100, label='Athens')
plt.scatter(C[1][0], C[1][1], marker='o', s=100, label='Beijing')
plt.scatter(C[2][0], C[2][1], marker='x', s=100, label='Berlin')
plt.scatter(C[3][0], C[3][1], marker='*', s=100, label='China')
plt.scatter(C[4][0], C[4][1], marker='D', s=100, label='Germany')
plt.scatter(C[5][0], C[5][1], marker='>', s=100, label='Greece')

plt.plot([C[1][0], C[3][0]], [C[1][1], C[3][1]], linestyle='--')
plt.plot([C[0][0], C[5][0]], [C[0][1], C[5][1]], linestyle='-.')
plt.plot([C[2][0], C[4][0]], [C[2][1], C[4][1]], linestyle='-')

plt.legend()
plt.show()

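Figure: the CBOW word embedding matrix I obtained after several test runs

Looking at the plot, the distances between word vectors still follow the pattern "Greece to Athens ≈ China to Beijing ≈ Germany to Berlin", which means the matrix C is what we hoped for.

4. Training a word embedding matrix on the Word2Vec questions-words dataset in Skip-gram mode

Click here to download the project code (72 KB)

Figure: after training on the capital-country dataset, every prediction is correct. Perfect!

Looking at the distances between word vectors, the result is excellent; this embedding matrix trained remarkably well!

5. Some thoughts

This paper seems to be searching for an invariant in neural networks? The usual approach is to propose a brand-new architecture and beat older models on some dataset, but Word2Vec takes a different path: it looks for a kind of order, a well-trained invariant. No wonder the authors got into top venues. I'm envious! ξ( ✿＞◡❛)

Visualizing word embeddings: https://ronxin.github.io/wevi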
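
The project code itself is not shown in this post. As a rough sketch of the data-loading step, the parser below assumes the usual questions-words.txt layout, where a line starting with ":" opens a section and every other line holds four words "a b c d". The function name `parse_questions_words` and the inline sample are assumptions of this example.

```python
def parse_questions_words(text):
    """Group analogy questions (a, b, c, d) under their ': section' headers."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()  # section name, e.g. capital-common-countries
            sections[current] = []
        elif current is not None:
            sections[current].append(tuple(line.split()))
    return sections


# Small inline sample in the assumed layout, used instead of the real file.
sample = """: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Beijing China
"""

sections = parse_questions_words(sample)
print(sections)
```

Each four-word line can then be flattened into a word sequence and fed to the same window-based pair generation used in section 2.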