NNLM: Walkthrough and Reproduction

2024-12-10 13:44

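1. Task

The original paper is A Neural Probabilistic Language Model (2003). Our goal is to predict the n-th word given the preceding n - 1 words.

Take the example from Section 1.1 of the paper. We are given the following two sentences, each containing 7 words.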

the cat is walking in the bedroom
a dog was running in a room

We train a model on these two sentences and then use it to complete the following three sentences, each of which supplies 6 (7 - 1) words.

the cat is running in a ____
a dog is walking in a ____
the dog was walking in the ____

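Of course, the model should not guess blindly. Let us set some conventions that define, subjectively, what counts as a good or a bad prediction.

Ideal predictions ✅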

the cat is running in a room

the cat is running in a bedroom

a dog is walking in a room

a dog is walking in a bedroom

the dog was walking in the room

the dog was walking in the bedroom

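Poor predictions ❌

the cat is running in a walking [wrong part of speech: a noun is expected]

a dog is walking in a cat [semantically wrong: what would it mean for a dog to walk in a cat?]

the dog was walking in the playground [undefined: the word is not in the vocabulary]

the dog was walking in the 操场 [undefined: the word, Chinese for "playground", is not in the vocabulary]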

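2. Algorithm walkthrough

Annotated NNLM network architecture

The architecture has three layers and is a classic design; the pink boxes mark the feature layer, the hidden layer, and the output layer. Going layer by layer, at the feature layer the authors introduce an important idea: "being shared across all the words".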

The function $f$ is a composition of these two mappings ($C$ and $g$), with $C$ being shared across all the words in the context. With each of these two parts are associated some parameters. The parameters of the mapping $C$ are simply the feature vectors themselves, represented by a $|V|\times m$ matrix $C$ whose row $i$ is the feature vector $C(i)$ for word $i$.

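In short, $C$ is a matrix shared by all words. Looking back from 2024 at a paper published around the time I was born, later work distilled this idea into the "word embedding matrix".

An everyday analogy: samples such as Xiao Ming, Xiao Hong, and Xiao Wang have attributes such as height, weight, and blood pressure. Words, likewise, have attributes such as part of speech and meaning. We first define a vocabulary index table and restrict the model to the words in it, so undefined words (such as playground and 操场 above) never appear. The table is a dictionary in which every word maps to a unique index.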

Word	the	a	cat	bedroom	dog	room	is	running	walking	was	in
Index	0	1	2	3	4	5	6	7	8	9	10
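A minimal sketch of this table as a Python dictionary, together with its inverse. The helper `is_known` is a hypothetical name introduced here only to show how out-of-vocabulary words are rejected.

```python
# Vocabulary index table from the text: each word maps to one unique index.
words = ['the', 'a', 'cat', 'bedroom', 'dog', 'room',
         'is', 'running', 'walking', 'was', 'in']
vocabulary_index = {word: idx for idx, word in enumerate(words)}

# Inverse table, used to turn predicted indices back into words.
index_vocabulary = {idx: word for word, idx in vocabulary_index.items()}


def is_known(word: str) -> bool:
    """Return True if the model is allowed to use this word (hypothetical helper)."""
    return word in vocabulary_index


print(vocabulary_index['cat'], index_vocabulary[4])  # 2 dog
print(is_known('playground'), is_known('操场'))       # False False
```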

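Let $|V|$ be the number of entries in the vocabulary index table; here $|V|=11$. Suppose each word has two attributes, part of speech and meaning, so $m=2$. The word embedding matrix $C$ can then be randomly initialized: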

$C=\left[\begin{matrix} & m_1 & m_2 \\ {\rm idx}_0 & -0.123 & 0.31415926 \\ {\rm idx}_1 & 0.28376 & 0.2718 \\ {\rm idx}_2 & 0.9813 & 0.14711 \\ {\rm idx}_3 & 0.0814 & -0.456912 \\ {\rm idx}_4 & -0.415 & 0.783 \\ \vdots & \vdots & \vdots \\ {\rm idx}_{10} & 0.9876 & 0.7326 \\ \end{matrix}\right]_{11\times2}$

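For example, idx = 2 stands for the word cat, whose attribute vector is [0.9813, 0.14711], i.e. cat = [0.9813, 0.14711]. Likewise, idx = 4 is dog, so dog = [-0.415, 0.783].

The authors first convert the n - 1 input word indices into word vectors and concatenate them into one long word features layer activation vector (for a batch of several contexts, these vectors stack into a matrix). Here $w_{t-1}$ denotes the word immediately before the target word $w_t$, and $C(w_{t-1})$ is that word's attribute vector, i.e. its row in $C$.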

$x=[C(w_{t-1}),\ C(w_{t-2}),\ \ldots,\ C(w_{t-n+1})]$

Let us apply this to the example.

the cat is walking in the bedroom -> [0, 2, 6, 8, 10, 0, 3]
a dog was running in a room -> [1, 4, 9, 7, 10, 1, 5]

Since the n-th word is the one to predict, we remove the indices of bedroom and room, then stack the two remaining index vectors to get the index matrix $x$:

$x=\left[\begin{matrix} 0 & 2 & 6 & 8 & 10 & 0 \\ 1 & 4 & 9 & 7 & 10 & 1 \\ \end{matrix}\right]_{2\times6}$

The removed indices of bedroom and room form the target vector:

${\rm target}=[3,\ 5]_{1\times2}$
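The encoding and the context/target split above can be sketched in plain Python. The function name `encode` is an assumption made for this illustration.

```python
words = ['the', 'a', 'cat', 'bedroom', 'dog', 'room',
         'is', 'running', 'walking', 'was', 'in']
vocabulary_index = {word: idx for idx, word in enumerate(words)}


def encode(sentence: str) -> list:
    """Map a whitespace-separated sentence to vocabulary indices."""
    return [vocabulary_index[w] for w in sentence.split()]


sentences = ['the cat is walking in the bedroom',
             'a dog was running in a room']
encoded = [encode(s) for s in sentences]

# The first n - 1 indices are the context, the last one is the target.
x = [ids[:-1] for ids in encoded]
target = [ids[-1] for ids in encoded]

print(x)       # [[0, 2, 6, 8, 10, 0], [1, 4, 9, 7, 10, 1]]
print(target)  # [3, 5]
```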

Next, we "embed" the index matrix $x$ by replacing each index with its attribute vector:

$\left[\begin{matrix} [-0.123,\ 0.31415926] & [0.9813,\ 0.14711] & \cdots & [-0.123,\ 0.31415926] \\ [0.28376,\ 0.2718] & [-0.415,\ 0.783] & \cdots & [0.28376,\ 0.2718] \\ \end{matrix}\right]_{2\times6}$

This notation does not look much like a proper block matrix, so we simply flatten it:

$x=\left[\begin{matrix} -0.123 & 0.31415926 & 0.9813 & 0.14711 & \cdots & -0.123 & 0.31415926 \\ 0.28376 & 0.2718 & -0.415 & 0.783 & \cdots & 0.28376 & 0.2718 \\ \end{matrix}\right]_{2\times12}$

The resulting matrix $x$ is what the authors call the "word features layer activation vector".
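The lookup-then-flatten step can be sketched with `torch.nn.Embedding`. The embedding here is randomly initialized, so its values differ from the numbers in the matrix above; only the shapes and the layout are the point.

```python
import torch

torch.manual_seed(0)  # only to make the random embedding reproducible
V, m = 11, 2
C = torch.nn.Embedding(num_embeddings=V, embedding_dim=m)

idx = torch.tensor([[0, 2, 6, 8, 10, 0],
                    [1, 4, 9, 7, 10, 1]])   # index matrix, shape (2, 6)
embedded = C(idx)                            # one m-vector per index, shape (2, 6, 2)
x = embedded.reshape(idx.shape[0], -1)       # flattened, shape (2, 12)

print(embedded.shape, x.shape)  # torch.Size([2, 6, 2]) torch.Size([2, 12])
# The first m entries of row 0 are the vector of word 0 ("the").
print(torch.equal(x[0, :m], C.weight[0]))    # True
```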

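Continuing layer by layer: in the hidden and output layers, the authors apply a hyperbolic tangent nonlinearity and then a linear transformation of the result. The green dashed lines indicate that $x$ is also fed directly to the output layer, which is the $Wx$ term. Twelve years after this paper, the famous residual network (ResNet) appeared, and the core ideas of the two look strikingly alike.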

$y=b+Wx+U\tanh(d+Hx)$
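A minimal sketch of this formula for a batch of inputs, with shapes taken from the running example (|V| = 11, m = 2, h = 12, n = 7). The function name `nnlm_logits` and the random tensors are assumptions for illustration. Because each row of `x` is one input, the products are written as `x @ W.T` and so on. Setting `W` to zero removes the direct connection, so the difference between the two outputs is exactly the $Wx$ term.

```python
import torch


def nnlm_logits(x, b, W, U, d, H):
    """y = b + Wx + U tanh(d + Hx), applied row by row to x of shape (batch, (n-1)*m)."""
    hidden = torch.tanh(d + x @ H.T)   # (batch, h)
    return b + x @ W.T + hidden @ U.T  # (batch, |V|)


torch.manual_seed(0)
V, m, h, n = 11, 2, 12, 7
x = torch.randn(2, (n - 1) * m)        # stand-in for the flattened word features
b, d = torch.randn(V), torch.randn(h)
W = torch.randn(V, (n - 1) * m)
U = torch.randn(V, h)
H = torch.randn(h, (n - 1) * m)

y = nnlm_logits(x, b, W, U, d, H)
y_no_direct = nnlm_logits(x, b, torch.zeros_like(W), U, d, H)

print(y.shape)  # torch.Size([2, 11])
print(torch.allclose(y - y_no_direct, x @ W.T, atol=1e-5))
```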

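The authors then explain the parameters of these transformations.

Explanation of all parameters

h: number of hidden units
m: number of attributes per word (m = 2 in my example above)
W: weights from the word features layer to the output layer, a $|V|\times[(n-1)m]$ matrix
b: output layer biases, a vector with $|V|$ elements
d: hidden layer biases, a vector with h elements (one per hidden unit)
U: hidden-to-output weights, a $|V|\times h$ matrix
H: hidden layer weights, an $h\times[(n-1)m]$ matrix
C: word embedding matrix, a $|V|\times m$ matrix (11 x 2 in the example above)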
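As a quick sanity check on these shapes, the sketch below counts the free parameters for the configuration used later in this post (|V| = 11, m = 2, h = 12, n = 7). The function name is hypothetical. Summing the six sizes gives the same number as the equivalent closed form $|V|(1+nm+h)+h(1+(n-1)m)$.

```python
def nnlm_parameter_count(V: int, m: int, h: int, n: int) -> int:
    """Sum the sizes of b, d, W, U, H and C for the shapes listed above."""
    context = (n - 1) * m          # length of the flattened input x
    sizes = {
        'b': V,
        'd': h,
        'W': V * context,
        'U': V * h,
        'H': h * context,
        'C': V * m,
    }
    return sum(sizes.values())


V, m, h, n = 11, 2, 12, 7
total = nnlm_parameter_count(V, m, h, n)
closed_form = V * (1 + n * m + h) + h * (1 + (n - 1) * m)
print(total, closed_form)  # 453 453
```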

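The authors collect all of these parameters into the parameter set $\theta=(b,d,W,U,H,C)$ and update them iteratively by gradient steps with learning rate $\epsilon$ until a good solution is found. Of course, it is 2024 now, and PyTorch ships ready-made optimizers :)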
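For reference, the two formulas behind this step, as given in the paper: the output $y$ is turned into probabilities by a softmax, and each update is a gradient ascent step on the log-probability of the observed word $w_t$.

$\hat{P}(w_t\mid w_{t-1},\ldots,w_{t-n+1})=\frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$

$\theta\leftarrow\theta+\epsilon\,\frac{\partial\log\hat{P}(w_t\mid w_{t-1},\ldots,w_{t-n+1})}{\partial\theta}$

Minimizing the cross-entropy loss is the same as maximizing this log-probability. The reproduction in this post uses Adam rather than this plain update.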

3. A simple PyTorch reproduction of NNLM

import torch

class NNLM(torch.nn.Module):
    def __init__(self, V: int, m: int, h: int, n: int) -> None:
        """
        V: number of entries in the vocabulary.
        m: dimension of the vector that represents each word.
        h: number of hidden units.
        n: the model predicts the n-th word from the previous n-1 words (n is 1-based, as in the paper).
        """
        super().__init__()
        # forward() needs m and n to flatten the embedded context.
        self.m = m
        self.n = n
        self.b = torch.nn.Parameter(torch.randn(V))   # Output layer bias b, standard normal init.
        self.d = torch.nn.Parameter(torch.randn(h))   # Hidden layer bias d with h elements, standard normal init.
        self.W = torch.nn.Parameter(torch.rand(V, (n-1)*m))  # Word features layer to output layer weights, uniform init.
        self.U = torch.nn.Parameter(torch.rand(V, h))    # Hidden-to-output weights, uniform init.
        self.H = torch.nn.Parameter(torch.rand(h, (n-1)*m))  # Hidden layer weights, uniform init.
        self.C = torch.nn.Embedding(num_embeddings=V, embedding_dim=m)   # Word embedding matrix.
        # θ = (b, d, W, U, H, C)

    def forward(self, x: torch.LongTensor) -> torch.Tensor:
        # x = (C(w_{t-1}), C(w_{t-2}), ..., C(w_{t-n+1}))
        # The words are fed in sentence order here, which only permutes the columns of W and H.
        x = self.C(x)
        x = x.view(-1, (self.n-1)*self.m)
        # y = b + Wx + Utanh(d + Hx)
        hidden = torch.tanh(self.d + torch.matmul(x, self.H.T))
        y = self.b + torch.matmul(x, self.W.T) + torch.matmul(hidden, self.U.T)
        # Return raw scores; CrossEntropyLoss applies the softmax internally.
        return y

if __name__ == '__main__':
    vocabulary_index = {'the': 0, 'a': 1, 'cat': 2, 'bedroom': 3, 'dog': 4, 'room': 5, 'is': 6, 'running': 7, 'walking': 8, 'was': 9, 'in': 10}
    index_vocabulary = {idx: word for word, idx in vocabulary_index.items()}

    # Training sentences: 'the cat is walking in the bedroom', 'a dog was running in a room'.
    sentences_x = [
        ['the', 'cat', 'is', 'walking', 'in', 'the'],
        ['a', 'dog', 'was', 'running', 'in', 'a']
    ]
    sentences_target = ['bedroom', 'room']

    model = NNLM(V=len(vocabulary_index), m=2, h=12, n=7)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Encode contexts and targets as vocabulary indices:
    # x = [[0, 2, 6, 8, 10, 0], [1, 4, 9, 7, 10, 1]], target = [3, 5].
    x = torch.LongTensor([[vocabulary_index[w] for w in s] for s in sentences_x])
    target = torch.LongTensor([vocabulary_index[w] for w in sentences_target])

    # Test contexts built from the example in Section 1.1 of
    # "A Neural Probabilistic Language Model (2003)".
    unknown = [
        ['the', 'cat', 'is', 'running', 'in', 'a'],
        ['a', 'dog', 'is', 'walking', 'in', 'a'],
        ['the', 'dog', 'was', 'walking', 'in', 'the']
    ]
    unknown = torch.LongTensor([[vocabulary_index[w] for w in s] for s in unknown])

    def train(epochs: int) -> None:
        """Run `epochs` full-batch updates and print the loss every 25 steps."""
        for step in range(epochs):
            optimizer.zero_grad()
            y = model(x)
            loss = criterion(y, target)
            if (step + 1) % 25 == 0:
                print(f'[{step + 1} / {epochs}]\tloss = {loss.item():.7f}')
            loss.backward()
            optimizer.step()

    def predict(contexts: torch.Tensor) -> None:
        """Print the raw scores and the most likely next word for each context."""
        with torch.no_grad():
            y = model(contexts)
        print('-' * 50)
        print(y)
        print('-' * 50)
        best = y.argmax(dim=1)
        print(best, ' -> ', [index_vocabulary[int(idx)] for idx in best])

    # Few training steps: predictions on the unseen contexts are most likely wrong.
    train(50)
    predict(unknown)

    input('\nPress Enter to continue...')

    # Many more training steps on the same model: predictions are mostly correct.
    train(5000)
    predict(unknown)

After 50 training steps, the predictions are wrong ❌

the cat is running in a is

a dog is walking in a is

the dog was walking in the walking

After 5000 more training steps, the predictions are correct ✅

the cat is running in a bedroom

a dog is walking in a room

the dog was walking in the bedroom
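A note on why the model above returns the raw scores $y$ with the softmax commented out: `CrossEntropyLoss` already combines a log-softmax with the negative log-likelihood, and taking the argmax of the scores picks the same word as taking the argmax of the probabilities. The sketch below checks both claims on random stand-in scores, not on the trained model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(2, 11)          # stand-in for the model output y
target = torch.tensor([3, 5])        # indices of "bedroom" and "room"

# Built-in loss on raw scores.
loss_builtin = F.cross_entropy(logits, target)

# The same quantity by hand: mean of -log softmax(y)[target].
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(2), target].mean()

# Softmax is monotonic, so it does not change which word scores highest.
probs = F.softmax(logits, dim=1)

print(torch.allclose(loss_builtin, loss_manual))
print(torch.equal(logits.argmax(dim=1), probs.argmax(dim=1)))
```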

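4. Training and predicting on a simplified Brown Corpus

The authors trained the model on the Brown Corpus using CPUs, which was very time-consuming, so I simplified the dataset. If your machine has a GPU, training and inference will also be faster. You need at least 4 GB of GPU memory; otherwise you may hit a CUDA out-of-memory error, in which case set device = 'cpu'.

Click here to download the project archive [400 KB]

In the simplified dataset, every sentence has between 10 and 15 words. We assume the first 7 words are known and predict the 8th, i.e. n = 8. All punctuation is stripped and all words are lowercased, which yields 19355 distinct words.

Each element of the word set is assigned a number to form the vocabulary index dictionary, and the data is split so that the training set is 99 times the size of the test set. We train on 99% of the data and then use the model to predict the remaining 1%, which gives the result shown in the figure: inside the brackets, green is the original word and yellow is the predicted word.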
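The preprocessing described above might look like the following sketch. It is not the code from the project archive: all function names are hypothetical, the sentence is a toy stand-in for the corpus, and sliding a window of n words over each sentence is an assumption about how the samples are cut.

```python
import random
import string


def tokenize(sentence: str) -> list:
    """Lowercase the sentence, strip ASCII punctuation, split on whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return sentence.lower().translate(table).split()


def build_vocab(token_lists) -> dict:
    """Assign one index to every distinct word."""
    distinct = sorted({w for tokens in token_lists for w in tokens})
    return {w: i for i, w in enumerate(distinct)}


def make_windows(tokens, vocab, n: int) -> list:
    """Return (context of n-1 indices, target index) pairs from one sentence."""
    ids = [vocab[w] for w in tokens]
    return [(ids[i:i + n - 1], ids[i + n - 1]) for i in range(len(ids) - n + 1)]


def split(samples, test_ratio: float = 0.01, seed: int = 0):
    """Shuffle and split into (train, test); at least one sample goes to test."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_test = max(1, round(len(samples) * test_ratio))
    return samples[n_test:], samples[:n_test]


corpus = ['The cat is walking in the bedroom, and the dog is running.']  # toy data
token_lists = [tokenize(s) for s in corpus]
vocab = build_vocab(token_lists)
samples = [pair for tokens in token_lists for pair in make_windows(tokens, vocab, n=8)]
train_set, test_set = split(samples)
print(len(vocab), len(samples), len(train_set), len(test_set))
```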

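The results do not look great, but the predicted words tend to share the part of speech of the ground truth

In summary, this is a paper with an excellent idea, even though the author's final training results were not very good.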

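5. Some thoughts

Understanding word embeddings through matrices

When the one-hot encoding matrix is very large, the word embedding matrix acts as a dimensionality reduction: each word is represented by m numbers instead of $|V|$, which greatly reduces both time and space costs.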
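One way to see the link between the two representations: multiplying a one-hot row vector by $C$ selects a single row of $C$, so an embedding lookup gives the same result without ever building the $|V|$-wide vector. A small sketch with a random stand-in for $C$:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, m = 11, 2
C = torch.randn(V, m)                  # stand-in for a word embedding matrix
idx = torch.tensor([2, 4])             # "cat" and "dog"

one_hot = F.one_hot(idx, num_classes=V).float()   # shape (2, 11)
via_matmul = one_hot @ C                           # shape (2, 2)
via_lookup = C[idx]                                # shape (2, 2), no one-hot needed

print(one_hot.shape, via_lookup.shape)
print(torch.allclose(via_matmul, via_lookup))
```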

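A word embedding matrix looks chaotic, but there is order in it. Take any one attribute dimension: across all words, the values along that dimension may follow some statistical distribution. Going back to the earlier analogy of Xiao Hong, Xiao Ming, and Xiao Wang with their height, weight, and so on, the height dimension alone follows the normal distribution of the population. So a trained word embedding matrix may look like a pile of random numbers, yet it reflects some regularity among words. That regularity should hold across different downstream tasks, which means a good word embedding matrix can be reused in different NLP tasks.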